SWE-4002

Unbounded Growth (No Data Archiving)

Cantorian Technical Debt Magnitude: ℵ₁ (Systemic)

Description

Allowing tables to grow indefinitely without archival or cleanup strategies. Common with logs, audit records, and historical data where the assumption is "we might need it someday." Results in ever-growing tables that degrade performance, complicate maintenance, and eventually threaten system stability.

Illustrative Cantor Point

The Cantor Point occurs during initial design when deciding "we'll keep everything forever" or "we'll implement cleanup later." This deferral of data lifecycle planning creates an ever-growing burden that becomes harder to address as data accumulates.

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • SaaS Platform: Audit log table grew to 500 million rows over three years, causing queries to time out
  • E-commerce Site: Order history table became so large that backups could not complete within the maintenance window
  • Monitoring System: Metrics table grew until INSERT performance degraded and current data was lost

Common Consequences & Impacts

Technical Impacts

  • Query performance degradation
  • Backup/restore times exceed windows
  • Index maintenance becomes prohibitive
  • Storage cost explosion

Human/Ethical Impacts

  • User experience degradation
  • Engineering burnout from operational issues
  • Data hoarding without purpose
  • Environmental impact of unnecessary storage

Business Impacts

  • Increased operational costs
  • Extended maintenance windows
  • Feature development slowed
  • Risk of data loss from failed backups

Recovery Difficulty & Escalation

Recovery Difficulty: 5
Escalation: 3

ADI Principles & Axioms Violated

  • Principle of Invisible Decay: Growth is gradual until it's catastrophic
  • Principle of Fragile Stability: Works until it suddenly doesn't
  • Principle of Consequential Stewardship: Keeping all data isn't always ethical

Detection / 60-Second Audit

-- PostgreSQL
-- Find largest tables
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size,
    pg_size_pretty(pg_indexes_size(schemaname||'.'||tablename)) as indexes_size,
    (SELECT COUNT(*) FROM information_schema.columns 
     WHERE table_schema = schemaname AND table_name = tablename) as column_count
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Rough growth estimate (assumes the current size accumulated over ~30 days)
WITH table_sizes AS (
    SELECT 
        current_date as measurement_date,
        tablename,
        pg_total_relation_size(schemaname||'.'||tablename) as size_bytes
    FROM pg_tables
    WHERE schemaname = 'public'
)
SELECT 
    tablename,
    pg_size_pretty(size_bytes) as current_size,
    pg_size_pretty(size_bytes::bigint / 30) as daily_growth_estimate
FROM table_sizes
WHERE size_bytes > 1073741824  -- Tables over 1GB
ORDER BY size_bytes DESC;

-- Find tables with old data
SELECT 
    table_name,
    column_name,
    'SELECT MIN(' || column_name || '), MAX(' || column_name || 
    ') FROM ' || table_name || ';' as check_query
FROM information_schema.columns
WHERE table_schema = 'public'
AND data_type IN ('timestamp without time zone', 'timestamp with time zone', 'date')
AND (column_name LIKE '%created%' OR column_name LIKE '%date%');
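
-- Tip (psql): replace the trailing ';' above with \gexec to execute each generated check_query automatically.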

-- MySQL
-- Find largest tables
SELECT 
    TABLE_SCHEMA,
    TABLE_NAME,
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2) AS total_size_mb,
    ROUND((DATA_LENGTH / 1024 / 1024), 2) AS data_size_mb,
    ROUND((INDEX_LENGTH / 1024 / 1024), 2) AS index_size_mb,
    TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 20;

-- Rough growth estimate (assumes ~30 days of growth; accurate rates require periodic snapshots)
SELECT 
    TABLE_NAME,
    TABLE_ROWS,
    AUTO_INCREMENT,
    ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) as size_mb,
    ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 30, 2) as daily_growth_estimate_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
AND (DATA_LENGTH + INDEX_LENGTH) > 1073741824
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;

-- Generate queries to check date ranges
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    CONCAT('SELECT MIN(', COLUMN_NAME, '), MAX(', COLUMN_NAME, 
           ') FROM ', TABLE_NAME, ';') as check_query
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND DATA_TYPE IN ('timestamp', 'datetime', 'date')
AND (COLUMN_NAME LIKE '%created%' OR COLUMN_NAME LIKE '%date%');

-- SQL Server (T-SQL)
-- Find largest tables
SELECT TOP 20
    s.name AS SchemaName,
    t.name AS TableName,
    SUM(CASE WHEN i.index_id IN (0, 1) THEN p.rows ELSE 0 END) AS RowCounts,  -- count heap/clustered rows only, not every index
    SUM(a.total_pages) * 8 / 1024 AS TotalSpaceMB,
    SUM(a.used_pages) * 8 / 1024 AS UsedSpaceMB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 / 1024 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY s.name, t.name
ORDER BY TotalSpaceMB DESC;

-- Check for old data in tables
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    'SELECT MIN(' + COLUMN_NAME + ') AS OldestRecord, MAX(' + 
    COLUMN_NAME + ') AS NewestRecord FROM ' + 
    TABLE_SCHEMA + '.' + TABLE_NAME + ';' as check_query
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
AND DATA_TYPE IN ('datetime', 'datetime2', 'date')
AND (COLUMN_NAME LIKE '%created%' OR COLUMN_NAME LIKE '%date%');
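
The growth-rate queries above simply divide current size by 30, which is only a rough proxy. A more reliable approach is to record periodic size snapshots and compare them over time. A minimal PostgreSQL sketch, assuming a table_size_history table and a daily scheduled INSERT (both illustrative, not standard tooling):

-- Snapshot table: run the INSERT once per day via your scheduler
CREATE TABLE IF NOT EXISTS table_size_history (
    measured_at DATE   NOT NULL DEFAULT CURRENT_DATE,
    schemaname  TEXT   NOT NULL,
    tablename   TEXT   NOT NULL,
    size_bytes  BIGINT NOT NULL,
    PRIMARY KEY (measured_at, schemaname, tablename)
);

INSERT INTO table_size_history (schemaname, tablename, size_bytes)
SELECT schemaname, tablename, pg_total_relation_size(schemaname||'.'||tablename)
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ON CONFLICT DO NOTHING;

-- Actual growth over the snapshot window
SELECT
    tablename,
    pg_size_pretty(MAX(size_bytes) - MIN(size_bytes)) AS growth,
    pg_size_pretty((MAX(size_bytes) - MIN(size_bytes))
        / GREATEST(MAX(measured_at) - MIN(measured_at), 1)) AS daily_growth
FROM table_size_history
WHERE measured_at >= CURRENT_DATE - 30
GROUP BY tablename
ORDER BY MAX(size_bytes) - MIN(size_bytes) DESC;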

Prevention & Mitigation Best Practices

  1. Define Retention Early: Establish data lifecycle policies from day one
  2. Partition by Time: Use date-based partitions for easy dropping
  3. Automated Archival: Regular jobs to move old data to cold storage
  4. Cascading Deletes: Ensure related data is cleaned up together
  5. Batch Deletions: Delete in small chunks to avoid lock issues
  6. Summary Tables: Keep aggregates instead of raw data (see the rollup sketch after this list)
  7. Compliance Alignment: Match retention to legal requirements
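
Item 6 in practice: instead of keeping raw rows forever, roll them up into an aggregate table and then delete or archive the raw data. A minimal PostgreSQL sketch; the page_views and page_views_daily tables are hypothetical:

-- Assumed raw event table that would otherwise grow without bound:
-- CREATE TABLE page_views (viewed_at TIMESTAMP, page_id BIGINT, user_id BIGINT);

CREATE TABLE IF NOT EXISTS page_views_daily (
    view_date  DATE   NOT NULL,
    page_id    BIGINT NOT NULL,
    view_count BIGINT NOT NULL,
    PRIMARY KEY (view_date, page_id)
);

-- Roll up yesterday's raw rows into the summary table
INSERT INTO page_views_daily (view_date, page_id, view_count)
SELECT viewed_at::date, page_id, COUNT(*)
FROM page_views
WHERE viewed_at >= CURRENT_DATE - 1
  AND viewed_at <  CURRENT_DATE
GROUP BY viewed_at::date, page_id
ON CONFLICT (view_date, page_id) DO UPDATE SET view_count = EXCLUDED.view_count;

-- Raw rows older than the retention window can then be deleted or archived
DELETE FROM page_views WHERE viewed_at < CURRENT_DATE - 90;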

Real World Examples

-- Problem: Audit logs growing forever
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action VARCHAR(100),
    details JSONB
);

-- After 3 years: 500M rows, 800GB, queries timeout
-- Backups take 6+ hours, blocking maintenance windows

-- Problem: User sessions never cleaned up
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    last_activity TIMESTAMP,
    data JSONB
);

-- Millions of dead sessions from years ago
-- Active session queries scan through ancient data

-- Solution: Time-based partitions with automated cleanup
CREATE TABLE audit_logs (
    id BIGSERIAL,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action VARCHAR(100),
    details JSONB,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE audit_logs_2024_01 PARTITION OF audit_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Automated cleanup function
CREATE OR REPLACE FUNCTION cleanup_old_audit_logs()
RETURNS void AS $$
DECLARE
    cutoff_date DATE;
    partition_name TEXT;
BEGIN
    cutoff_date := CURRENT_DATE - INTERVAL '90 days';
    
    FOR partition_name IN 
        SELECT tablename 
        FROM pg_tables 
        WHERE schemaname = 'public' 
        AND tablename ~ '^audit_logs_\d{4}_\d{2}$'
        AND tablename < 'audit_logs_' || TO_CHAR(cutoff_date, 'YYYY_MM')
    LOOP
        -- Archive to S3/cold storage first if needed
        EXECUTE format('DROP TABLE %I CASCADE', partition_name);
        RAISE NOTICE 'Dropped old partition: %', partition_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Schedule weekly: SELECT cleanup_old_audit_logs();
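-- (Scheduling itself is outside core PostgreSQL; one option, assuming the pg_cron
--  extension is installed:
--  SELECT cron.schedule('audit-log-cleanup', '0 3 * * 0', 'SELECT cleanup_old_audit_logs()');)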

-- For session cleanup, emulate a TTL: index last_activity and delete stale rows on a schedule
CREATE INDEX idx_sessions_last_activity ON user_sessions(last_activity);

-- Regular cleanup job
DELETE FROM user_sessions
WHERE session_id IN (
    SELECT session_id
    FROM user_sessions
    WHERE last_activity < NOW() - INTERVAL '30 days'
    LIMIT 10000  -- batch to avoid long-held locks (PostgreSQL DELETE has no LIMIT clause)
);
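
The cleanup function above notes that partitions can be archived to cold storage before being dropped. A hedged sketch, assuming a local audit_logs_archive table as the target (in practice this is often an export to object storage instead):

-- Copy a partition's rows into an archive table before dropping it
-- (audit_logs_archive is hypothetical; LIKE copies the audit_logs column layout)
CREATE TABLE IF NOT EXISTS audit_logs_archive (LIKE audit_logs INCLUDING DEFAULTS);

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs_2024_01;

-- Or export to a file for upload to cold storage
-- (server-side COPY needs file-write privileges; psql's \copy is a client-side alternative)
-- COPY audit_logs_2024_01 TO '/var/backups/audit_logs_2024_01.csv' WITH (FORMAT csv, HEADER);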

AI Coding Guidance/Prompt

Prompt: "When designing tables that accumulate data over time:"
Rules:
  - Require retention policy definition
  - Suggest partitioning for time-series data
  - Flag tables without archival strategy
  - Recommend cleanup job creation
  
Example:
  -- Bad: Infinite growth table
  CREATE TABLE audit_logs (
    id BIGINT PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action TEXT,
    details JSONB
  );
  -- No partition, no retention policy!
  
  -- Good: Partitioned with retention
  CREATE TABLE audit_logs (
    id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action TEXT,
    details JSONB
  ) PARTITION BY RANGE (created_at);
  
  -- Monthly partitions
  CREATE TABLE audit_logs_2024_01 PARTITION OF audit_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
  
  -- Automated cleanup job
  CREATE OR REPLACE FUNCTION drop_old_audit_partitions()
  RETURNS void AS $$
  DECLARE
    partition_name TEXT;
  BEGIN
    FOR partition_name IN 
      SELECT tablename 
      FROM pg_tables 
      WHERE tablename LIKE 'audit_logs_%'
      AND tablename < 'audit_logs_' || 
          TO_CHAR(CURRENT_DATE - INTERVAL '6 months', 'YYYY_MM')
    LOOP
      EXECUTE format('DROP TABLE IF EXISTS %I', partition_name);
      RAISE NOTICE 'Dropped partition: %', partition_name;
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;
  
  -- Schedule monthly: SELECT drop_old_audit_partitions();

Relevant Keywords

unbounded growth, data archiving, retention policy. Symptoms: slow queries, backup windows exceeded, degraded INSERT performance, runaway storage costs. Preventive: retention policies, time-based partitioning, automated archival, batched cleanup jobs. Tech stack: PostgreSQL, MySQL, SQL Server, Oracle. Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts: some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework for understanding the far-reaching consequences of schema and data decisions that impact security and reliability, and accumulate into ethical debt that affects real human lives. Built on research into real-world incidents, ADI defines seven principles for recognizing when these decisions are being made and how to make them better, avoiding future technical debt and the potentially catastrophic "butterfly effects" of small choices that ripple into chaotic technical and ethical debt.