SWE-4002

Unbounded Growth (No Data Archiving)

Cantorian Technical Debt Magnitude: ℵ₁ (Systemic)

Description

Allowing tables to grow indefinitely without archival or cleanup strategies. Common with logs, audit records, and historical data where the assumption is "we might need it someday." Results in ever-growing tables that degrade performance, complicate maintenance, and eventually threaten system stability.

Illustrative Cantor Point

The Cantor Point occurs during initial design when deciding "we'll keep everything forever" or "we'll implement cleanup later." This deferral of data lifecycle planning creates an ever-growing burden that becomes harder to address as data accumulates.

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • SaaS Platform: Audit log table grew to 500 million rows over three years, causing queries to time out
  • E-commerce Site: Order history table became so large that backups could not complete within the maintenance window
  • Monitoring System: Metrics table grew until INSERT performance degraded and current data was lost

Common Consequences & Impacts

Technical Impacts

  • Query performance degradation
  • Backup/restore times exceed windows
  • Index maintenance becomes prohibitive
  • Storage cost explosion

Human/Ethical Impacts

  • User experience degradation
  • Engineering burnout from operational issues
  • Data hoarding without purpose
  • Environmental impact of unnecessary storage

Business Impacts

  • Increased operational costs
  • Extended maintenance windows
  • Feature development slowed
  • Risk of data loss from failed backups

Recovery Difficulty & Escalation

Recovery Difficulty: 5
Escalation: 3

ADI Principles & Axioms Violated

  • Principle of Invisible Decay: Growth is gradual until it's catastrophic
  • Principle of Fragile Stability: Works until it suddenly doesn't
  • Principle of Consequential Stewardship: Keeping all data isn't always ethical

Detection / 60-Second Audit

-- PostgreSQL
-- Find largest tables
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size,
    pg_size_pretty(pg_indexes_size(schemaname||'.'||tablename)) as indexes_size,
    (SELECT COUNT(*) FROM information_schema.columns 
     WHERE table_schema = schemaname AND table_name = tablename) as column_count
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Rough growth estimate (assumes the current size accumulated over ~30 days)
WITH table_sizes AS (
    SELECT 
        current_date as measurement_date,
        tablename,
        pg_total_relation_size(schemaname||'.'||tablename) as size_bytes
    FROM pg_tables
    WHERE schemaname = 'public'
)
SELECT 
    tablename,
    pg_size_pretty(size_bytes) as current_size,
    pg_size_pretty(size_bytes::bigint / 30) as daily_growth_estimate
FROM table_sizes
WHERE size_bytes > 1073741824  -- Tables over 1GB
ORDER BY size_bytes DESC;

-- Find tables with old data
SELECT 
    table_name,
    column_name,
    'SELECT MIN(' || column_name || '), MAX(' || column_name || 
    ') FROM ' || table_name || ';' as check_query
FROM information_schema.columns
WHERE table_schema = 'public'
AND data_type IN ('timestamp without time zone', 'timestamp with time zone', 'date')
AND (column_name LIKE '%created%' OR column_name LIKE '%date%');
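
-- Tip (psql): replace the trailing ';' above with \gexec to execute each generated check_query automatically.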

-- MySQL
-- Find largest tables
SELECT 
    TABLE_SCHEMA,
    TABLE_NAME,
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2) AS total_size_mb,
    ROUND((DATA_LENGTH / 1024 / 1024), 2) AS data_size_mb,
    ROUND((INDEX_LENGTH / 1024 / 1024), 2) AS index_size_mb,
    TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 20;

-- Rough growth estimate (assumes ~30 days of growth; accurate rates require periodic snapshots)
SELECT 
    TABLE_NAME,
    TABLE_ROWS,
    AUTO_INCREMENT,
    ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) as size_mb,
    ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 30, 2) as daily_growth_estimate_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
AND (DATA_LENGTH + INDEX_LENGTH) > 1073741824
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;

-- Generate queries to check date ranges
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    CONCAT('SELECT MIN(', COLUMN_NAME, '), MAX(', COLUMN_NAME, 
           ') FROM ', TABLE_NAME, ';') as check_query
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND DATA_TYPE IN ('timestamp', 'datetime', 'date')
AND (COLUMN_NAME LIKE '%created%' OR COLUMN_NAME LIKE '%date%');

-- SQL Server (T-SQL)
-- Find largest tables
SELECT TOP 20
    s.name AS SchemaName,
    t.name AS TableName,
    SUM(CASE WHEN i.index_id IN (0, 1) THEN p.rows ELSE 0 END) AS RowCounts,  -- count heap/clustered rows only, not every index
    SUM(a.total_pages) * 8 / 1024 AS TotalSpaceMB,
    SUM(a.used_pages) * 8 / 1024 AS UsedSpaceMB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 / 1024 AS UnusedSpaceMB
FROM sys.tables t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY s.name, t.name
ORDER BY TotalSpaceMB DESC;

-- Check for old data in tables
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    'SELECT MIN(' + COLUMN_NAME + ') AS OldestRecord, MAX(' + 
    COLUMN_NAME + ') AS NewestRecord FROM ' + 
    TABLE_SCHEMA + '.' + TABLE_NAME + ';' as check_query
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
AND DATA_TYPE IN ('datetime', 'datetime2', 'date')
AND (COLUMN_NAME LIKE '%created%' OR COLUMN_NAME LIKE '%date%');
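
The growth-rate queries above simply divide current size by 30, which is only a rough proxy. A more reliable approach is to record periodic size snapshots and compare them over time. A minimal PostgreSQL sketch, assuming a table_size_history table and a daily scheduled INSERT (both illustrative, not standard tooling):

-- Snapshot table: run the INSERT once per day via your scheduler
CREATE TABLE IF NOT EXISTS table_size_history (
    measured_at DATE   NOT NULL DEFAULT CURRENT_DATE,
    schemaname  TEXT   NOT NULL,
    tablename   TEXT   NOT NULL,
    size_bytes  BIGINT NOT NULL,
    PRIMARY KEY (measured_at, schemaname, tablename)
);

INSERT INTO table_size_history (schemaname, tablename, size_bytes)
SELECT schemaname, tablename, pg_total_relation_size(schemaname||'.'||tablename)
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ON CONFLICT DO NOTHING;

-- Actual growth over the snapshot window
SELECT
    tablename,
    pg_size_pretty(MAX(size_bytes) - MIN(size_bytes)) AS growth,
    pg_size_pretty((MAX(size_bytes) - MIN(size_bytes))
        / GREATEST(MAX(measured_at) - MIN(measured_at), 1)) AS daily_growth
FROM table_size_history
WHERE measured_at >= CURRENT_DATE - 30
GROUP BY tablename
ORDER BY MAX(size_bytes) - MIN(size_bytes) DESC;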

Prevention & Mitigation Best Practices

  1. Define Retention Early: Establish data lifecycle policies from day one
  2. Partition by Time: Use date-based partitions for easy dropping
  3. Automated Archival: Regular jobs to move old data to cold storage
  4. Cascading Deletes: Ensure related data is cleaned up together
  5. Batch Deletions: Delete in small chunks to avoid lock issues
  6. Summary Tables: Keep aggregates instead of raw data (see the rollup sketch after this list)
  7. Compliance Alignment: Match retention to legal requirements
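
Item 6 in practice: instead of keeping raw rows forever, roll them up into an aggregate table and then delete or archive the raw data. A minimal PostgreSQL sketch; the page_views and page_views_daily tables are hypothetical:

-- Assumed raw event table that would otherwise grow without bound:
-- CREATE TABLE page_views (viewed_at TIMESTAMP, page_id BIGINT, user_id BIGINT);

CREATE TABLE IF NOT EXISTS page_views_daily (
    view_date  DATE   NOT NULL,
    page_id    BIGINT NOT NULL,
    view_count BIGINT NOT NULL,
    PRIMARY KEY (view_date, page_id)
);

-- Roll up yesterday's raw rows into the summary table
INSERT INTO page_views_daily (view_date, page_id, view_count)
SELECT viewed_at::date, page_id, COUNT(*)
FROM page_views
WHERE viewed_at >= CURRENT_DATE - 1
  AND viewed_at <  CURRENT_DATE
GROUP BY viewed_at::date, page_id
ON CONFLICT (view_date, page_id) DO UPDATE SET view_count = EXCLUDED.view_count;

-- Raw rows older than the retention window can then be deleted or archived
DELETE FROM page_views WHERE viewed_at < CURRENT_DATE - 90;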

Real World Examples

-- Problem: Audit logs growing forever
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action VARCHAR(100),
    details JSONB
);

-- After 3 years: 500M rows, 800GB, queries timeout
-- Backups take 6+ hours, blocking maintenance windows

-- Problem: User sessions never cleaned up
CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    last_activity TIMESTAMP,
    data JSONB
);

-- Millions of dead sessions from years ago
-- Active session queries scan through ancient data

-- Solution: Time-based partitions with automated cleanup
CREATE TABLE audit_logs (
    id BIGSERIAL,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action VARCHAR(100),
    details JSONB,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE audit_logs_2024_01 PARTITION OF audit_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Automated cleanup function
CREATE OR REPLACE FUNCTION cleanup_old_audit_logs()
RETURNS void AS $$
DECLARE
    cutoff_date DATE;
    partition_name TEXT;
BEGIN
    cutoff_date := CURRENT_DATE - INTERVAL '90 days';
    
    FOR partition_name IN 
        SELECT tablename 
        FROM pg_tables 
        WHERE schemaname = 'public' 
        AND tablename ~ '^audit_logs_\d{4}_\d{2}$'
        AND tablename < 'audit_logs_' || TO_CHAR(cutoff_date, 'YYYY_MM')
    LOOP
        -- Archive to S3/cold storage first if needed
        EXECUTE format('DROP TABLE %I CASCADE', partition_name);
        RAISE NOTICE 'Dropped old partition: %', partition_name;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Schedule weekly: SELECT cleanup_old_audit_logs();
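-- (Scheduling itself is outside core PostgreSQL; one option, assuming the pg_cron
--  extension is installed:
--  SELECT cron.schedule('audit-log-cleanup', '0 3 * * 0', 'SELECT cleanup_old_audit_logs()');)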

-- For session cleanup, emulate a TTL: index last_activity and delete stale rows on a schedule
CREATE INDEX idx_sessions_last_activity ON user_sessions(last_activity);

-- Regular cleanup job
DELETE FROM user_sessions
WHERE session_id IN (
    SELECT session_id
    FROM user_sessions
    WHERE last_activity < NOW() - INTERVAL '30 days'
    LIMIT 10000  -- batch to avoid long-held locks (PostgreSQL DELETE has no LIMIT clause)
);
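
The cleanup function above notes that partitions can be archived to cold storage before being dropped. A hedged sketch, assuming a local audit_logs_archive table as the target (in practice this is often an export to object storage instead):

-- Copy a partition's rows into an archive table before dropping it
-- (audit_logs_archive is hypothetical; LIKE copies the audit_logs column layout)
CREATE TABLE IF NOT EXISTS audit_logs_archive (LIKE audit_logs INCLUDING DEFAULTS);

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs_2024_01;

-- Or export to a file for upload to cold storage
-- (server-side COPY needs file-write privileges; psql's \copy is a client-side alternative)
-- COPY audit_logs_2024_01 TO '/var/backups/audit_logs_2024_01.csv' WITH (FORMAT csv, HEADER);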

AI Coding Guidance/Prompt

Prompt: "When designing tables that accumulate data over time:"
Rules:
  - Require retention policy definition
  - Suggest partitioning for time-series data
  - Flag tables without archival strategy
  - Recommend cleanup job creation
  
Example:
  -- Bad: Infinite growth table
  CREATE TABLE audit_logs (
    id BIGINT PRIMARY KEY,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action TEXT,
    details JSONB
  );
  -- No partition, no retention policy!
  
  -- Good: Partitioned with retention
  CREATE TABLE audit_logs (
    id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    user_id BIGINT,
    action TEXT,
    details JSONB
  ) PARTITION BY RANGE (created_at);
  
  -- Monthly partitions
  CREATE TABLE audit_logs_2024_01 PARTITION OF audit_logs
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
  
  -- Automated cleanup job
  CREATE OR REPLACE FUNCTION drop_old_audit_partitions()
  RETURNS void AS $$
  DECLARE
    partition_name TEXT;
  BEGIN
    FOR partition_name IN 
      SELECT tablename 
      FROM pg_tables 
      WHERE tablename LIKE 'audit_logs_%'
      AND tablename < 'audit_logs_' || 
          TO_CHAR(CURRENT_DATE - INTERVAL '6 months', 'YYYY_MM')
    LOOP
      EXECUTE format('DROP TABLE IF EXISTS %I', partition_name);
      RAISE NOTICE 'Dropped partition: %', partition_name;
    END LOOP;
  END;
  $$ LANGUAGE plpgsql;
  
  -- Schedule monthly: SELECT drop_old_audit_partitions();

Relevant Keywords

unbounded growth, data archiving, retention policy. Symptoms: slow queries, backup windows exceeded, degraded INSERT performance, runaway storage costs. Preventive: retention policies, time-based partitioning, automated archival, batched cleanup jobs. Tech stack: PostgreSQL, MySQL, SQL Server, Oracle. Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts: some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework for understanding the far-reaching consequences of schema and data decisions that impact security and reliability, and accumulate into ethical debt that affects real human lives. Built on research into real-world incidents, ADI defines seven principles for recognizing when these decisions are being made and how to make them better, avoiding future technical debt and the potentially catastrophic "butterfly effects" of small choices that ripple into chaotic technical and ethical debt.