SWE-5005

Migration Backup and Recovery Weakness

Cantorian Technical Debt Magnitude: ℵ₁ (Systemic)

Description

Migration processes executed without verified backup and recovery capabilities. This includes untested backups, missing recovery procedures, lack of backup integrity verification, or proceeding with irreversible operations before confirming data recoverability. Often discovered only after catastrophic data loss has occurred.
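
As a concrete illustration of the missing control, here is a minimal pre-migration verification sketch in Python. It assumes a PostgreSQL custom-format dump and a SHA-256 checksum recorded at backup time; the file path and helper name are illustrative, not a prescribed API.

```python
import hashlib
import subprocess
from pathlib import Path

def verify_backup(dump_path: str, expected_sha256: str) -> bool:
    """Pre-migration gate: checksum integrity + archive readability."""
    path = Path(dump_path)
    if not path.exists():
        return False
    # 1. Integrity: hash in chunks and compare against the checksum
    #    recorded when the backup was taken.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        return False
    # 2. Readability: pg_restore --list parses the archive's table of
    #    contents without restoring anything; a corrupt dump fails here.
    result = subprocess.run(
        ["pg_restore", "--list", str(path)],
        capture_output=True, text=True,
    )
    return result.returncode == 0

# Usage: refuse to start the migration unless both checks pass.
# if not verify_backup("/backups/prod.dump", recorded_checksum):
#     raise SystemExit("Backup unverified - do NOT proceed")
```

This still falls short of a full restore test; it only rules out the most common silent failures (missing, truncated, or corrupt dump files).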

Illustrative Cantor Point

The Cantor Point occurs during migration planning: choosing between thorough backup verification that delays the migration and proceeding on assumptions about backup viability. The decision to skip backup testing creates a divergent path where migration failures become permanent data loss rather than temporary setbacks.

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • Myspace (2019): Lost an estimated 50 million songs and 12 years of user content during a migration that relied on untested backups [See: Cases-By-Year/2019 Data Integrity Failures.md#3]
  • GitLab (2017): Nearly lost its entire production database; saved only by an engineer's ad-hoc copy taken hours earlier
  • Toy Story 2 (1998): Pixar nearly lost the entire film; saved by a copy on a technical director's home machine

Common Consequences & Impacts

Technical Impacts

  • Permanent data loss
  • Inability to roll back
  • Corrupted data states
  • Lost system configurations

Human/Ethical Impacts

  • Lost memories/content
  • Creative work destruction
  • Historical record gaps
  • Trust violation

Business Impacts

  • Irreplaceable content loss
  • Business continuity failure
  • Legal liability
  • Reputation destruction

Recovery Difficulty & Escalation

  • Recovery Difficulty: 9.5 / 10
  • Escalation: 10 / 10

ADI Principles & Axioms Violated

  • Principle of Temporal Integrity: Past states must remain recoverable
  • Principle of Verified Trust: Never trust untested backups

Detection / 60-Second Audit

```sql
-- PostgreSQL
-- Check for backup tracking
SELECT 
    EXISTS (
        SELECT 1 FROM information_schema.tables
        WHERE table_name IN ('backup_history', 'backup_tests', 'recovery_tests')
    ) as has_backup_tracking;

-- Identify tables without recent backups
WITH backup_coverage AS (
    SELECT 
        schemaname,
        relname AS tablename,
        pg_total_relation_size(relid) AS size_bytes,
        pg_size_pretty(pg_total_relation_size(relid)) AS table_size,
        COALESCE(last_analyze, last_autoanalyze) AS last_analyzed
    FROM pg_stat_user_tables
)
SELECT 
    schemaname,
    tablename,
    table_size,
    CASE 
        WHEN last_analyzed IS NULL THEN 'NEVER ANALYZED - Backup status unknown'
        WHEN last_analyzed < NOW() - INTERVAL '30 days' THEN 'Stale statistics - verify backup'
        ELSE 'Recently analyzed'
    END as backup_concern
FROM backup_coverage
WHERE schemaname = 'public'
ORDER BY size_bytes DESC;

-- Check for migration planning tables
SELECT 
    table_name,
    column_name,
    CASE 
        WHEN column_name LIKE '%backup%' THEN 'Has backup reference'
        WHEN column_name LIKE '%rollback%' THEN 'Has rollback planning'
        WHEN column_name LIKE '%verified%' THEN 'Has verification tracking'
        ELSE 'Check column purpose'
    END as migration_safety_feature
FROM information_schema.columns
WHERE table_name LIKE '%migration%'
AND (column_name LIKE '%backup%' 
     OR column_name LIKE '%rollback%' 
     OR column_name LIKE '%verif%');
```
```sql
-- MySQL
-- Check for backup tracking
SELECT 
    EXISTS (
        SELECT 1 FROM information_schema.tables
        WHERE table_schema = DATABASE()
        AND table_name IN ('backup_history', 'backup_tests', 'recovery_tests')
    ) as has_backup_tracking;

-- Identify large tables (backup priority)
SELECT 
    table_schema,
    table_name,
    ROUND(data_length/1024/1024, 2) as table_size_mb,
    ROUND(index_length/1024/1024, 2) as index_size_mb,
    table_rows,
    CASE 
        WHEN update_time IS NULL THEN 'Never updated - check backup coverage'
        WHEN update_time < NOW() - INTERVAL 30 DAY THEN 'Not recently updated'
        ELSE 'Recently updated - ensure backup current'
    END as backup_priority
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_type = 'BASE TABLE'
ORDER BY data_length DESC
LIMIT 20;

-- Check for migration planning tables
SELECT 
    table_name,
    column_name,
    CASE 
        WHEN column_name LIKE '%backup%' THEN 'Has backup reference'
        WHEN column_name LIKE '%rollback%' THEN 'Has rollback planning'
        WHEN column_name LIKE '%verified%' THEN 'Has verification tracking'
        ELSE 'Check column purpose'
    END as migration_safety_feature
FROM information_schema.columns
WHERE table_schema = DATABASE()
AND table_name LIKE '%migration%'
AND (column_name LIKE '%backup%' 
     OR column_name LIKE '%rollback%' 
     OR column_name LIKE '%verif%');
```
```sql
-- SQL Server (T-SQL)
-- Check for backup tracking
SELECT 
    CASE WHEN EXISTS (
        SELECT 1 FROM sys.tables
        WHERE name IN ('backup_history', 'backup_tests', 'recovery_tests')
    ) THEN 1 ELSE 0 END as has_backup_tracking;

-- Check backup history
SELECT TOP 10
    database_name,
    backup_start_date,
    backup_finish_date,
    type as backup_type,
    DATEDIFF(hour, backup_finish_date, GETDATE()) as hours_since_backup,
    CASE 
        WHEN DATEDIFF(hour, backup_finish_date, GETDATE()) > 24 THEN 'OVERDUE'
        WHEN DATEDIFF(hour, backup_finish_date, GETDATE()) > 12 THEN 'WARNING'
        ELSE 'OK'
    END as backup_status
FROM msdb.dbo.backupset
WHERE database_name = DB_NAME()
ORDER BY backup_start_date DESC;

-- Check for migration planning tables
SELECT 
    t.name as table_name,
    c.name as column_name,
    CASE 
        WHEN c.name LIKE '%backup%' THEN 'Has backup reference'
        WHEN c.name LIKE '%rollback%' THEN 'Has rollback planning'
        WHEN c.name LIKE '%verified%' THEN 'Has verification tracking'
        ELSE 'Check column purpose'
    END as migration_safety_feature
FROM sys.tables t
JOIN sys.columns c ON t.object_id = c.object_id
WHERE t.name LIKE '%migration%'
AND (c.name LIKE '%backup%' 
     OR c.name LIKE '%rollback%' 
     OR c.name LIKE '%verif%');
```
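
To make this audit repeatable rather than a one-off, the tracking check can run from CI. Below is a minimal sketch using psycopg2 against the PostgreSQL variant; the DSN and exit-code convention are assumptions for illustration.

```python
import sys
import psycopg2  # pip install psycopg2-binary

AUDIT_SQL = """
SELECT EXISTS (
    SELECT 1 FROM information_schema.tables
    WHERE table_name IN ('backup_history', 'backup_tests', 'recovery_tests')
) AS has_backup_tracking;
"""

def main() -> int:
    conn = psycopg2.connect("dbname=appdb")  # adjust DSN for your environment
    try:
        with conn.cursor() as cur:
            cur.execute(AUDIT_SQL)
            (has_tracking,) = cur.fetchone()
    finally:
        conn.close()
    if not has_tracking:
        print("FAIL: no backup-tracking tables found")
        return 1
    print("OK: backup-tracking tables present")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```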

Prevention & Mitigation Best Practices

  1. Comprehensive Backup Management Schema:

    CREATE TABLE backup_catalog (
        id SERIAL PRIMARY KEY,
        backup_name VARCHAR(255) UNIQUE NOT NULL,
        backup_type VARCHAR(50) CHECK (backup_type IN ('full', 'incremental', 'differential')),
        source_database VARCHAR(255) NOT NULL,
        backup_start_time TIMESTAMP WITH TIME ZONE NOT NULL,
        backup_end_time TIMESTAMP WITH TIME ZONE,
        backup_size_bytes BIGINT,
        backup_location TEXT NOT NULL,
        encryption_key_id VARCHAR(255),
        compression_type VARCHAR(50),
        retention_days INTEGER DEFAULT 90,
        is_verified BOOLEAN DEFAULT false,
        verification_date TIMESTAMP WITH TIME ZONE,
        created_by VARCHAR(255) NOT NULL
    );
    
    CREATE TABLE backup_verification_tests (
        id SERIAL PRIMARY KEY,
        backup_id INTEGER REFERENCES backup_catalog(id),
        test_type VARCHAR(50) CHECK (test_type IN ('integrity', 'restore', 'partial_restore', 'data_validation')),
        test_start_time TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
        test_end_time TIMESTAMP WITH TIME ZONE,
        test_status VARCHAR(50),
        test_details JSONB,
        tested_by VARCHAR(255),
        test_environment VARCHAR(255),
        rows_verified BIGINT,
        checksum_match BOOLEAN
    );
    
    CREATE TABLE recovery_procedures (
        id SERIAL PRIMARY KEY,
        procedure_name VARCHAR(255) UNIQUE NOT NULL,
        procedure_type VARCHAR(50),
        estimated_recovery_time INTERVAL,
        last_tested_date DATE,
        test_success_rate DECIMAL(5,2),
        procedure_steps JSONB,
        required_resources TEXT[],
        contact_list JSONB
    );
    
  2. Migration Safety Framework (see the application-side pre-flight gate sketch after this list):

    CREATE TABLE migration_safety_checklist (
        id SERIAL PRIMARY KEY,
        migration_id UUID NOT NULL,
        checklist_item VARCHAR(255) NOT NULL,
        is_required BOOLEAN DEFAULT true,
        completed BOOLEAN DEFAULT false,
        completed_by VARCHAR(255),
        completed_at TIMESTAMP WITH TIME ZONE,
        evidence TEXT,
        UNIQUE(migration_id, checklist_item)
    );
    
    -- Mandatory checklist items (one shared migration_id for the whole checklist)
    WITH new_migration AS (SELECT gen_random_uuid() AS id)
    INSERT INTO migration_safety_checklist (migration_id, checklist_item)
    SELECT id, item
    FROM new_migration,
         unnest(ARRAY[
             'Full backup completed',
             'Backup verified restorable',
             'Recovery procedure documented',
             'Recovery time tested',
             'Rollback plan approved',
             'Data integrity queries prepared',
             'Stakeholders notified',
             'Maintenance window scheduled'
         ]) AS item;
    
    -- Enforce checklist completion
    CREATE OR REPLACE FUNCTION can_proceed_with_migration(
        p_migration_id UUID
    ) RETURNS BOOLEAN AS $$
    DECLARE
        v_incomplete_required INTEGER;
    BEGIN
        SELECT COUNT(*)
        INTO v_incomplete_required
        FROM migration_safety_checklist
        WHERE migration_id = p_migration_id
        AND is_required = true
        AND completed = false;
        
        RETURN v_incomplete_required = 0;
    END;
    $$ LANGUAGE plpgsql;
    
  3. Backup Testing Automation:

    CREATE OR REPLACE FUNCTION test_backup_recovery(
        p_backup_id INTEGER
    ) RETURNS TABLE (
        test_phase VARCHAR,
        status VARCHAR,
        details TEXT
    ) AS $$
    DECLARE
        v_backup backup_catalog;
        v_test_db VARCHAR;
        v_start_time TIMESTAMP WITH TIME ZONE;
    BEGIN
        SELECT * INTO v_backup FROM backup_catalog WHERE id = p_backup_id;
        v_test_db := 'restore_test_' || p_backup_id;
        v_start_time := NOW();
        
        -- Phase 1: Create test database
        -- NOTE: CREATE DATABASE cannot be executed from inside a function or
        -- transaction in PostgreSQL; in practice this phase is driven by an
        -- external job and is shown inline here for illustration only
        RETURN QUERY
        SELECT 'create_test_db'::VARCHAR, 
               'started'::VARCHAR, 
               ('Creating test database ' || v_test_db)::TEXT;
        
        EXECUTE 'CREATE DATABASE ' || quote_ident(v_test_db);
        
        -- Phase 2: Restore backup
        RETURN QUERY
        SELECT 'restore_backup'::VARCHAR,
               'started'::VARCHAR,
               'Restoring from ' || v_backup.backup_location::TEXT;
        
        -- (Actual restore command would go here)
        
        -- Phase 3: Verify data integrity
        RETURN QUERY
        SELECT 'verify_integrity'::VARCHAR,
               'started'::VARCHAR,
               'Running integrity checks'::TEXT;
        
        -- Phase 4: Test specific queries
        RETURN QUERY
        SELECT 'test_queries'::VARCHAR,
               'started'::VARCHAR,
               'Testing critical queries'::TEXT;
        
        -- Log results
        INSERT INTO backup_verification_tests (
            backup_id,
            test_type,
            test_start_time,
            test_end_time,
            test_status,
            test_environment
        ) VALUES (
            p_backup_id,
            'restore',
            v_start_time,
            NOW(),
            'completed',
            v_test_db
        );
        
        -- Cleanup
        EXECUTE 'DROP DATABASE IF EXISTS ' || quote_ident(v_test_db);
        
        RETURN QUERY
        SELECT 'cleanup'::VARCHAR,
               'completed'::VARCHAR,
               'Test database removed'::TEXT;
    END;
    $$ LANGUAGE plpgsql;
    
  4. Point-in-Time Recovery Tracking:

    CREATE TABLE pitr_capability (
        id SERIAL PRIMARY KEY,
        database_name VARCHAR(255),
        earliest_recovery_point TIMESTAMP WITH TIME ZONE,
        latest_recovery_point TIMESTAMP WITH TIME ZONE,
        wal_archiving_enabled BOOLEAN,
        archive_location TEXT,
        retention_policy_days INTEGER,
        last_verification_date DATE,
        verification_status VARCHAR(50)
    );
    
    -- Monitor recovery capability gaps
    CREATE VIEW recovery_capability_gaps AS
    SELECT 
        database_name,
        NOW() - latest_recovery_point as recovery_point_lag,
        CASE 
            WHEN NOW() - latest_recovery_point > INTERVAL '1 hour' THEN 'CRITICAL'
            WHEN NOW() - latest_recovery_point > INTERVAL '15 minutes' THEN 'WARNING'
            ELSE 'OK'
        END as status,
        CASE 
            WHEN NOT wal_archiving_enabled THEN 'WAL archiving disabled!'
            WHEN last_verification_date < CURRENT_DATE - 7 THEN 'Verification overdue!'
            ELSE 'Normal'
        END as issues
    FROM pitr_capability
    ORDER BY recovery_point_lag DESC;
    
  5. Additional Best Practices:

    • Implement the 3-2-1 backup rule (3 copies, 2 different media, 1 offsite)
    • Test recovery procedures at least quarterly
    • Document recovery time objectives (RTO) and recovery point objectives (RPO)
    • Use immutable backups so ransomware cannot tamper with them
    • Implement automated backup verification
    • Run "chaos recovery" drills that rehearse unannounced restores
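
As referenced in item 2, an application-side pre-flight gate can combine the checklist function with the backup catalog so no migration script starts without both. Here is a hedged sketch using psycopg2; the table and function names come from items 1 and 2 above, while the DSN is an assumption.

```python
import psycopg2  # pip install psycopg2-binary

def preflight(conn, migration_id: str) -> None:
    """Raise before any irreversible step if the safety gates are not green."""
    with conn.cursor() as cur:
        # Gate 1: every required checklist item completed (item 2's function).
        cur.execute(
            "SELECT can_proceed_with_migration(%s::uuid);", (migration_id,)
        )
        if not cur.fetchone()[0]:
            raise RuntimeError("Checklist incomplete - migration blocked")
        # Gate 2: a verified backup newer than 24 hours (item 1's catalog).
        cur.execute("""
            SELECT EXISTS (
                SELECT 1 FROM backup_catalog
                WHERE source_database = current_database()
                  AND is_verified
                  AND backup_end_time > NOW() - INTERVAL '24 hours'
            );
        """)
        if not cur.fetchone()[0]:
            raise RuntimeError("No recent verified backup - migration blocked")

# Usage:
# conn = psycopg2.connect("dbname=appdb")
# preflight(conn, migration_id)  # raises unless both gates pass
```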

Real World Examples

Myspace (2019):

```
# What happened:
- Server migration project initiated
- An estimated 50 million songs uploaded between 2003 and 2015
- 12 years of user-uploaded content
- Photos, videos, and audio files

# The failure chain:
1. Migration team assumed backups existed
2. No backup verification performed
3. Data copied to new servers
4. Old servers decommissioned
5. Corruption discovered on new servers
6. Backups either corrupted or never existed
7. Data permanently lost

# Impact:
- Roughly 14 million artists' work gone forever
- An estimated 50 million songs erased
- Cultural history destroyed
- Class action lawsuits
- Myspace's remaining reputation destroyed

# What they said:
"As a result of a server migration project, any photos, videos, and audio
files you uploaded more than three years ago may no longer be available."
```
GitLab (2017):

```
# The near-disaster timeline:
# 2017-01-31 - Production database under heavy load

# Engineer action (trying to fix replication):
$ sudo gitlab-ctl reconfigure
$ sudo gitlab-ctl restart postgresql

# Mistake: Ran on the PRODUCTION primary instead of the secondary replica
$ sudo rm -rf /var/opt/gitlab/postgresql/data
# Realized the mistake almost immediately, but too late

# Backup status discovered:
- Backup 1: Not enabled (configuration error)
- Backup 2: Failing silently for months
- Backup 3: Only uploading empty files
- Backup 4: Not turned on
- Backup 5: Outdated (6 hours old)

# Saved by:
- One engineer's manual copy taken ~6 hours earlier
- Frantic copying before more data was lost

# Lessons:
- Multiple backup systems ALL failed
- Nobody was testing restore procedures
- Monitoring wasn't alerting on backup failures
```
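
The GitLab lesson that monitoring never alerted on months of silent backup failures is the cheapest one to fix. Below is a minimal freshness-and-size monitor sketch; the backup directory, file pattern, and alert hook are assumptions.

```python
import time
from pathlib import Path

MAX_AGE_HOURS = 24
BACKUP_DIR = Path("/backups")  # assumed backup location

def newest_backup_age_hours() -> float:
    """Age in hours of the newest NON-EMPTY backup file (inf if none)."""
    # Zero-byte files don't count: one of GitLab's backup jobs was
    # "succeeding" while uploading only empty files.
    files = [f for f in BACKUP_DIR.glob("*.dump") if f.stat().st_size > 0]
    if not files:
        return float("inf")
    newest = max(f.stat().st_mtime for f in files)
    return (time.time() - newest) / 3600

age = newest_backup_age_hours()
if age > MAX_AGE_HOURS:
    # Replace with a real paging/alerting integration.
    print(f"ALERT: newest usable backup is {age:.1f}h old (limit {MAX_AGE_HOURS}h)")
```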

Prevention example (illustrative Python; BackupManager, DataValidator, the exception classes, and several helper methods are assumed to be defined elsewhere):

```python
# Safe migration implementation:
class SafeMigrationManager:
    def __init__(self, db):
        self.db = db  # async database handle (assumed interface)
        self.backup_manager = BackupManager()
        self.validator = DataValidator()

    async def execute_migration(self, migration_plan):
        # Step 1: Create and verify backup
        backup_id = await self.create_verified_backup()

        # Step 2: Test restore capability
        if not await self.test_restore(backup_id):
            raise MigrationError("Backup restore test failed")

        # Step 3: Create migration checkpoint
        checkpoint = await self.create_checkpoint()

        # Step 4: Execute with monitoring
        try:
            # Take pre-migration snapshot
            pre_snapshot = await self.validator.create_snapshot()

            # Run migration in transaction
            async with self.db.transaction() as tx:
                await migration_plan.execute()

                # Verify data integrity
                post_snapshot = await self.validator.create_snapshot()
                if not self.validator.compare_snapshots(pre_snapshot, post_snapshot):
                    raise DataIntegrityError("Migration corrupted data")

                # Commit only if valid
                await tx.commit()

        except Exception as e:
            # Automatic rollback
            await self.restore_from_checkpoint(checkpoint)
            raise MigrationFailure(f"Migration failed: {e}")

        # Step 5: Post-migration verification
        await self.run_post_migration_tests()

        # Step 6: Keep backup for defined period
        await self.backup_manager.retain_backup(
            backup_id,
            days=30,
            reason="Post-migration safety"
        )

    async def create_verified_backup(self):
        # Create backup
        backup_id = await self.backup_manager.create_backup()

        # Verify immediately
        verification = await self.backup_manager.verify_backup(backup_id)
        if not verification.passed:
            raise BackupError(f"Backup verification failed: {verification.errors}")

        # Test sample restore
        sample_test = await self.backup_manager.test_partial_restore(
            backup_id,
            tables=['critical_table_1', 'critical_table_2']
        )
        if not sample_test.successful:
            raise BackupError("Sample restore failed")

        return backup_id
```
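
A hedged usage sketch for the manager above; MigrationPlan, MigrationFailure, and the async db handle are assumed to be defined alongside it.

```python
import asyncio

async def run(db) -> None:
    # db: async database handle exposing .transaction() (assumed interface)
    manager = SafeMigrationManager(db)
    plan = MigrationPlan("2024_user_schema_split")  # hypothetical plan object
    try:
        await manager.execute_migration(plan)
    except MigrationFailure as exc:
        # By this point the manager has already restored the checkpoint.
        print(f"Migration rolled back: {exc}")

# asyncio.run(run(db))  # once a db handle is constructed
```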

Database-level safety (PostgreSQL; calculate_database_checksum() and the migration_log table are assumed to exist):

```sql
CREATE OR REPLACE FUNCTION safe_migration_wrapper(
    migration_name TEXT,
    migration_sql TEXT
) RETURNS VOID AS $$
DECLARE
    v_backup_id INTEGER;
    v_pre_checksum TEXT;
    v_post_checksum TEXT;
BEGIN
    -- Ensure a recent verified backup exists
    SELECT id INTO v_backup_id
    FROM backup_catalog
    WHERE source_database = current_database()
    AND is_verified = true
    AND backup_end_time > NOW() - INTERVAL '1 hour'
    ORDER BY backup_end_time DESC
    LIMIT 1;

    IF v_backup_id IS NULL THEN
        RAISE EXCEPTION 'No recent verified backup found';
    END IF;

    -- Calculate pre-migration checksum
    SELECT calculate_database_checksum() INTO v_pre_checksum;

    -- Execute migration
    EXECUTE migration_sql;

    -- Capture post-migration checksum so unexpected changes are auditable
    SELECT calculate_database_checksum() INTO v_post_checksum;

    -- Log migration
    INSERT INTO migration_log (
        migration_name,
        backup_id,
        pre_checksum,
        post_checksum,
        executed_at
    ) VALUES (
        migration_name,
        v_backup_id,
        v_pre_checksum,
        v_post_checksum,
        NOW()
    );
END;
$$ LANGUAGE plpgsql;
```

AI Coding Guidance/Prompt

Prompt: "When planning data migrations:"
Rules:
  - Never proceed without verified backup
  - Require successful restore test before migration
  - Flag any migration without rollback plan
  - Mandate backup retention beyond migration completion
  - Require data validation queries pre/post migration
  - Never trust untested backups
  
Example:
  # Bad: Migration without safety
  -- Just run the migration and hope
  BEGIN;
  DROP TABLE old_schema.users;
  ALTER TABLE new_schema.temp_users RENAME TO users;
  COMMIT;
  -- No backup verification, no rollback plan
  
  # Good: Safe migration process
  -- Step 1: Verify backup exists and is restorable
  DO $$
  DECLARE
      v_backup_verified BOOLEAN;
      v_migration_id UUID := gen_random_uuid();
  BEGIN
      -- Check backup verification
      SELECT 
          EXISTS(
              SELECT 1 FROM backup_verification_tests bvt
              JOIN backup_catalog bc ON bvt.backup_id = bc.id
              WHERE bc.source_database = current_database()
              AND bvt.test_status = 'completed'
              AND bvt.test_end_time > NOW() - INTERVAL '24 hours'
          ) INTO v_backup_verified;
      
      IF NOT v_backup_verified THEN
          RAISE EXCEPTION 'No verified backup in last 24 hours';
      END IF;
      
      -- Record pre-migration state (per-table checksums are
      -- application-specific and omitted here)
      INSERT INTO migration_data_snapshots (
          migration_id,
          snapshot_type,
          table_name,
          row_count
      )
      SELECT 
          v_migration_id,
          'pre_migration',
          schemaname||'.'||relname,
          n_live_tup
      FROM pg_stat_user_tables;
      
      -- Perform migration; the EXCEPTION block below acts as an implicit
      -- savepoint (explicit SAVEPOINT is not allowed inside plpgsql)
      BEGIN
          -- Migration steps here
          ALTER TABLE old_schema.users SET SCHEMA archive_schema;
          ALTER TABLE new_schema.temp_users SET SCHEMA public;
          ALTER TABLE public.temp_users RENAME TO users;
          
          -- Verify data integrity
          IF NOT verify_migration_integrity(v_migration_id) THEN
              RAISE EXCEPTION 'Data integrity check failed';
          END IF;
          
      EXCEPTION
          WHEN OTHERS THEN
              -- Changes in this block are rolled back automatically; re-raise
              RAISE;
      END;
      
      -- Record successful migration (assumes checklist rows were created
      -- for v_migration_id earlier)
      UPDATE migration_safety_checklist
      SET completed = true,
          completed_at = NOW()
      WHERE migration_id = v_migration_id;
  END $$;

Relevant Keywords

migration, backup, recovery, weakness
  • Symptoms: slow queries, data inconsistency, constraint violations
  • Preventive: schema validation, constraint enforcement, proper typing
  • Tech stack: PostgreSQL, MySQL, SQL Server, Oracle
  • Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts: some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework for understanding the far-reaching consequences of schema and data decisions that impact security and reliability, and that accumulate into ethical debt affecting real human lives. Built on research from real-world incidents, ADI distills 7 Principles that identify when these decisions are being made and how to make them better, avoiding future technical debt and the potentially catastrophic "butterfly effects" of small decisions that ripple into chaotic technical and ethical debt.