Migration Backup and Recovery Weakness
Description
Migration processes executed without verified backup and recovery capabilities. This includes untested backups, missing recovery procedures, lack of backup integrity verification, or proceeding with irreversible operations before confirming data recoverability. Often discovered only after catastrophic data loss has occurred.
Illustrative Cantor Point
The Cantor Point occurs during migration planning: choosing between thorough backup verification that delays the migration and proceeding on assumptions about backup viability. The decision to skip backup testing creates a divergent path where migration failures become permanent data loss rather than temporary setbacks.
Real-World Examples / Observed In
- Myspace (2019): Lost 50 million songs and 12 years of user content during migration due to untested backups [See: Cases-By-Year/2019 Data Integrity Failures.md#3]
- GitLab (2017): Nearly lost its entire production database; saved only by an engineer's manual backup taken six hours earlier
- Toy Story 2 (1998): Pixar nearly lost the entire movie, saved by a technical director's home backup
Common Consequences & Impacts
Technical Impacts
- Permanent data loss
- Inability to roll back
- Corrupted data states
- Lost system configurations
Human/Ethical Impacts
- Lost memories/content
- Creative work destruction
- Historical record gaps
- Trust violation
Business Impacts
- Irreplaceable content loss
- Business continuity failure
- Legal liability
- Reputation destruction
Recovery Difficulty & Escalation
ADI Principles & Axioms Violated
- Principle of Temporal Integrity: Past states must remain recoverable
- Principle of Verified Trust: Never trust untested backups
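These axioms can be made mechanical. As a minimal sketch, assuming the backup_catalog table defined under Prevention & Mitigation below, a view that can never expose an untested backup:
```sql
-- Tooling that selects restore sources from this view, rather than from
-- backup_catalog directly, cannot pick an unverified or stale backup.
-- (trusted_backups is an illustrative name, not part of the original schema.)
CREATE OR REPLACE VIEW trusted_backups AS
SELECT *
FROM backup_catalog
WHERE is_verified = true
  AND verification_date > NOW() - INTERVAL '7 days';
```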
Detection / 60-Second Audit
```sql
-- PostgreSQL
-- Check for backup tracking
SELECT
    EXISTS (
        SELECT 1 FROM information_schema.tables
        WHERE table_name IN ('backup_history', 'backup_tests', 'recovery_tests')
    ) AS has_backup_tracking;

-- Identify tables whose backup status is unknown (last analyze time as a rough proxy)
SELECT
    schemaname,
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(relid)) AS table_size,
    CASE
        WHEN last_analyze IS NULL THEN 'NEVER ANALYZED - backup status unknown'
        WHEN last_analyze < NOW() - INTERVAL '30 days' THEN 'Stale statistics - verify backup'
        ELSE 'Recently analyzed'
    END AS backup_concern
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(relid) DESC;

-- Check for migration planning tables
SELECT
    table_name,
    column_name,
    CASE
        WHEN column_name LIKE '%backup%' THEN 'Has backup reference'
        WHEN column_name LIKE '%rollback%' THEN 'Has rollback planning'
        WHEN column_name LIKE '%verified%' THEN 'Has verification tracking'
        ELSE 'Check column purpose'
    END AS migration_safety_feature
FROM information_schema.columns
WHERE table_name LIKE '%migration%'
  AND (column_name LIKE '%backup%'
       OR column_name LIKE '%rollback%'
       OR column_name LIKE '%verif%');
```
```sql
-- MySQL
-- Check for backup tracking
SELECT
EXISTS (
SELECT 1 FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_name IN ('backup_history', 'backup_tests', 'recovery_tests')
) as has_backup_tracking;
-- Identify large tables (backup priority)
SELECT
table_schema,
table_name,
ROUND(data_length/1024/1024, 2) as table_size_mb,
ROUND(index_length/1024/1024, 2) as index_size_mb,
table_rows,
CASE
WHEN update_time IS NULL THEN 'Never updated - check backup coverage'
WHEN update_time < NOW() - INTERVAL 30 DAY THEN 'Not recently updated'
ELSE 'Recently updated - ensure backup current'
END as backup_priority
FROM information_schema.tables
WHERE table_schema = DATABASE()
AND table_type = 'BASE TABLE'
ORDER BY data_length DESC
LIMIT 20;
-- Check for migration planning tables
SELECT
table_name,
column_name,
CASE
WHEN column_name LIKE '%backup%' THEN 'Has backup reference'
WHEN column_name LIKE '%rollback%' THEN 'Has rollback planning'
WHEN column_name LIKE '%verified%' THEN 'Has verification tracking'
ELSE 'Check column purpose'
END as migration_safety_feature
FROM information_schema.columns
WHERE table_schema = DATABASE()
AND table_name LIKE '%migration%'
AND (column_name LIKE '%backup%'
OR column_name LIKE '%rollback%'
OR column_name LIKE '%verif%');
```
```sql
-- SQL Server
-- Check for backup tracking
SELECT
CASE WHEN EXISTS (
SELECT 1 FROM sys.tables
WHERE name IN ('backup_history', 'backup_tests', 'recovery_tests')
) THEN 1 ELSE 0 END as has_backup_tracking;
-- Check backup history
SELECT TOP 10
database_name,
backup_start_date,
backup_finish_date,
type as backup_type,
DATEDIFF(hour, backup_finish_date, GETDATE()) as hours_since_backup,
CASE
WHEN DATEDIFF(hour, backup_finish_date, GETDATE()) > 24 THEN 'OVERDUE'
WHEN DATEDIFF(hour, backup_finish_date, GETDATE()) > 12 THEN 'WARNING'
ELSE 'OK'
END as backup_status
FROM msdb.dbo.backupset
WHERE database_name = DB_NAME()
ORDER BY backup_start_date DESC;
-- Check for migration planning tables
SELECT
t.name as table_name,
c.name as column_name,
CASE
WHEN c.name LIKE '%backup%' THEN 'Has backup reference'
WHEN c.name LIKE '%rollback%' THEN 'Has rollback planning'
WHEN c.name LIKE '%verified%' THEN 'Has verification tracking'
ELSE 'Check column purpose'
END as migration_safety_feature
FROM sys.tables t
JOIN sys.columns c ON t.object_id = c.object_id
WHERE t.name LIKE '%migration%'
AND (c.name LIKE '%backup%'
OR c.name LIKE '%rollback%'
OR c.name LIKE '%verif%');
```
Prevention & Mitigation Best Practices
Comprehensive Backup Management Schema:
```sql
CREATE TABLE backup_catalog (
    id SERIAL PRIMARY KEY,
    backup_name VARCHAR(255) UNIQUE NOT NULL,
    backup_type VARCHAR(50) CHECK (backup_type IN ('full', 'incremental', 'differential')),
    source_database VARCHAR(255) NOT NULL,
    backup_start_time TIMESTAMP WITH TIME ZONE NOT NULL,
    backup_end_time TIMESTAMP WITH TIME ZONE,
    backup_size_bytes BIGINT,
    backup_location TEXT NOT NULL,
    encryption_key_id VARCHAR(255),
    compression_type VARCHAR(50),
    retention_days INTEGER DEFAULT 90,
    is_verified BOOLEAN DEFAULT false,
    verification_date TIMESTAMP WITH TIME ZONE,
    created_by VARCHAR(255) NOT NULL
);

CREATE TABLE backup_verification_tests (
    id SERIAL PRIMARY KEY,
    backup_id INTEGER REFERENCES backup_catalog(id),
    test_type VARCHAR(50) CHECK (test_type IN ('integrity', 'restore', 'partial_restore', 'data_validation')),
    test_start_time TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    test_end_time TIMESTAMP WITH TIME ZONE,
    test_status VARCHAR(50),
    test_details JSONB,
    tested_by VARCHAR(255),
    test_environment VARCHAR(255),
    rows_verified BIGINT,
    checksum_match BOOLEAN
);

CREATE TABLE recovery_procedures (
    id SERIAL PRIMARY KEY,
    procedure_name VARCHAR(255) UNIQUE NOT NULL,
    procedure_type VARCHAR(50),
    estimated_recovery_time INTERVAL,
    last_tested_date DATE,
    test_success_rate DECIMAL(5,2),
    procedure_steps JSONB,
    required_resources TEXT[],
    contact_list JSONB
);
```
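A hypothetical registration flow against this schema; the backup name, database, and storage location are illustrative values, not part of the original design:
```sql
-- Register a new full backup; verification happens as a separate step
-- so that is_verified is never set optimistically.
INSERT INTO backup_catalog (
    backup_name, backup_type, source_database,
    backup_start_time, backup_location, created_by
) VALUES (
    'orders_db_2024_06_01_full', 'full', 'orders_db',
    NOW(), 's3://example-backups/orders_db/2024-06-01.dump', 'dba_team'
) RETURNING id;

-- Only after a restore test passes is the backup marked trustworthy.
UPDATE backup_catalog
SET is_verified = true,
    verification_date = NOW()
WHERE backup_name = 'orders_db_2024_06_01_full';
```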
Migration Safety Framework:
```sql
CREATE TABLE migration_safety_checklist (
    id SERIAL PRIMARY KEY,
    migration_id UUID NOT NULL,
    checklist_item VARCHAR(255) NOT NULL,
    is_required BOOLEAN DEFAULT true,
    completed BOOLEAN DEFAULT false,
    completed_by VARCHAR(255),
    completed_at TIMESTAMP WITH TIME ZONE,
    evidence TEXT,
    UNIQUE(migration_id, checklist_item)
);

-- Mandatory checklist items
INSERT INTO migration_safety_checklist (migration_id, checklist_item) VALUES
    (gen_random_uuid(), 'Full backup completed'),
    (gen_random_uuid(), 'Backup verified restorable'),
    (gen_random_uuid(), 'Recovery procedure documented'),
    (gen_random_uuid(), 'Recovery time tested'),
    (gen_random_uuid(), 'Rollback plan approved'),
    (gen_random_uuid(), 'Data integrity queries prepared'),
    (gen_random_uuid(), 'Stakeholders notified'),
    (gen_random_uuid(), 'Maintenance window scheduled');

-- Enforce checklist completion
CREATE OR REPLACE FUNCTION can_proceed_with_migration(
    p_migration_id UUID
) RETURNS BOOLEAN AS $$
DECLARE
    v_incomplete_required INTEGER;
BEGIN
    SELECT COUNT(*) INTO v_incomplete_required
    FROM migration_safety_checklist
    WHERE migration_id = p_migration_id
      AND is_required = true
      AND completed = false;
    RETURN v_incomplete_required = 0;
END;
$$ LANGUAGE plpgsql;
```
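A short usage sketch: gate the migration script on the checklist function (the UUID is a placeholder for a real migration id):
```sql
DO $$
BEGIN
    -- Abort the script unless every required checklist item is complete.
    IF NOT can_proceed_with_migration('11111111-1111-1111-1111-111111111111'::uuid) THEN
        RAISE EXCEPTION 'Migration blocked: required safety checklist items incomplete';
    END IF;
END $$;
```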
Backup Testing Automation:
```sql
CREATE OR REPLACE FUNCTION test_backup_recovery(
    p_backup_id INTEGER
) RETURNS TABLE (
    test_phase VARCHAR,
    status VARCHAR,
    details TEXT
) AS $$
DECLARE
    v_backup backup_catalog;
    v_test_db VARCHAR;
    v_start_time TIMESTAMP WITH TIME ZONE;
BEGIN
    SELECT * INTO v_backup FROM backup_catalog WHERE id = p_backup_id;
    v_test_db := 'restore_test_' || p_backup_id;
    v_start_time := NOW();

    -- Phase 1: Create test database
    -- (Note: CREATE DATABASE cannot run inside a function's transaction;
    -- in practice this phase is driven by an external job or via dblink.)
    RETURN QUERY SELECT 'create_test_db'::VARCHAR, 'started'::VARCHAR,
        ('Creating test database ' || v_test_db)::TEXT;
    EXECUTE 'CREATE DATABASE ' || quote_ident(v_test_db);

    -- Phase 2: Restore backup
    RETURN QUERY SELECT 'restore_backup'::VARCHAR, 'started'::VARCHAR,
        ('Restoring from ' || v_backup.backup_location)::TEXT;
    -- (Actual restore command would go here)

    -- Phase 3: Verify data integrity
    RETURN QUERY SELECT 'verify_integrity'::VARCHAR, 'started'::VARCHAR,
        'Running integrity checks'::TEXT;

    -- Phase 4: Test specific queries
    RETURN QUERY SELECT 'test_queries'::VARCHAR, 'started'::VARCHAR,
        'Testing critical queries'::TEXT;

    -- Log results
    INSERT INTO backup_verification_tests (
        backup_id, test_type, test_start_time, test_end_time,
        test_status, test_environment
    ) VALUES (
        p_backup_id, 'restore', v_start_time, NOW(), 'completed', v_test_db
    );

    -- Cleanup
    EXECUTE 'DROP DATABASE IF EXISTS ' || quote_ident(v_test_db);
    RETURN QUERY SELECT 'cleanup'::VARCHAR, 'completed'::VARCHAR,
        'Test database removed'::TEXT;
END;
$$ LANGUAGE plpgsql;
```
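Invocation sketch (backup id 42 is illustrative; note the CREATE DATABASE caveat in the function body):
```sql
-- Run a restore drill against backup 42 and inspect each phase.
SELECT test_phase, status, details
FROM test_backup_recovery(42);
```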
Point-in-Time Recovery Tracking:
```sql
CREATE TABLE pitr_capability (
    id SERIAL PRIMARY KEY,
    database_name VARCHAR(255),
    earliest_recovery_point TIMESTAMP WITH TIME ZONE,
    latest_recovery_point TIMESTAMP WITH TIME ZONE,
    wal_archiving_enabled BOOLEAN,
    archive_location TEXT,
    retention_policy_days INTEGER,
    last_verification_date DATE,
    verification_status VARCHAR(50)
);

-- Monitor recovery capability gaps
CREATE VIEW recovery_capability_gaps AS
SELECT
    database_name,
    NOW() - latest_recovery_point AS recovery_point_lag,
    CASE
        WHEN NOW() - latest_recovery_point > INTERVAL '1 hour' THEN 'CRITICAL'
        WHEN NOW() - latest_recovery_point > INTERVAL '15 minutes' THEN 'WARNING'
        ELSE 'OK'
    END AS status,
    CASE
        WHEN NOT wal_archiving_enabled THEN 'WAL archiving disabled!'
        WHEN last_verification_date < CURRENT_DATE - 7 THEN 'Verification overdue!'
        ELSE 'Normal'
    END AS issues
FROM pitr_capability
ORDER BY recovery_point_lag DESC;
```
Additional Best Practices:
- Implement the 3-2-1 backup rule (3 copies, 2 different media, 1 offsite); a compliance check is sketched after this list
- Test recovery procedures at least quarterly
- Document recovery time objectives (RTO) and recovery point objectives (RPO)
- Use immutable backups to prevent ransomware
- Implement automated backup verification
- Create "chaos recovery" drills
Real World Examples
Myspace (2019):
```text
# What happened:
- Server migration project initiated
- 50 million songs from 2003-2015
- 12 years of user-uploaded content
- Photos, videos, and audio files

# The failure chain:
1. Migration team assumed backups existed
2. No backup verification performed
3. Data copied to new servers
4. Old servers decommissioned
5. Corruption discovered on new servers
6. Backups either corrupted or never existed
7. Data permanently lost

# Impact:
- 14 million artists' work gone forever
- 50 million songs erased
- Cultural history destroyed
- Class action lawsuits
- Myspace's remaining reputation destroyed

# What they said:
"As a result of a server migration project,
any photos, videos, and audio files you
uploaded more than three years ago may
no longer be available."
```
GitLab (2017):
```text
# The near-disaster timeline:
# 2017-01-31 - Production database under heavy load

# Engineer action (trying to fix replication):
$ sudo gitlab-ctl reconfigure
$ sudo gitlab-ctl restart postgresql

# Mistake: ran on PRODUCTION instead of staging
$ sudo rm -rf /var/opt/gitlab/postgresql/data

# Realized the mistake immediately, but too late

# Backup status discovered:
- Backup 1: Not enabled (configuration error)
- Backup 2: Failing silently for months
- Backup 3: Only uploading empty files
- Backup 4: Not turned on
- Backup 5: Outdated (6 hours old)

# Saved by:
- One engineer's manual backup from 6 hours earlier
- Frantic copying before more data was lost

# Lessons:
- Multiple backup systems ALL failed
- Nobody was testing restore procedures
- Monitoring wasn't alerting on backup failures
```
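That last lesson is directly checkable. A minimal freshness alert, sketched against the backup_catalog schema from the Prevention section (the 24-hour threshold is illustrative):
```sql
-- Page when a database has no verified backup newer than 24 hours.
-- Every database always gets a status row, so a silent gap cannot hide.
SELECT
    source_database,
    MAX(backup_end_time) FILTER (WHERE is_verified) AS last_verified_backup,
    CASE
        WHEN MAX(backup_end_time) FILTER (WHERE is_verified) IS NULL
            THEN 'ALERT: no verified backup exists'
        WHEN MAX(backup_end_time) FILTER (WHERE is_verified) < NOW() - INTERVAL '24 hours'
            THEN 'ALERT: newest verified backup older than 24 hours'
        ELSE 'OK'
    END AS backup_alert
FROM backup_catalog
GROUP BY source_database;
```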
Safe migration implementation:
```python
# Illustrative scaffold: BackupManager, DataValidator, the db handle, and the
# error classes are assumed helpers, not a specific library's API.
class SafeMigrationManager:
    def __init__(self, db):
        self.db = db  # assumed async database handle, injected by the caller
        self.backup_manager = BackupManager()
        self.validator = DataValidator()

    async def execute_migration(self, migration_plan):
        # Step 1: Create and verify backup
        backup_id = await self.create_verified_backup()

        # Step 2: Test restore capability
        if not await self.test_restore(backup_id):
            raise MigrationError("Backup restore test failed")

        # Step 3: Create migration checkpoint
        checkpoint = await self.create_checkpoint()

        # Step 4: Execute with monitoring
        try:
            # Take pre-migration snapshot
            pre_snapshot = await self.validator.create_snapshot()

            # Run migration in transaction
            async with self.db.transaction() as tx:
                await migration_plan.execute()

                # Verify data integrity
                post_snapshot = await self.validator.create_snapshot()
                if not self.validator.compare_snapshots(pre_snapshot, post_snapshot):
                    raise DataIntegrityError("Migration corrupted data")

                # Commit only if valid
                await tx.commit()
        except Exception as e:
            # Automatic rollback
            await self.restore_from_checkpoint(checkpoint)
            raise MigrationFailure(f"Migration failed: {e}")

        # Step 5: Post-migration verification
        await self.run_post_migration_tests()

        # Step 6: Keep backup for defined period
        await self.backup_manager.retain_backup(
            backup_id,
            days=30,
            reason="Post-migration safety"
        )

    async def create_verified_backup(self):
        # Create backup
        backup_id = await self.backup_manager.create_backup()

        # Verify immediately
        verification = await self.backup_manager.verify_backup(backup_id)
        if not verification.passed:
            raise BackupError(f"Backup verification failed: {verification.errors}")

        # Test sample restore
        sample_test = await self.backup_manager.test_partial_restore(
            backup_id,
            tables=['critical_table_1', 'critical_table_2']
        )
        if not sample_test.successful:
            raise BackupError("Sample restore failed")

        return backup_id
```
Database-level safety:
```sql
-- calculate_database_checksum() and migration_log are assumed helpers.
CREATE OR REPLACE FUNCTION safe_migration_wrapper(
    migration_name TEXT,
    migration_sql TEXT
) RETURNS VOID AS $$
DECLARE
    v_backup_id INTEGER;
    v_pre_checksum TEXT;
    v_post_checksum TEXT;
BEGIN
    -- Ensure a recent verified backup exists
    SELECT id INTO v_backup_id
    FROM backup_catalog
    WHERE source_database = current_database()
      AND is_verified = true
      AND backup_end_time > NOW() - INTERVAL '1 hour'
    ORDER BY backup_end_time DESC
    LIMIT 1;

    IF v_backup_id IS NULL THEN
        RAISE EXCEPTION 'No recent verified backup found';
    END IF;

    -- Calculate pre-migration checksum
    SELECT calculate_database_checksum() INTO v_pre_checksum;

    -- Execute migration
    EXECUTE migration_sql;

    -- Capture post-migration checksum for comparison during review
    SELECT calculate_database_checksum() INTO v_post_checksum;

    -- Log migration
    INSERT INTO migration_log (
        migration_name,
        backup_id,
        pre_checksum,
        post_checksum,
        executed_at
    ) VALUES (
        migration_name,
        v_backup_id,
        v_pre_checksum,
        v_post_checksum,
        NOW()
    );
END;
$$ LANGUAGE plpgsql;
```
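A hypothetical invocation (the migration name and DDL are illustrative); the wrapper refuses to run unless a verified backup from the last hour exists:
```sql
SELECT safe_migration_wrapper(
    'add_orders_created_at_index',
    'CREATE INDEX idx_orders_created_at ON orders (created_at)'
);
```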
AI Coding Guidance/Prompt
Prompt: "When planning data migrations:"
Rules:
- Never proceed without verified backup
- Require successful restore test before migration
- Flag any migration without rollback plan
- Mandate backup retention beyond migration completion
- Require data validation queries pre/post migration
- Never trust untested backups
Example:
# Bad: Migration without safety
```sql
-- Just run the migration and hope
BEGIN;
DROP TABLE old_schema.users;
ALTER TABLE new_schema.temp_users RENAME TO users;
COMMIT;
-- No backup verification, no rollback plan
```
# Good: Safe migration process
```sql
-- Step 1: Verify backup exists and is restorable
DO $$
DECLARE
    v_backup_verified BOOLEAN;
    v_migration_id UUID := gen_random_uuid();
BEGIN
    -- Check backup verification
    SELECT EXISTS (
        SELECT 1
        FROM backup_verification_tests bvt
        JOIN backup_catalog bc ON bvt.backup_id = bc.id
        WHERE bc.source_database = current_database()
          AND bvt.test_status = 'completed'
          AND bvt.test_end_time > NOW() - INTERVAL '24 hours'
    ) INTO v_backup_verified;

    IF NOT v_backup_verified THEN
        RAISE EXCEPTION 'No verified backup in last 24 hours';
    END IF;

    -- Record pre-migration state (migration_data_snapshots is an assumed table)
    INSERT INTO migration_data_snapshots (
        migration_id,
        snapshot_type,
        table_name,
        row_count,
        checksum
    )
    SELECT
        v_migration_id,
        'pre_migration',
        schemaname || '.' || relname,
        n_live_tup,
        md5(schemaname || '.' || relname || ':' || n_live_tup)  -- coarse placeholder checksum
    FROM pg_stat_user_tables;

    -- Perform migration; the nested block acts as an implicit savepoint,
    -- so an exception rolls back only the block's changes
    BEGIN
        -- Migration steps here
        ALTER TABLE old_schema.users SET SCHEMA archive_schema;
        ALTER TABLE new_schema.temp_users SET SCHEMA public;
        ALTER TABLE public.temp_users RENAME TO users;

        -- Verify data integrity
        IF NOT verify_migration_integrity(v_migration_id) THEN
            RAISE EXCEPTION 'Data integrity check failed';
        END IF;
    EXCEPTION
        WHEN OTHERS THEN
            -- Sub-block changes are rolled back automatically; re-raise to abort
            RAISE;
    END;

    -- Record successful migration
    UPDATE migration_safety_checklist
    SET completed = true,
        completed_at = NOW()
    WHERE migration_id = v_migration_id;
END $$;
```
Relevant Keywords
migration backup recovery weakness Symptoms: data loss after migration, failed rollback, unrestorable backups, silent backup failures Preventive: backup verification, restore testing, rollback planning, retention policies Tech stack: PostgreSQL, MySQL, SQL Server, Oracle Industry: all industries, enterprise, SaaS