SWE-2001

Non-Atomic Schema Migrations

Cantorian Technical Debt Magnitude: ℵ₀ (Countable)

Description

Deploying schema changes and application code simultaneously without backward compatibility, creating windows where the system is in an inconsistent state. This includes "big bang" migrations that change multiple aspects at once and deployments that assume perfect synchronization between database and application changes.

Illustrative Cantor Point

The Cantor Point occurs when choosing a deployment strategy: opting for "deploy everything at once" versus implementing a multi-phase migration. This decision to prioritize speed over safety creates a critical vulnerability window in which any rollback becomes extremely complex.

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • GitHub (2021): Table rename during migration triggered MySQL replica failures due to semaphore deadlock
  • TSB Bank (2018): "Big bang" migration of 5 million accounts failed catastrophically [See: Cases-By-Year/2018 Data Integrity Failures.md#1]
  • Common Pattern: Application code expects the new schema while some instances still run against the old one, causing 500 errors

Common Consequences & Impacts

Technical Impacts

  • Inconsistent state during deployment
  • Failed requests during transition
  • Complex rollback scenarios
  • Potential data corruption

Human/Ethical Impacts

  • Users experiencing errors
  • On-call stress during deployments
  • Fear of making necessary changes
  • Accumulated technical debt

Business Impacts

  • Service disruptions
  • Lost transactions
  • Emergency rollbacks
  • Reduced deployment confidence

Recovery Difficulty & Escalation

Recovery Difficulty: 5.5
Escalation: 8.5

ADI Principles & Axioms Violated

  • Principle of Fragile Stability: Transition states are inherently vulnerable
  • Principle of Cascading Catastrophes: Migration failures cascade across systems
  • Principle of Inevitable Contradiction: Safety mechanisms can cause failures

Detection / 60-Second Audit

-- PostgreSQL: flag tables that already carry NOT NULL columns without defaults (adding more such columns is a breaking change)
SELECT 
    schemaname,
    tablename,
    'ALTER TABLE ' || tablename || ' ADD COLUMN new_col TYPE NOT NULL' as risky_migration
FROM pg_tables
WHERE schemaname = 'public'
AND EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name = tablename
    AND is_nullable = 'NO'
    AND column_default IS NULL
);

-- PostgreSQL: identify high-write tables (cumulative counters since the last stats reset); these are riskiest to migrate
SELECT 
    schemaname,
    tablename,
    n_tup_ins + n_tup_upd + n_tup_del as write_volume,
    CASE 
        WHEN n_tup_ins + n_tup_upd + n_tup_del > 10000 THEN 'High risk - needs careful migration'
        ELSE 'Lower risk'
    END as migration_risk
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY write_volume DESC
LIMIT 10;

-- MySQL: classify tables by approximate row count to pick a migration strategy
SELECT 
    TABLE_NAME,
    TABLE_ROWS,
    CASE 
        WHEN TABLE_ROWS > 100000 THEN 'Large table - needs online migration'
        ELSE 'Can use standard ALTER'
    END as migration_strategy
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_ROWS DESC;

-- MySQL: find nullable columns without defaults (making them NOT NULL requires a backfill first)
SELECT 
    TABLE_NAME,
    COLUMN_NAME,
    IS_NULLABLE,
    COLUMN_DEFAULT
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND IS_NULLABLE = 'YES'
AND COLUMN_DEFAULT IS NULL;

-- SQL Server: size up tables that need careful migration planning
SELECT 
    t.name AS TableName,
    p.rows AS RowCount,
    CASE 
        WHEN p.rows > 100000 THEN 'Large table - use batched migration'
        ELSE 'Standard migration OK'
    END as MigrationStrategy,
    SUM(a.total_pages) * 8 / 1024 AS TotalSizeMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY t.name, p.rows
ORDER BY p.rows DESC;

-- SQL Server: check for columns where adding NOT NULL would require a table scan
SELECT 
    OBJECT_NAME(object_id) as TableName,
    name as ColumnName,
    is_nullable,
    CASE 
        WHEN is_nullable = 1 THEN 'Adding NOT NULL will require table scan'
        ELSE 'Already NOT NULL'
    END as MigrationNote
FROM sys.columns
WHERE is_nullable = 1
AND object_id IN (SELECT object_id FROM sys.tables WHERE is_ms_shipped = 0);

Prevention & Mitigation Best Practices

  1. Expand-Contract Pattern:

    • Phase 1: Add new schema elements (backward compatible)
    • Phase 2: Deploy code that uses both old and new
    • Phase 3: Migrate data and switch to new
    • Phase 4: Remove old schema elements
  2. Feature Flags: Gate schema-dependent code (see the sketch after this list)

  3. Blue-Green Deployments: Maintain two environments

  4. Backward Compatibility First: New code must work with old schema

  5. Forward Compatibility: Old code should handle new schema gracefully

  6. Migration Testing: Test on production-scale data

  7. Rollback Planning: Every migration needs a reversal plan
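
A minimal sketch of the feature-flag gating mentioned in item 2 above, in the same JavaScript style as the examples below. The flag client, flag name, and column names are illustrative assumptions, not a prescribed implementation:

const flags = require('./feature-flags'); // hypothetical flag client

const getOrderCustomer = async (orderId) => {
  // Gate the schema-dependent read: only touch the new column once the
  // migration flag has been enabled for every running instance.
  const column = flags.isEnabled('use_customer_id_column')
    ? 'customer_id'
    : 'customer_name';
  return db.query(`SELECT id, ${column} FROM orders WHERE id = $1`, [orderId]);
};

Because the flag flips independently of deploys, a failed migration can often be "rolled back" by disabling the flag rather than reverting code.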

Real World Examples

-- Dangerous: Non-atomic migration that will cause downtime
-- This approach deploys schema change and code together

-- Migration script:
ALTER TABLE orders DROP COLUMN customer_name;
ALTER TABLE orders ADD COLUMN customer_id INTEGER NOT NULL;
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers(id);

-- New application code (deployed simultaneously):
const createOrder = async (customerId, items) => {
  // This will fail if migration hasn't completed
  return db.query(
    'INSERT INTO orders (customer_id, items, total) VALUES ($1, $2, $3)',
    [customerId, items, calculateTotal(items)]
  );
};

-- Another dangerous example: Renaming a column without backward compatibility

-- Migration:
ALTER TABLE users RENAME COLUMN username TO user_name;

-- Application code assumes new column name immediately:
const getUser = async (userId) => {
  // Fails if any app instance hasn't updated yet
  return db.query('SELECT id, user_name, email FROM users WHERE id = $1', [userId]);
};

-- Safe: Multi-phase migration with backward compatibility

-- Phase 1: Add new structure (backward compatible)
ALTER TABLE orders ADD COLUMN customer_id INTEGER;
-- Create trigger to sync old and new columns
CREATE OR REPLACE FUNCTION sync_customer_data() RETURNS TRIGGER AS $$
BEGIN
  IF NEW.customer_id IS NOT NULL AND NEW.customer_name IS NULL THEN
    SELECT name INTO NEW.customer_name FROM customers WHERE id = NEW.customer_id;
  ELSIF NEW.customer_name IS NOT NULL AND NEW.customer_id IS NULL THEN
    SELECT id INTO NEW.customer_id FROM customers WHERE name = NEW.customer_name;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_customer_sync 
BEFORE INSERT OR UPDATE ON orders 
FOR EACH ROW EXECUTE FUNCTION sync_customer_data();

-- Phase 2: Deploy code that handles both schemas
const createOrder = async (customerIdOrName, items) => {
  // Works with both old and new schema
  const customer = typeof customerIdOrName === 'number' 
    ? { id: customerIdOrName }
    : await getCustomerByName(customerIdOrName);
    
  const columns = db.schemaVersion >= 2 
    ? 'customer_id, items, total'
    : 'customer_name, items, total';
    
  const values = db.schemaVersion >= 2
    ? [customer.id, items, calculateTotal(items)]
    : [customer.name, items, calculateTotal(items)];
    
  return db.query(`INSERT INTO orders (${columns}) VALUES ($1, $2, $3)`, values);
};

-- Phase 3: Backfill data
UPDATE orders o 
SET customer_id = c.id 
FROM customers c 
WHERE o.customer_name = c.name 
AND o.customer_id IS NULL;

-- Phase 4: Add constraints (after all instances updated)
ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers(id);

-- Phase 5: Remove old column (after verification)
DROP TRIGGER orders_customer_sync ON orders;
DROP FUNCTION sync_customer_data();
ALTER TABLE orders DROP COLUMN customer_name;
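
Before tightening constraints in Phase 4 and dropping the old column in Phase 5, it is worth verifying that the backfill actually completed. A minimal PostgreSQL sketch against the schema above, not a substitute for a full data-reconciliation step:

-- Expect zero rows still missing the new foreign key
SELECT COUNT(*) AS rows_missing_customer_id
FROM orders
WHERE customer_id IS NULL;

-- Spot-check that backfilled ids point at the intended customers
SELECT COUNT(*) AS mismatched_rows
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.customer_name IS NOT NULL
  AND o.customer_name <> c.name;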

AI Coding Guidance/Prompt

Prompt: "When planning schema migrations:"
Rules:
  - Never combine schema changes with code deployment
  - Always use multi-phase migrations for breaking changes
  - Require backward compatibility testing
  - Flag any DDL without a rollback plan
  
Example:
  # Bad: One-step breaking change
  ALTER TABLE orders DROP COLUMN legacy_status;
  
  # Good: Multi-phase migration
  -- Phase 1: Add new column (deployed independently)
  ALTER TABLE orders ADD COLUMN status VARCHAR(20);
  
  -- Phase 2: Dual-write in application code, plus a one-time backfill of existing rows:
  UPDATE orders SET status = 
    CASE legacy_status 
      WHEN 1 THEN 'pending'
      WHEN 2 THEN 'completed'
    END;
  
  -- Phase 3: Switch reads to new column (code update)
  -- Phase 4: Stop writing to old column (code update)
  -- Phase 5: Drop old column (after verification)
  ALTER TABLE orders DROP COLUMN legacy_status;
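
For the dual-write step referenced in Phase 2, a short JavaScript sketch in the style of the earlier examples; the mapping object, function name, and db helper are illustrative assumptions:

// Hypothetical mapping matching the CASE expression above
const STATUS_BY_LEGACY = { 1: 'pending', 2: 'completed' };

const setOrderStatus = async (orderId, legacyStatus) => {
  // During the dual-write period, keep both columns in sync so old and
  // new application instances read consistent data.
  return db.query(
    'UPDATE orders SET legacy_status = $1, status = $2 WHERE id = $3',
    [legacyStatus, STATUS_BY_LEGACY[legacyStatus], orderId]
  );
};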

Relevant Keywords

Keywords: atomic schema migrations
Symptoms: slow queries, data inconsistency, constraint violations
Preventive: schema validation, constraint enforcement, proper typing
Tech stack: PostgreSQL, MySQL, SQL Server, Oracle
Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts - some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework for understanding the far-reaching consequences of schema and data decisions that affect security and reliability and accumulate into ethical debt touching real human lives. Built on research from real-world incidents, ADI defines 7 Principles that identify when these decisions are being made and how to make them better, avoiding future technical debt and the potentially catastrophic "butterfly effects" of small decisions that ripple into chaotic technical and ethical debt.