Non-Atomic Schema Migrations
Description
Deploying schema changes and application code simultaneously, without backward compatibility, creates windows where the system is in an inconsistent state. This includes "big bang" migrations that change multiple aspects at once and deployments that assume perfect synchronization between database and application changes.
Illustrative Cantor Point
The Cantor Point occurs at the choice of deployment strategy: "deploy everything at once" versus a multi-phase migration. Prioritizing speed over safety creates a critical vulnerability window in which any rollback becomes extremely complex.
Real-World Examples / Observed In
- GitHub (2021): Table rename during migration triggered MySQL replica failures due to semaphore deadlock
- TSB Bank (2018): "Big bang" migration of 5 million accounts failed catastrophically [See: Cases-By-Year/2018 Data Integrity Failures.md#1]
- Common Pattern: Application expects new schema while some instances still use old schema, causing 500 errors
Common Consequences & Impacts
Technical Impacts
- Inconsistent state during deployment
- Failed requests during transition
- Complex rollback scenarios
- Potential data corruption
Human/Ethical Impacts
- Users experiencing errors
- On-call stress during deployments
- Fear of making necessary changes
- Accumulated technical debt
Business Impacts
- Service disruptions
- Lost transactions
- Emergency rollbacks
- Reduced deployment confidence
Recovery Difficulty & Escalation
ADI Principles & Axioms Violated
- Principle of Fragile Stability: Transition states are inherently vulnerable
- Principle of Cascading Catastrophes: Migration failures cascade across systems
- Principle of Inevitable Contradiction: Safety mechanisms can cause failures
Detection / 60-Second Audit
-- PostgreSQL: flag tables that already rely on NOT NULL columns without
-- defaults; adding another such column to them is a breaking change
SELECT
    schemaname,
    tablename,
    'ALTER TABLE ' || tablename || ' ADD COLUMN new_col TYPE NOT NULL' AS risky_migration
FROM pg_tables
WHERE schemaname = 'public'
  AND EXISTS (
      SELECT 1 FROM information_schema.columns
      WHERE table_name = tablename
        AND is_nullable = 'NO'
        AND column_default IS NULL
  );
-- PostgreSQL: identify tables with high write volume (risky targets for
-- in-place migrations); note pg_stat_user_tables exposes relname, not tablename
SELECT
    schemaname,
    relname AS tablename,
    n_tup_ins + n_tup_upd + n_tup_del AS write_volume,
    CASE
        WHEN n_tup_ins + n_tup_upd + n_tup_del > 10000 THEN 'High risk - needs careful migration'
        ELSE 'Lower risk'
    END AS migration_risk
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY write_volume DESC
LIMIT 10;
-- MySQL: check table sizes to choose a migration strategy
-- (TABLE_ROWS is an estimate for InnoDB tables)
SELECT
    TABLE_NAME,
    TABLE_ROWS,
    CASE
        WHEN TABLE_ROWS > 100000 THEN 'Large table - needs online migration'
        ELSE 'Can use standard ALTER'
    END AS migration_strategy
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_ROWS DESC;
-- MySQL: find nullable columns without defaults (making them NOT NULL
-- will fail while NULL values remain)
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    IS_NULLABLE,
    COLUMN_DEFAULT
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND IS_NULLABLE = 'YES'
  AND COLUMN_DEFAULT IS NULL;
-- SQL Server: size up tables that need careful migration planning
-- (restrict to heap/clustered index so row counts are not duplicated per index)
SELECT
    t.name AS TableName,
    p.rows AS RowCount,
    CASE
        WHEN p.rows > 100000 THEN 'Large table - use batched migration'
        ELSE 'Standard migration OK'
    END AS MigrationStrategy,
    SUM(a.total_pages) * 8 / 1024 AS TotalSizeMB
FROM sys.tables t
INNER JOIN sys.indexes i ON t.object_id = i.object_id AND i.index_id <= 1
INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY t.name, p.rows
ORDER BY p.rows DESC;
-- SQL Server: find nullable columns (converting them to NOT NULL requires
-- a full scan to validate existing rows)
SELECT
    OBJECT_NAME(object_id) AS TableName,
    name AS ColumnName,
    is_nullable,
    'Making this NOT NULL will require a table scan' AS MigrationNote
FROM sys.columns
WHERE is_nullable = 1
  AND object_id IN (SELECT object_id FROM sys.tables WHERE is_ms_shipped = 0);
Prevention & Mitigation Best Practices
Expand-Contract Pattern:
- Phase 1: Add new schema elements (backward compatible)
- Phase 2: Deploy code that uses both old and new
- Phase 3: Migrate data and switch to new
- Phase 4: Remove old schema elements
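The four phases can be walked end-to-end in a few lines. This is a minimal sketch using Python's bundled sqlite3 in-memory database, with the order/customer naming used in the examples later in this entry; a real rollout runs each phase as a separate deploy with verification in between:

```python
import sqlite3

# Legacy schema: orders carry a free-text customer_name.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_name TEXT)")
db.execute("INSERT INTO customers (id, name) VALUES (1, 'acme')")
db.execute("INSERT INTO orders (id, customer_name) VALUES (10, 'acme')")

# Phase 1 (expand): add the new column, nullable so old writers keep working.
db.execute("ALTER TABLE orders ADD COLUMN customer_id INTEGER")

# Phase 2 (dual-use): new code writes both columns; old code keeps writing
# only customer_name and nothing breaks.
db.execute("INSERT INTO orders (id, customer_name, customer_id) VALUES (11, 'acme', 1)")

# Phase 3 (migrate): backfill the new column for pre-existing rows.
db.execute(
    "UPDATE orders SET customer_id ="
    " (SELECT id FROM customers WHERE name = orders.customer_name)"
    " WHERE customer_id IS NULL"
)

# Phase 4 (contract) would DROP COLUMN customer_name, but only after every
# application instance has stopped reading it.
print(db.execute("SELECT id, customer_id FROM orders ORDER BY id").fetchall())
# → [(10, 1), (11, 1)]
```

Each phase is individually backward compatible, so a failure at any step leaves a working system to roll back to.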
- Feature Flags: Gate schema-dependent code behind runtime flags
- Blue-Green Deployments: Maintain two environments and cut over only after verification
- Backward Compatibility First: New code must work with the old schema
- Forward Compatibility: Old code should handle the new schema gracefully
- Migration Testing: Test on production-scale data
- Rollback Planning: Every migration needs a reversal plan
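Feature-flag gating can be sketched in a few lines; here a hand-rolled dict stands in for a real feature-flag service, and the flag name and column lists are illustrative, not from any specific library:

```python
# Flag state decides which schema path runs; the deploy does not.
FLAGS = {"orders_use_customer_id": False}

def order_insert_columns():
    """Return the column list the current flag state permits."""
    if FLAGS["orders_use_customer_id"]:
        return "customer_id, items, total"   # new-schema path
    return "customer_name, items, total"     # legacy-schema path

print(order_insert_columns())  # → customer_name, items, total

# Flip the flag only after the migration is verified; no redeploy needed,
# and flipping it back is the rollback.
FLAGS["orders_use_customer_id"] = True
print(order_insert_columns())  # → customer_id, items, total
```

Because both code paths ship together, the schema change and the behavior change are decoupled, which is the core of the pattern.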
Real World Examples
-- Dangerous: non-atomic migration that will cause downtime.
-- Schema change and application code ship together; any instance running
-- old code against the new schema (or vice versa) fails.
-- Migration script:
ALTER TABLE orders DROP COLUMN customer_name;
ALTER TABLE orders ADD COLUMN customer_id INTEGER NOT NULL;
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers(id);

// New application code (deployed simultaneously):
const createOrder = async (customerId, items) => {
  // This will fail if the migration hasn't completed
  return db.query(
    'INSERT INTO orders (customer_id, items, total) VALUES ($1, $2, $3)',
    [customerId, items, calculateTotal(items)]
  );
};
-- Another dangerous example: renaming a column without backward compatibility.
-- Migration:
ALTER TABLE users RENAME COLUMN username TO user_name;

// Application code assumes the new column name immediately:
const getUser = async (userId) => {
  // Fails on any app instance that hasn't updated yet
  return db.query('SELECT id, user_name, email FROM users WHERE id = $1', [userId]);
};
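The rename race is easy to reproduce; this sketch uses Python's sqlite3 (RENAME COLUMN needs SQLite 3.25+) purely to show the failure mode, since any database behaves the same way:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")

# The migration runs...
db.execute("ALTER TABLE users RENAME COLUMN username TO user_name")

# ...but an app instance that has not been redeployed still issues the
# old query, and it now fails on every request.
try:
    db.execute("SELECT id, username FROM users WHERE id = ?", (1,))
except sqlite3.OperationalError as err:
    print("stale instance fails:", err)
```

This is exactly the window the multi-phase approach below closes: old and new readers must both work until every instance is updated.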
-- Safe: multi-phase migration with backward compatibility
-- Phase 1: Add new structure (backward compatible)
ALTER TABLE orders ADD COLUMN customer_id INTEGER;

-- Create a trigger to keep old and new columns in sync during the transition
CREATE OR REPLACE FUNCTION sync_customer_data() RETURNS TRIGGER AS $$
BEGIN
    IF NEW.customer_id IS NOT NULL AND NEW.customer_name IS NULL THEN
        SELECT name INTO NEW.customer_name FROM customers WHERE id = NEW.customer_id;
    ELSIF NEW.customer_name IS NOT NULL AND NEW.customer_id IS NULL THEN
        SELECT id INTO NEW.customer_id FROM customers WHERE name = NEW.customer_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_customer_sync
    BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION sync_customer_data();
// Phase 2: Deploy code that handles both schemas
const createOrder = async (customerIdOrName, items) => {
  // Works with both old and new schema
  const customer = typeof customerIdOrName === 'number'
    ? { id: customerIdOrName }
    : await getCustomerByName(customerIdOrName);
  const columns = db.schemaVersion >= 2
    ? 'customer_id, items, total'
    : 'customer_name, items, total';
  const values = db.schemaVersion >= 2
    ? [customer.id, items, calculateTotal(items)]
    : [customer.name, items, calculateTotal(items)];
  return db.query(`INSERT INTO orders (${columns}) VALUES ($1, $2, $3)`, values);
};
-- Phase 3: Backfill data
UPDATE orders o
SET customer_id = c.id
FROM customers c
WHERE o.customer_name = c.name
AND o.customer_id IS NULL;
-- Phase 4: Add constraints (after all instances updated)
ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE orders ADD FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Phase 5: Remove old column (after verification)
DROP TRIGGER orders_customer_sync ON orders;
DROP FUNCTION sync_customer_data();
ALTER TABLE orders DROP COLUMN customer_name;
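On a large table, Phase 3's single UPDATE holds locks for the whole run. A batched variant commits in small chunks so each transaction is short. A minimal sketch with Python's sqlite3, where the batch size and row counts are illustrative:

```python
import sqlite3

# Same shape as the example above: orders with a legacy customer_name and
# a new, not-yet-populated customer_id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY,"
    " customer_name TEXT, customer_id INTEGER)"
)
db.execute("INSERT INTO customers VALUES (1, 'acme')")
db.executemany("INSERT INTO orders (id, customer_name) VALUES (?, 'acme')",
               [(i,) for i in range(250)])

BATCH = 100
while True:
    with db:  # one short transaction per batch
        cur = db.execute(
            "UPDATE orders SET customer_id ="
            " (SELECT id FROM customers WHERE name = orders.customer_name)"
            " WHERE id IN (SELECT id FROM orders"
            "              WHERE customer_id IS NULL LIMIT ?)",
            (BATCH,),
        )
    if cur.rowcount == 0:
        break  # nothing left to backfill

remaining = db.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]
print(remaining)  # → 0
```

Between batches, concurrent writers are free to proceed, which is why batched backfills are the standard approach for high-traffic tables.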
AI Coding Guidance/Prompt
Prompt: "When planning schema migrations:"
Rules:
- Never combine schema changes with code deployment
- Always use multi-phase migrations for breaking changes
- Require backward compatibility testing
- Flag any DDL without rollback plan
Example:
-- Bad: one-step breaking change
ALTER TABLE orders DROP COLUMN legacy_status;

-- Good: multi-phase migration
-- Phase 1: Add new column (deployed independently)
ALTER TABLE orders ADD COLUMN status VARCHAR(20);
-- Phase 2: Backfill and dual-write (code update)
UPDATE orders SET status =
    CASE legacy_status
        WHEN 1 THEN 'pending'
        WHEN 2 THEN 'completed'
    END;
-- Phase 3: Switch reads to new column (code update)
-- Phase 4: Stop writing to old column (code update)
-- Phase 5: Drop old column (after verification)
ALTER TABLE orders DROP COLUMN legacy_status;
Relevant Keywords
non-atomic schema migrations
Symptoms: slow queries, data inconsistency, constraint violations
Preventive: schema validation, constraint enforcement, proper typing
Tech stack: PostgreSQL, MySQL, SQL Server, Oracle
Industry: all industries, enterprise, SaaS