SWE-1002

Under-Normalization (God Tables)

Cantorian Technical Debt Magnitude: ℵ₁ (Systemic)

Description

Creating "God tables" with dozens or hundreds of columns that mix multiple entity types or business concepts. Includes anti-patterns like repeated columns (phone1, phone2, phone3) and catch-all tables serving multiple purposes. Often starts as a shortcut but evolves into unmaintainable structures with extensive NULL values and complex conditional logic.

Illustrative Cantor Point

The Cantor Point occurs when deciding to "just add another column" instead of creating a proper relationship. Each time this convenience is chosen over proper design, the table grows wider and more unwieldy, eventually becoming impossible to refactor without massive effort.
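
A minimal sketch of the two paths at that junction, using an illustrative customers table (the column names are assumed for the example):

-- The convenient path: widen the existing table one more time
ALTER TABLE customers ADD COLUMN phone3 VARCHAR(20);
ALTER TABLE customers ADD COLUMN phone3_type VARCHAR(10);

-- The alternative at the same Cantor Point: model the relationship once
CREATE TABLE customer_phones (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(id),
    phone_type VARCHAR(10) NOT NULL,
    phone_number VARCHAR(20) NOT NULL
);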

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • CRM System: Customers table with 120 columns including multiple addresses, contacts, preferences, and historical data
  • E-commerce Platform: Products table with 500+ columns trying to accommodate every possible product attribute
  • Event System: Single events table with generic columns (int_field1, text_field1) for different event types

Common Consequences & Impacts

Technical Impacts

  • Excessive NULL values wasting storage
  • Wide rows impacting query performance
  • Update anomalies from redundant data
  • Schema changes affecting the entire application

Human/Ethical Impacts

  • Data quality issues affecting users
  • Developer frustration with complexity
  • Increased bugs from conditional logic
  • Technical debt impeding innovation

Business Impacts

  • Slow feature development
  • High risk of data inconsistency
  • Difficult onboarding for new developers
  • Resistance to necessary changes

Recovery Difficulty & Escalation

Recovery Difficulty: 6.5
Escalation: 4.5

ADI Principles & Axioms Violated

  • Principle of Cascading Catastrophes: Small conveniences accumulate into massive problems
  • Principle of Invisible Decay: Table complexity grows slowly until it's unmanageable
  • Principle of Consequential Stewardship: Poor structure leads to poor data quality

Detection / 60-Second Audit

-- PostgreSQL: find suspiciously wide tables
SELECT 
    table_name,
    COUNT(*) as column_count
FROM information_schema.columns
WHERE table_schema = 'public'
GROUP BY table_name
HAVING COUNT(*) > 30
ORDER BY column_count DESC;

-- Detect repeated column patterns
SELECT 
    table_name,
    array_agg(column_name ORDER BY column_name) as suspicious_columns
FROM information_schema.columns
WHERE table_schema = 'public'
AND column_name ~ '.*[0-9]+$'  -- Ends with number
GROUP BY table_name
HAVING COUNT(*) > 2;

-- Find tables with high NULL percentage
SELECT 
    schemaname,
    tablename,
    attname,
    null_frac
FROM pg_stats
WHERE schemaname = 'public'
AND null_frac > 0.5
ORDER BY null_frac DESC;
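
Note that pg_stats only reflects the most recent statistics collection; if a wide table shows no rows or a stale null_frac there, analyze it first (table name illustrative):

-- Refresh planner statistics so null_frac is populated for the table
ANALYZE customers;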

-- MySQL: find suspiciously wide tables
SELECT 
    TABLE_NAME,
    COUNT(*) as column_count
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
GROUP BY TABLE_NAME
HAVING COUNT(*) > 30
ORDER BY column_count DESC;

-- Detect repeated column patterns
SELECT 
    TABLE_NAME,
    GROUP_CONCAT(COLUMN_NAME ORDER BY COLUMN_NAME) as suspicious_columns
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
AND COLUMN_NAME REGEXP '[0-9]+$'  -- Ends with number
GROUP BY TABLE_NAME
HAVING COUNT(*) > 2;

-- Check for sparse columns (requires analyzing specific table)
SELECT 
    'Run this for each wide table:' as instruction,
    CONCAT('SELECT COUNT(*) as total_rows, ',
           'SUM(CASE WHEN column_name IS NULL THEN 1 ELSE 0 END) as null_count ',
           'FROM table_name;') as query_template;

-- SQL Server: find suspiciously wide tables
SELECT 
    t.name AS table_name,
    COUNT(*) as column_count
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE t.type = 'U'
GROUP BY t.name
HAVING COUNT(*) > 30
ORDER BY column_count DESC;

-- Detect repeated column patterns
SELECT 
    t.name AS table_name,
    STRING_AGG(c.name, ', ') as suspicious_columns
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE t.type = 'U'
AND c.name LIKE '%[0-9]'
GROUP BY t.name
HAVING COUNT(*) > 2;
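
The keyword list at the end of this entry also mentions Oracle; a comparable sketch there, assuming access to the USER_TAB_COLUMNS dictionary view for the current schema, might look like this:

-- Oracle: find suspiciously wide tables
SELECT table_name,
       COUNT(*) AS column_count
FROM user_tab_columns
GROUP BY table_name
HAVING COUNT(*) > 30
ORDER BY column_count DESC;

-- Oracle: detect repeated column patterns
SELECT table_name,
       LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_name) AS suspicious_columns
FROM user_tab_columns
WHERE REGEXP_LIKE(column_name, '[0-9]+$')  -- Ends with number
GROUP BY table_name
HAVING COUNT(*) > 2;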

Prevention & Mitigation Best Practices

  1. One Entity, One Table: Each table represents exactly one business concept
  2. Relationship Recognition: Use separate tables for 1-to-many relationships
  3. Column Naming Discipline: Numbered columns (field1, field2) indicate design issues
  4. NULL Analysis: Many NULLs suggest mixed entity types
  5. Incremental Refactoring: Extract cohesive column groups into new tables (see the sketch after this list)
  6. Domain Modeling: Define clear boundaries between entities
  7. Code Review Focus: Question every column addition to wide tables
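
A minimal, PostgreSQL-flavored sketch of step 5, assuming the repeated phone columns from the god-table example later in this entry (run inside a transaction and migrate readers before dropping anything):

-- 1. Create the new, cohesive table
CREATE TABLE customer_phones (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(id),
    phone_type VARCHAR(10) NOT NULL,
    phone_number VARCHAR(20) NOT NULL
);

-- 2. Backfill from the repeated columns, one pass per numbered slot
INSERT INTO customer_phones (customer_id, phone_type, phone_number)
SELECT id, COALESCE(phone1_type, 'other'), phone1 FROM customers WHERE phone1 IS NOT NULL
UNION ALL
SELECT id, COALESCE(phone2_type, 'other'), phone2 FROM customers WHERE phone2 IS NOT NULL
UNION ALL
SELECT id, COALESCE(phone3_type, 'other'), phone3 FROM customers WHERE phone3 IS NOT NULL;

-- 3. Only after all readers use customer_phones, drop the repeated columns
ALTER TABLE customers
    DROP COLUMN phone1, DROP COLUMN phone1_type,
    DROP COLUMN phone2, DROP COLUMN phone2_type,
    DROP COLUMN phone3, DROP COLUMN phone3_type;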

Real World Examples

-- God table trying to handle all customer data
CREATE TABLE customers (
    id BIGINT PRIMARY KEY,
    -- Basic info
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(255),
    
    -- Multiple phones (anti-pattern)
    phone1 VARCHAR(20),
    phone1_type VARCHAR(10),
    phone2 VARCHAR(20),
    phone2_type VARCHAR(10),
    phone3 VARCHAR(20),
    phone3_type VARCHAR(10),
    
    -- Multiple addresses (anti-pattern)
    home_address VARCHAR(500),
    home_city VARCHAR(100),
    home_state VARCHAR(2),
    home_zip VARCHAR(10),
    work_address VARCHAR(500),
    work_city VARCHAR(100),
    work_state VARCHAR(2),
    work_zip VARCHAR(10),
    
    -- Mixed concerns (preferences)
    pref_language VARCHAR(10),
    pref_currency VARCHAR(3),
    pref_timezone VARCHAR(50),
    pref_newsletter BOOLEAN,
    pref_sms BOOLEAN,
    
    -- Order history (wrong place)
    first_order_date DATE,
    last_order_date DATE,
    total_orders INT,
    total_spent DECIMAL(10,2),
    average_order_value DECIMAL(10,2),
    
    -- And 80 more columns...
);
-- Generic "data" table anti-pattern
CREATE TABLE entity_data (
    id BIGINT PRIMARY KEY,
    entity_type VARCHAR(50), -- 'customer', 'product', 'order'
    entity_id BIGINT,
    
    -- Generic columns for any data type
    string_val1 VARCHAR(255),
    string_val2 VARCHAR(255),
    string_val3 TEXT,
    int_val1 INT,
    int_val2 INT,
    decimal_val1 DECIMAL(15,2),
    decimal_val2 DECIMAL(15,2),
    date_val1 DATE,
    date_val2 DATE,
    json_data JSON,
    
    -- Metadata
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
-- Queries become nightmares of conditional logic
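
A sketch of that conditional logic, reading the hypothetical entity_data table above; the meaning of string_val1 and decimal_val1 exists only by convention, which is exactly the problem:

SELECT
    entity_type,
    entity_id,
    CASE
        WHEN entity_type = 'customer' THEN string_val1  -- customer name, by convention
        WHEN entity_type = 'product'  THEN string_val1  -- product name, by convention
        ELSE string_val2
    END AS display_name,
    CASE
        WHEN entity_type = 'product' THEN decimal_val1  -- unit price, by convention
        WHEN entity_type = 'order'   THEN decimal_val1  -- order total, by convention
    END AS amount
FROM entity_data
WHERE entity_type IN ('customer', 'product', 'order');
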
-- Properly normalized schema
CREATE TABLE customers (
    id BIGINT PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE customer_phones (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(id),
    phone_type VARCHAR(20) NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    is_primary BOOLEAN DEFAULT FALSE,
    UNIQUE(customer_id, phone_type)
);

CREATE TABLE customer_addresses (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(id),
    address_type VARCHAR(20) NOT NULL,
    street_address VARCHAR(255) NOT NULL,
    city VARCHAR(100) NOT NULL,
    state_province VARCHAR(100),
    postal_code VARCHAR(20),
    country_code CHAR(2) NOT NULL,
    is_primary BOOLEAN DEFAULT FALSE,
    UNIQUE(customer_id, address_type)
);

CREATE TABLE customer_preferences (
    customer_id BIGINT PRIMARY KEY REFERENCES customers(id),
    language_code VARCHAR(10) DEFAULT 'en',
    currency_code VARCHAR(3) DEFAULT 'USD',
    timezone VARCHAR(50) DEFAULT 'UTC',
    newsletter_enabled BOOLEAN DEFAULT TRUE,
    sms_enabled BOOLEAN DEFAULT FALSE
);

-- Order summary as materialized view, not denormalized columns
CREATE MATERIALIZED VIEW customer_order_stats AS
SELECT 
    c.id as customer_id,
    MIN(o.created_at) as first_order_date,
    MAX(o.created_at) as last_order_date,
    COUNT(o.id) as total_orders,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as average_order_value
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id;
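
A design note on the view above: in PostgreSQL a materialized view is a snapshot and must be refreshed explicitly; refreshing CONCURRENTLY (so readers are not blocked) additionally requires a unique index, for example:

-- One-time setup so the view can be refreshed without locking out readers
CREATE UNIQUE INDEX ON customer_order_stats (customer_id);

-- Run on a schedule or after bulk order changes
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_order_stats;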

AI Coding Guidance/Prompt

Prompt: "When adding columns to existing tables:"
Rules:
  - Flag tables with >50 columns for review
  - Reject numbered column names (e.g., address1, address2)
  - Suggest separate tables for optional column groups
  - Warn about mixing different entity types
  
Example:
  # Bad: God table with everything
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255),
    phone1 VARCHAR(20),
    phone2 VARCHAR(20),
    phone3 VARCHAR(20),
    home_address VARCHAR(500),
    work_address VARCHAR(500),
    other_address VARCHAR(500),
    preference_theme VARCHAR(50),
    preference_language VARCHAR(10),
    preference_notifications BOOLEAN,
    last_order_id BIGINT,
    last_order_date DATE,
    total_spent DECIMAL(10,2),
    -- ... 100 more columns
  );
  
  # Good: Properly separated entities
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) NOT NULL UNIQUE
  );
  
  CREATE TABLE user_phones (
    user_id BIGINT REFERENCES users(id),
    phone_type VARCHAR(20),
    phone_number VARCHAR(20),
    PRIMARY KEY (user_id, phone_type)
  );
  
  CREATE TABLE user_addresses (
    id BIGINT PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),
    address_type VARCHAR(20),
    street_address TEXT,
    -- proper address fields
  );

Relevant Keywords

Under-normalization (god tables). Symptoms: slow queries, data inconsistency, constraint violations. Preventive: schema validation, constraint enforcement, proper typing. Tech stack: PostgreSQL, MySQL, SQL Server, Oracle. Industry: all industries, enterprise, SaaS.

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts: some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework for understanding the far-reaching consequences of schema and data decisions that impact security and reliability and accumulate into ethical debt affecting real human lives. Built on research from real-world incidents, ADI uncovered 7 Principles that help identify when these decisions are being made and how to make them better, avoiding the catastrophic "butterfly effects" of small choices that ripple into chaotic technical and ethical debt.