SWE-3002

JSON Blob Misuse in Relational Database

Cantorian Technical Debt Magnitude: ℵ₀ (Countable)

Description

Storing significant structured data as JSON blobs inside relational database tables instead of using proper columns and relationships. While modern databases support JSON, overuse transforms the relational database into a slow document store without proper indexing, constraints, or query optimization capabilities.

Illustrative Cantor Point

The Cantor Point occurs when choosing to store structured data as JSON "to avoid schema migrations." This convenience-driven decision creates an opaque data structure that becomes increasingly difficult to query, validate, and maintain as the JSON schemas proliferate.
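
A minimal sketch of the two paths at this Cantor Point (PostgreSQL syntax), assuming a hypothetical users table with a jsonb settings column and a new notification_frequency preference; all names are illustrative:

-- Path A: avoid the migration by stuffing the new field into the existing blob.
-- No schema change today, but the field is untyped, unconstrained, and unindexed.
UPDATE users
SET settings = jsonb_set(settings, '{notification_frequency}', '"daily"')
WHERE id = 42;

-- Path B: accept a small migration now.
-- The field gets a type, a default, and a constraint, and it can be indexed later.
ALTER TABLE users
    ADD COLUMN notification_frequency VARCHAR(10) NOT NULL DEFAULT 'daily'
    CHECK (notification_frequency IN ('daily', 'weekly', 'off'));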

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • SaaS Platform: User preferences JSON blob grew to contain 50+ keys, making preference queries require full table scans
  • E-commerce Site: Product attributes in JSON made "filter by color" queries impossibly slow
  • Analytics System: Event data in JSON blobs reached 100+ different schemas, becoming unmaintainable

Common Consequences & Impacts

Technical Impacts

  • Poor query performance (full table scans over JSON)
  • No database-level data integrity enforcement (see the sketch after this list)
  • Difficult indexing strategies
  • Complex application-level validation
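
A short illustration of the integrity gap (PostgreSQL syntax), assuming a users table that has both a real theme column and a legacy jsonb settings blob; constraint names are illustrative:

-- With a real column, the database enforces integrity:
ALTER TABLE users
    ADD CONSTRAINT chk_theme CHECK (theme IN ('light', 'dark'));

-- With a JSON blob, nothing stops bad writes:
UPDATE users SET settings = '{"theme": "drak"}';   -- typo accepted
UPDATE users SET settings = '{"theme": 42}';       -- wrong type accepted
UPDATE users SET settings = '{}';                  -- key silently missing

-- The closest database-side guard is an expression CHECK,
-- which only covers the keys you remember to list:
ALTER TABLE users
    ADD CONSTRAINT chk_settings_theme
    CHECK (settings->>'theme' IS NULL OR settings->>'theme' IN ('light', 'dark'));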

Human/Ethical Impacts

  • User data integrity compromised
  • Developer frustration with query complexity
  • Hidden bugs in unvalidated data
  • Technical debt impeding progress

Business Impacts

  • Slow feature development
  • Poor user experience from slow queries
  • Data quality issues
  • Difficult reporting and analytics

Recovery Difficulty & Escalation

Recovery Difficulty: 7.5
Escalation: 6

ADI Principles & Axioms Violated

  • Principle of Invisible Decay: Schema chaos hidden inside JSON blobs
  • Principle of Performative Integrity: Appears flexible but performs poorly
  • Principle of Emergent Transformation: Convenience transforms into complexity

Detection / 60-Second Audit

-- PostgreSQL

-- Find tables with JSON columns
SELECT 
    table_name,
    column_name,
    data_type,
    pg_size_pretty(pg_relation_size(format('%I.%I', table_schema, table_name)::regclass)) as table_size
FROM information_schema.columns
WHERE table_schema = 'public'
AND data_type IN ('json', 'jsonb')
ORDER BY pg_relation_size(format('%I.%I', table_schema, table_name)::regclass) DESC;

-- Sample JSON keys to understand structure
SELECT 
    key,
    COUNT(*) as usage_count
FROM users, LATERAL jsonb_object_keys(settings) AS key
WHERE settings IS NOT NULL
GROUP BY key
ORDER BY usage_count DESC
LIMIT 20;

-- Check query performance on JSON fields
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM users 
WHERE settings->>'theme' = 'dark';

-- MySQL

-- Find tables with JSON columns
SELECT 
    c.table_name,
    c.column_name,
    c.data_type,
    (SELECT t.TABLE_ROWS
     FROM information_schema.tables t
     WHERE t.table_name = c.table_name
       AND t.table_schema = c.table_schema) as row_count_estimate
FROM information_schema.columns c
WHERE c.table_schema = DATABASE()
AND c.data_type = 'json';

-- Check JSON search performance
EXPLAIN
SELECT COUNT(*) FROM users
WHERE JSON_EXTRACT(settings, '$.theme') = 'dark';

-- Analyze JSON key usage
SELECT 
    CAST(JSON_KEYS(settings) AS CHAR) as keys_list,
    COUNT(*) as occurrence
FROM users
WHERE settings IS NOT NULL
GROUP BY keys_list
ORDER BY occurrence DESC
LIMIT 10;

-- SQL Server

-- Find tables with JSON data (stored as NVARCHAR)
SELECT 
    t.name as table_name,
    c.name as column_name,
    TYPE_NAME(c.user_type_id) as data_type,
    p.rows as row_count
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
WHERE (c.name LIKE '%json%' 
    OR c.name LIKE '%settings%'
    OR c.name LIKE '%data%')
AND TYPE_NAME(c.user_type_id) IN ('nvarchar', 'varchar')
ORDER BY p.rows DESC;

-- Test JSON query performance
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT COUNT(*) 
FROM users
WHERE JSON_VALUE(settings, '$.theme') = 'dark';
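
One more quick check worth running (PostgreSQL, assuming a products table with a jsonb attributes column as in the examples later in this entry): look for the same key stored with different value types, a common symptom once many writers share one blob.

-- PostgreSQL: spot inconsistent value types for a single JSON key
SELECT
    jsonb_typeof(attributes->'price') AS price_type,
    COUNT(*) AS row_count
FROM products
WHERE attributes ? 'price'
GROUP BY price_type
ORDER BY row_count DESC;
-- Healthy data shows one type; "number" mixed with "string" means
-- every consumer needs defensive casting.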

Prevention & Mitigation Best Practices

  1. Structured First: Use columns for queryable data
  2. JSON for Truly Dynamic: Reserve for genuinely unpredictable data
  3. Schema Validation: Enforce JSON schema at application level
  4. Strategic Indexing: Use functional indexes on JSON keys (see the sketch after this list)
  5. Regular Extraction: Promote common JSON fields to columns
  6. Document Stores: Consider MongoDB for document-centric data
  7. Hybrid Approach: Columns for queries, JSON for display-only data
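
A minimal indexing sketch for points 4 and 7 above (PostgreSQL syntax; index names are illustrative, users.settings matches the detection queries earlier and products.extended_attributes matches the remediated schema below):

-- Expression index on a hot JSON key, for queries that must stay on the blob
CREATE INDEX idx_users_settings_theme ON users ((settings->>'theme'));

SELECT COUNT(*) FROM users WHERE settings->>'theme' = 'dark';  -- now index-assisted

-- A GIN index keeps the residual blob searchable for occasional
-- containment queries without promoting every key to a column
CREATE INDEX idx_products_ext_attrs ON products USING GIN (extended_attributes);

SELECT id FROM products
WHERE extended_attributes @> '{"material": "cotton"}';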

Real World Examples

-- Example 1: SaaS user preferences blob (MySQL syntax)
-- Started simple...
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255),
    preferences JSON
);

-- Preferences grew wild:
-- {"theme": "dark", "language": "en", "notifications": {...}, 
--  "privacy": {...}, "display": {...}, "features": {...}}

-- Now every preference check is a full table scan:
SELECT * FROM users 
WHERE JSON_EXTRACT(preferences, '$.notifications.email') = CAST('true' AS JSON);
-- Result: 5-second queries on 100k users

-- Example 2: E-commerce "flexible" schema (PostgreSQL)
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    sku VARCHAR(50),
    attributes JSONB  -- Color, size, material, brand, etc.
);

-- Filtering becomes slow and unreliable:
SELECT * FROM products
WHERE attributes->>'color' = 'red'
AND (attributes->>'size' = 'M' OR attributes->>'size' = 'Medium')
AND CAST(attributes->>'price' AS DECIMAL) < 50;
-- No usable B-tree index without expression indexes; inconsistent
-- formats ('M' vs 'Medium', price stored as a string) break filters

-- Remediation: extract queryable fields to columns (PostgreSQL)
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    theme VARCHAR(20) DEFAULT 'light',
    language VARCHAR(5) DEFAULT 'en',
    notifications_email BOOLEAN DEFAULT true,
    notifications_push BOOLEAN DEFAULT false,
    -- Non-queryable display preferences stay in JSON
    display_preferences JSONB
);

-- Proper indexes for common queries
CREATE INDEX idx_users_theme ON users(theme);
CREATE INDEX idx_users_notifications ON users(notifications_email, notifications_push);

-- Products with structured attributes
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    sku VARCHAR(50) UNIQUE,
    color VARCHAR(50),
    size VARCHAR(10),
    price DECIMAL(10,2),
    -- Only truly dynamic attributes in JSON
    extended_attributes JSONB
);

CREATE INDEX idx_products_filters ON products(color, size, price);

-- Now queries are fast:
SELECT * FROM products
WHERE color = 'red' 
AND size IN ('M', 'Medium')
AND price < 50;
-- Uses index, returns in milliseconds
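
Getting from the blob to the structured schema is itself a small project. A sketch of one incremental path (PostgreSQL, assuming a legacy users table with a jsonb settings blob as in the detection queries above; steps and names are illustrative):

-- 1. Add the column alongside the blob
ALTER TABLE users ADD COLUMN theme VARCHAR(20);

-- 2. Backfill from the existing JSON (batch this on large tables)
UPDATE users
SET theme = COALESCE(settings->>'theme', 'light')
WHERE theme IS NULL;

-- 3. Enforce integrity once the backfill is complete
ALTER TABLE users ALTER COLUMN theme SET DEFAULT 'light';
ALTER TABLE users ALTER COLUMN theme SET NOT NULL;

-- 4. Index the column, switch readers to it, then drop the redundant key
CREATE INDEX IF NOT EXISTS idx_users_theme ON users(theme);
UPDATE users SET settings = settings - 'theme';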

AI Coding Guidance/Prompt

Prompt: "When considering JSON storage in relational database:"
Rules:
  - Reject JSON for data that will be queried
  - Require justification for JSON columns
  - Suggest columns for any searchable fields
  - Flag JSON with >10 keys for review
  
Example:
  -- Bad: Everything in JSON
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    data JSONB  -- Contains name, email, preferences, settings, history...
  );
  
  -- Slow query:
  SELECT * FROM users WHERE data->>'email' = 'user@example.com';
  
  -- Good: Structured core data, JSON for truly dynamic
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(100) NOT NULL,
    preferences JSONB  -- Only for display preferences
  );
  
  -- Fast query with index:
  CREATE INDEX idx_users_email ON users(email);
  SELECT * FROM users WHERE email = 'user@example.com';
  
  -- If needed, functional index on JSON:
  CREATE INDEX idx_users_theme ON users((preferences->>'theme'));

Relevant Keywords

Keywords: json blob misuse, relational database, index performance, query optimization
Symptoms: slow queries, data inconsistency, constraint violations
Preventive: schema validation, constraint enforcement, proper typing
Tech stack: PostgreSQL, MySQL, SQL Server, Oracle
Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts - some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework to understanding the far-reaching consequences of schema and data decisions that impact security and reliability, and accumulate into ethical debt that affects real human lives. Built on research from real-world incidents, ADI uncovered 7 Principles to identify when these decisions are being made, and how to make them better, to avoid future technical debt and potentially catastrophic "butterfly effects" of small decisions that ripple into chaotic technical and ethical debt.