SWE-3002

JSON Blob Misuse in Relational Database

Cantorian Technical Debt Magnitude: ℵ₀ (Countable)

Description

Storing significant structured data as JSON blobs inside relational database tables instead of using proper columns and relationships. While modern databases support JSON, overuse transforms the relational database into a slow document store without proper indexing, constraints, or query optimization capabilities.

Illustrative Cantor Point

The Cantor Point occurs when choosing to store structured data as JSON "to avoid schema migrations." This convenience-driven decision creates an opaque data structure that becomes increasingly difficult to query, validate, and maintain as the JSON schemas proliferate.
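
A minimal sketch of the two paths at this Cantor Point (PostgreSQL syntax), assuming a hypothetical users table with a jsonb settings column and a new notification_frequency preference; all names are illustrative:

-- Path A: avoid the migration by stuffing the new field into the existing blob.
-- No schema change today, but the field is untyped, unconstrained, and unindexed.
UPDATE users
SET settings = jsonb_set(settings, '{notification_frequency}', '"daily"')
WHERE id = 42;

-- Path B: accept a small migration now.
-- The field gets a type, a default, and a constraint, and it can be indexed later.
ALTER TABLE users
    ADD COLUMN notification_frequency VARCHAR(10) NOT NULL DEFAULT 'daily'
    CHECK (notification_frequency IN ('daily', 'weekly', 'off'));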

Categories: CP-Schema, CP-API, CP-Process

Real-World Examples / Observed In

  • SaaS Platform: User preferences JSON blob grew to contain 50+ keys, making preference queries require full table scans
  • E-commerce Site: Product attributes in JSON made "filter by color" queries impossibly slow
  • Analytics System: Event data in JSON blobs reached 100+ different schemas, becoming unmaintainable

Common Consequences & Impacts

Technical Impacts

  • Poor query performance (full table scans over JSON)
  • No database-level data integrity enforcement (see the sketch after this list)
  • Difficult indexing strategies
  • Complex application-level validation
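
A short illustration of the integrity gap (PostgreSQL syntax), assuming a users table that has both a real theme column and a legacy jsonb settings blob; constraint names are illustrative:

-- With a real column, the database enforces integrity:
ALTER TABLE users
    ADD CONSTRAINT chk_theme CHECK (theme IN ('light', 'dark'));

-- With a JSON blob, nothing stops bad writes:
UPDATE users SET settings = '{"theme": "drak"}';   -- typo accepted
UPDATE users SET settings = '{"theme": 42}';       -- wrong type accepted
UPDATE users SET settings = '{}';                  -- key silently missing

-- The closest database-side guard is an expression CHECK,
-- which only covers the keys you remember to list:
ALTER TABLE users
    ADD CONSTRAINT chk_settings_theme
    CHECK (settings->>'theme' IS NULL OR settings->>'theme' IN ('light', 'dark'));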

Human/Ethical Impacts

  • User data integrity compromised
  • Developer frustration with query complexity
  • Hidden bugs in unvalidated data
  • Technical debt impeding progress

Business Impacts

  • Slow feature development
  • Poor user experience from slow queries
  • Data quality issues
  • Difficult reporting and analytics

Recovery Difficulty & Escalation

Recovery Difficulty: 7.5
Escalation: 6

ADI Principles & Axioms Violated

  • Principle of Invisible Decay: Schema chaos hidden inside JSON blobs
  • Principle of Performative Integrity: Appears flexible but performs poorly
  • Principle of Emergent Transformation: Convenience transforms into complexity

Detection / 60-Second Audit

-- PostgreSQL

-- Find tables with JSON columns
SELECT 
    table_name,
    column_name,
    data_type,
    pg_size_pretty(pg_relation_size(format('%I.%I', table_schema, table_name)::regclass)) as table_size
FROM information_schema.columns
WHERE table_schema = 'public'
AND data_type IN ('json', 'jsonb')
ORDER BY pg_relation_size(format('%I.%I', table_schema, table_name)::regclass) DESC;

-- Sample JSON keys to understand structure
SELECT 
    key,
    COUNT(*) as usage_count
FROM users, LATERAL jsonb_object_keys(settings) AS key
WHERE settings IS NOT NULL
GROUP BY key
ORDER BY usage_count DESC
LIMIT 20;

-- Check query performance on JSON fields
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM users 
WHERE settings->>'theme' = 'dark';

-- MySQL

-- Find tables with JSON columns
SELECT 
    c.table_name,
    c.column_name,
    c.data_type,
    (SELECT t.TABLE_ROWS
     FROM information_schema.tables t
     WHERE t.table_name = c.table_name
       AND t.table_schema = c.table_schema) as row_count_estimate
FROM information_schema.columns c
WHERE c.table_schema = DATABASE()
AND c.data_type = 'json';

-- Check JSON search performance
EXPLAIN
SELECT COUNT(*) FROM users
WHERE JSON_EXTRACT(settings, '$.theme') = 'dark';

-- Analyze JSON key usage
SELECT 
    CAST(JSON_KEYS(settings) AS CHAR) as keys_list,
    COUNT(*) as occurrence
FROM users
WHERE settings IS NOT NULL
GROUP BY keys_list
ORDER BY occurrence DESC
LIMIT 10;

-- SQL Server

-- Find tables with JSON data (stored as NVARCHAR)
SELECT 
    t.name as table_name,
    c.name as column_name,
    TYPE_NAME(c.user_type_id) as data_type,
    p.rows as row_count
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
JOIN sys.partitions p ON t.object_id = p.object_id AND p.index_id IN (0, 1)
WHERE (c.name LIKE '%json%' 
    OR c.name LIKE '%settings%'
    OR c.name LIKE '%data%')
AND TYPE_NAME(c.user_type_id) IN ('nvarchar', 'varchar')
ORDER BY p.rows DESC;

-- Test JSON query performance
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT COUNT(*) 
FROM users
WHERE JSON_VALUE(settings, '$.theme') = 'dark';
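
One more quick check worth running (PostgreSQL, assuming a products table with a jsonb attributes column as in the examples later in this entry): look for the same key stored with different value types, a common symptom once many writers share one blob.

-- PostgreSQL: spot inconsistent value types for a single JSON key
SELECT
    jsonb_typeof(attributes->'price') AS price_type,
    COUNT(*) AS row_count
FROM products
WHERE attributes ? 'price'
GROUP BY price_type
ORDER BY row_count DESC;
-- Healthy data shows one type; "number" mixed with "string" means
-- every consumer needs defensive casting.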

Prevention & Mitigation Best Practices

  1. Structured First: Use columns for queryable data
  2. JSON for Truly Dynamic: Reserve for genuinely unpredictable data
  3. Schema Validation: Enforce JSON schema at application level
  4. Strategic Indexing: Use functional indexes on JSON keys (see the sketch after this list)
  5. Regular Extraction: Promote common JSON fields to columns
  6. Document Stores: Consider MongoDB for document-centric data
  7. Hybrid Approach: Columns for queries, JSON for display-only data
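
A minimal indexing sketch for points 4 and 7 above (PostgreSQL syntax; index names are illustrative, users.settings matches the detection queries earlier and products.extended_attributes matches the remediated schema below):

-- Expression index on a hot JSON key, for queries that must stay on the blob
CREATE INDEX idx_users_settings_theme ON users ((settings->>'theme'));

SELECT COUNT(*) FROM users WHERE settings->>'theme' = 'dark';  -- now index-assisted

-- A GIN index keeps the residual blob searchable for occasional
-- containment queries without promoting every key to a column
CREATE INDEX idx_products_ext_attrs ON products USING GIN (extended_attributes);

SELECT id FROM products
WHERE extended_attributes @> '{"material": "cotton"}';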

Real World Examples

-- Example 1: SaaS user preferences blob (MySQL syntax)
-- Started simple...
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255),
    preferences JSON
);

-- Preferences grew wild:
-- {"theme": "dark", "language": "en", "notifications": {...}, 
--  "privacy": {...}, "display": {...}, "features": {...}}

-- Now every preference check is a full table scan:
SELECT * FROM users 
WHERE JSON_EXTRACT(preferences, '$.notifications.email') = CAST('true' AS JSON);
-- Result: 5-second queries on 100k users

-- Example 2: E-commerce "flexible" schema (PostgreSQL)
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    sku VARCHAR(50),
    attributes JSONB  -- Color, size, material, brand, etc.
);

-- Filtering becomes slow and unreliable:
SELECT * FROM products
WHERE attributes->>'color' = 'red'
AND (attributes->>'size' = 'M' OR attributes->>'size' = 'Medium')
AND CAST(attributes->>'price' AS DECIMAL) < 50;
-- No usable B-tree index without expression indexes; inconsistent
-- formats ('M' vs 'Medium', price stored as a string) break filters

-- Remediation: extract queryable fields to columns (PostgreSQL)
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    theme VARCHAR(20) DEFAULT 'light',
    language VARCHAR(5) DEFAULT 'en',
    notifications_email BOOLEAN DEFAULT true,
    notifications_push BOOLEAN DEFAULT false,
    -- Non-queryable display preferences stay in JSON
    display_preferences JSONB
);

-- Proper indexes for common queries
CREATE INDEX idx_users_theme ON users(theme);
CREATE INDEX idx_users_notifications ON users(notifications_email, notifications_push);

-- Products with structured attributes
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    sku VARCHAR(50) UNIQUE,
    color VARCHAR(50),
    size VARCHAR(10),
    price DECIMAL(10,2),
    -- Only truly dynamic attributes in JSON
    extended_attributes JSONB
);

CREATE INDEX idx_products_filters ON products(color, size, price);

-- Now queries are fast:
SELECT * FROM products
WHERE color = 'red' 
AND size IN ('M', 'Medium')
AND price < 50;
-- Uses index, returns in milliseconds
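
Getting from the blob to the structured schema is itself a small project. A sketch of one incremental path (PostgreSQL, assuming a legacy users table with a jsonb settings blob as in the detection queries above; steps and names are illustrative):

-- 1. Add the column alongside the blob
ALTER TABLE users ADD COLUMN theme VARCHAR(20);

-- 2. Backfill from the existing JSON (batch this on large tables)
UPDATE users
SET theme = COALESCE(settings->>'theme', 'light')
WHERE theme IS NULL;

-- 3. Enforce integrity once the backfill is complete
ALTER TABLE users ALTER COLUMN theme SET DEFAULT 'light';
ALTER TABLE users ALTER COLUMN theme SET NOT NULL;

-- 4. Index the column, switch readers to it, then drop the redundant key
CREATE INDEX IF NOT EXISTS idx_users_theme ON users(theme);
UPDATE users SET settings = settings - 'theme';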

AI Coding Guidance/Prompt

Prompt: "When considering JSON storage in relational database:"
Rules:
  - Reject JSON for data that will be queried
  - Require justification for JSON columns
  - Suggest columns for any searchable fields
  - Flag JSON with >10 keys for review
  
Example:
  -- Bad: Everything in JSON
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    data JSONB  -- Contains name, email, preferences, settings, history...
  );
  
  -- Slow query:
  SELECT * FROM users WHERE data->>'email' = 'user@example.com';
  
  -- Good: Structured core data, JSON for truly dynamic
  CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(100) NOT NULL,
    preferences JSONB  -- Only for display preferences
  );
  
  -- Fast query with index:
  CREATE INDEX idx_users_email ON users(email);
  SELECT * FROM users WHERE email = 'user@example.com';
  
  -- If needed, functional index on JSON:
  CREATE INDEX idx_users_theme ON users((preferences->>'theme'));

Relevant Keywords

Keywords: json blob misuse, relational database, index performance, query optimization
Symptoms: slow queries, data inconsistency, constraint violations
Preventive: schema validation, constraint enforcement, proper typing
Tech stack: PostgreSQL, MySQL, SQL Server, Oracle
Industry: all industries, enterprise, SaaS

Related Patterns

The Cantorian Technical Debt Magnitude scale gives developers an intuitive sense of magnitude beyond simple hour counts - some debts aren't just larger in scale, but qualitatively different in their complexity.

Cantor Points are critical decision junctures—or even moments of non-decision—where seemingly small choices can create drastically divergent futures for a system's integrity, security, and evolvability. These are the "forks in the road" where one path might lead to manageable complexity, while another veers towards systemic entanglement or even chaos. They often appear trivial at the time but can set in motion irreversible or costly-to-reverse consequences.

Applied Data Integrity (ADI) is a framework to understanding the far-reaching consequences of schema and data decisions that impact security and reliability, and accumulate into ethical debt that affects real human lives. Built on research from real-world incidents, ADI uncovered 7 Principles to identify when these decisions are being made, and how to make them better, to avoid future technical debt and potentially catastrophic "butterfly effects" of small decisions that ripple into chaotic technical and ethical debt.