Database migration

Have agents turn DDL, backfills, and release ordering into an auditable multi-phase plan—reducing table locks and risky dual-write windows.

Prefer expand–contract for large tables: add column → backfill → switch reads → switch writes → drop old column; the SKILL must list expected locks, row estimates, and whether work runs in a maintenance window.

Every migration ships with down/rollback notes—not only reverse SQL, but how app versions and feature flags coordinate rollback.

Align with your ORM/migration tool: filenames, dependency order, and whether long statements run inside a transaction—encode team rules in the prompt.

  • Indexes: prefer CONCURRENTLY (if supported) or phased builds to avoid blocking writes.
  • Defaults: for large NOT NULL + default, backfill in stages before tightening the constraint.
  • Observation: row counts before/after, sampled checks, slow-query alert thresholds.

Expand–contract flow

  [ DDL: add column (nullable or safe default) ]
                    │
                    ▼
         [ Backfill: batched, rate-limited, retryable jobs ]
                    │
                    ▼
    [ Switch reads: dual-write or read new column; keep old ]
                    │
                    ▼
         [ Switch writes: app writes new path only; watch errors ]
                    │
                    ▼
  [ Tighten: NOT NULL / UNIQUE / drop old column ]

Document per phase rollback: e.g. revert app reads while keeping the new column—avoid undocumented half-states.

-- Phase 1: Add nullable column (no table lock, PostgreSQL)
ALTER TABLE orders ADD COLUMN status_v2 VARCHAR(32) DEFAULT NULL;

-- Phase 2: Batch backfill to avoid long-running row locks
DO $$
DECLARE
  batch_size INT := 1000;
  updated    INT;
BEGIN
  LOOP
    UPDATE orders
    SET    status_v2 = status::VARCHAR     -- map from old column
    WHERE  status_v2 IS NULL
    LIMIT  batch_size;

    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    PERFORM pg_sleep(0.05);               -- rate limit: 50ms interval
  END LOOP;
END $$;

-- Progress check: count remaining unbackfilled rows
SELECT COUNT(*) AS remaining
FROM   orders
WHERE  status_v2 IS NULL;

-- Phase 3: Tighten constraint, then drop old column
-- Verify no NULLs first (otherwise NOT NULL will error)
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
-- After new code is deployed reading/writing status_v2:
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status;
-- Flyway SQL migration file: V20240315_01__add_order_status_v2.sql
-- (filename format: V{version}__{description}.sql)

-- up migration
ALTER TABLE orders ADD COLUMN IF NOT EXISTS status_v2 VARCHAR(32);

-- Liquibase YAML format (db/changelog/2024/03/add_status_v2.yaml)
-- databaseChangeLog:
--   - changeSet:
--       id: "2024031501"
--       author: "dev"
--       changes:
--         - addColumn:
--             tableName: orders
--             columns:
--               - column:
--                   name: status_v2
--                   type: VARCHAR(32)
--       rollback:
--         - dropColumn:
--             tableName: orders
--             columnName: status_v2
# Alembic migration file (Python / SQLAlchemy)
# alembic/versions/2024031501_add_order_status_v2.py

from alembic import op
import sqlalchemy as sa

revision = '2024031501'
down_revision = '2024030101'

def upgrade():
    op.add_column('orders',
        sa.Column('status_v2', sa.String(32), nullable=True))

def downgrade():
    op.drop_column('orders', 'status_v2')

Rollback pipeline

  [ Trigger: error rate, constraint failure, business tripwire ]
                    │
                    ▼
         [ App: disable feature flag / roll back deploy ]
                    │
                    ▼
    [ Data: down migration or reversible DDL steps ]
                    │
                    ▼
         [ Verify: row counts, sampling, read-only probes / shadow reads ]
                    │
                    ▼
  [ Record: incident ticket, manual repair or re-run backfill? ]

Test down scripts symmetrically with up; irreversible steps (e.g. DROP without backup) must be flagged for snapshot or deferred execution with explicit approval.

-- Down migration example (symmetric with up)
-- Rollback order: restore old column, then drop new column

-- undo Phase 3: restore old column (assumes not yet dropped)
ALTER TABLE orders ADD COLUMN IF NOT EXISTS status VARCHAR(32);
-- Backfill old column (reverse mapping)
UPDATE orders SET status = status_v2 WHERE status IS NULL;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Then drop new column
ALTER TABLE orders DROP COLUMN status_v2;

-- CI verification script (bash): test full up + down round trip
#!/usr/bin/env bash
set -euo pipefail

DB_URL="${TEST_DATABASE_URL:-postgres://localhost/test_db}"

echo "=== Running up migration ==="
flyway -url="$DB_URL" migrate

echo "=== Verify row count after up ==="
ROW_COUNT=$(psql "$DB_URL" -tAc "SELECT COUNT(*) FROM orders WHERE status_v2 IS NOT NULL")
echo "Rows backfilled: $ROW_COUNT"

echo "=== Running down migration ==="
flyway -url="$DB_URL" undo

echo "=== Verify rollback ==="
COLUMN_EXISTS=$(psql "$DB_URL" -tAc   "SELECT COUNT(*) FROM information_schema.columns    WHERE table_name='orders' AND column_name='status_v2'")
[ "$COLUMN_EXISTS" -eq 0 ] && echo "Rollback verified: column removed"   || { echo "FAIL: column still exists"; exit 1; }

Locks & maintenance windows

  • List expected lock types (table, metadata, row-lock escalation) and estimated hold time.
  • Capture deadlock risk when long transactions, bulk UPDATE, and DDL hit the same table.
  • For off-hours/maintenance runs, note who is notified and when to abort.
-- Build large table index with CONCURRENTLY to avoid blocking writes (PostgreSQL)
-- Note: cannot be run inside a transaction block
CREATE INDEX CONCURRENTLY idx_orders_status_v2
    ON orders (status_v2)
    WHERE status_v2 IS NOT NULL;

-- Verify index creation status (is_valid=false means concurrent build failed—clean up and rebuild)
SELECT indexname, indexdef, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM   pg_indexes
WHERE  tablename = 'orders' AND indexname = 'idx_orders_status_v2';

-- Check for long-running transactions blocking migration (over 5 minutes)
SELECT pid, now() - xact_start AS duration, query
FROM   pg_stat_activity
WHERE  state != 'idle'
  AND  xact_start IS NOT NULL
  AND  now() - xact_start > interval '5 minutes'
ORDER  BY duration DESC;

ORM / migration conventions

  • File naming, version or timestamp ordering, dependency chain (no cycles).
  • Whether the whole migration runs in one transaction; execution order when splitting long statements across files.
  • CI: dry-run migrations against a shadow DB before merge; match production engine version.
# flyway.conf — team conventions example
flyway.url=jdbc:postgresql://localhost:5432/payments
flyway.locations=filesystem:db/migrations
flyway.validateOnMigrate=true
flyway.outOfOrder=false           # forbid out-of-order execution
flyway.baselineOnMigrate=false    # don't auto-baseline new databases
# File naming: V{YYYYMMDD}{seq}__{description}.sql
# Example: V20240315_01__add_order_status_v2.sql

# CI dry-run (validate only, no execution)
flyway -url=jdbc:postgresql://shadow-db/payments        -locations=filesystem:db/migrations        validate

Indexes, defaults, NOT NULL

  • Large-table indexes: CONCURRENTLY / ONLINE or build on replica then cut over (per engine).
  • NOT NULL: nullable + backfill + validate, then ALTER to tighten.
  • Defaults: avoid one step that locks the whole table; separate backfill from constraint enforcement in the plan.
-- PostgreSQL: safely add NOT NULL column in multiple steps
-- Anti-pattern (locks entire table for rewrite):
-- ALTER TABLE orders ADD COLUMN priority INT NOT NULL DEFAULT 0;

-- Step 1: Add nullable column (instant, no lock)
ALTER TABLE orders ADD COLUMN priority INT;

-- Step 2: Set column-level default (PG 11+ metadata operation, no lock)
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0;

-- Step 3: Backfill existing rows (in batches)
UPDATE orders SET priority = 0 WHERE priority IS NULL;

-- Step 4: Add NOT VALID constraint (validates new writes only, no history scan)
ALTER TABLE orders ADD CONSTRAINT orders_priority_notnull
    CHECK (priority IS NOT NULL) NOT VALID;

-- Step 5: Asynchronously validate history rows (does not block writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_priority_notnull;

-- Step 6: Finally tighten to true NOT NULL (PG 12+: fast if check constraint passes)
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;

Observation & validation

  • Row counts before/after, checksums or sampling; p95 for critical business queries.
  • Slow-query and replication-lag thresholds; pause backfill or scale out when exceeded.
  • Audit log: who ran which migration id and how it links to the release ticket.
-- Pre/post migration row count and sampling verification (PostgreSQL + psql)
-- Record baseline before migration
SELECT 'pre-migration' AS phase,
       COUNT(*)        AS total_rows,
       COUNT(status)   AS old_col_rows,
       NOW()           AS recorded_at
FROM   orders;

-- Validate after migration
SELECT 'post-migration' AS phase,
       COUNT(*)         AS total_rows,
       COUNT(status_v2) AS new_col_rows,
       SUM(CASE WHEN status = status_v2 THEN 1 ELSE 0 END) AS matched,
       NOW()            AS recorded_at
FROM   orders
TABLESAMPLE BERNOULLI(1);   -- 1% sample for fast large-table check

-- Replication lag monitor (pause backfill if over 30s)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT
    AS replication_lag_secs;

-- Slow query detection (monitor during migration)
SELECT query, calls, mean_exec_time, total_exec_time
FROM   pg_stat_statements
WHERE  mean_exec_time > 1000    -- queries over 1 second
ORDER  BY mean_exec_time DESC
LIMIT  10;
---
name: database-migration-plan
description: Draft zero-downtime, rollback-ready phased DB migration plans
tags: [database, migration, postgresql, zero-downtime]
---
# Expand-Contract Pattern
- Phase 1: ALTER TABLE ADD COLUMN (nullable, no default table lock)
- Phase 2: Batch backfill UPDATE ... LIMIT 1000 WHERE new_col IS NULL + rate limit
- Phase 3: Verify row counts → NOT NULL constraint → switch reads/writes → DROP old column

# Migration Tool Formats
- Flyway: V{YYYYMMDD}{seq}__{desc}.sql (validateOnMigrate=true)
- Liquibase: changeSet id + symmetric rollback block
- Alembic: upgrade() / downgrade() functions with complete revision chain

# Large Table Safety
- Indexes: CREATE INDEX CONCURRENTLY (not inside transaction block)
- NOT NULL: NOT VALID CHECK → VALIDATE → SET NOT NULL
- Batching: pg_sleep rate limit + replication lag monitor + pause on threshold

# Rollback Implementation
- down migration tested symmetrically with up; irreversible steps need snapshot approval
- CI verification: flyway migrate → row count check → flyway undo → column existence check

# Observation
- Pre/post: COUNT(*) + TABLESAMPLE 1% sampling comparison
- Slow queries: pg_stat_statements mean_exec_time > 1s alert
- Replication lag: pg_last_xact_replay_timestamp() > 30s pause backfill

All skills More skills