Database migration
Have agents turn DDL, backfills, and release ordering into an auditable multi-phase plan—reducing table locks and risky dual-write windows.
Prefer expand–contract for large tables: add column → backfill → switch reads → switch writes → drop old column; the SKILL must list expected locks, row estimates, and whether work runs in a maintenance window.
Every migration ships with down/rollback notes—not only reverse SQL, but how app versions and feature flags coordinate rollback.
Align with your ORM/migration tool: filenames, dependency order, and whether long statements run inside a transaction—encode team rules in the prompt.
- Indexes: prefer CONCURRENTLY (if supported) or phased builds to avoid blocking writes.
- Defaults: for large NOT NULL + default, backfill in stages before tightening the constraint.
- Observation: row counts before/after, sampled checks, slow-query alert thresholds.
Expand–contract flow
[ DDL: add column (nullable or safe default) ]
│
▼
[ Backfill: batched, rate-limited, retryable jobs ]
│
▼
[ Switch reads: dual-write or read new column; keep old ]
│
▼
[ Switch writes: app writes new path only; watch errors ]
│
▼
[ Tighten: NOT NULL / UNIQUE / drop old column ]
Document per phase rollback: e.g. revert app reads while keeping the new column—avoid undocumented half-states.
-- Phase 1: Add nullable column (no table lock, PostgreSQL)
ALTER TABLE orders ADD COLUMN status_v2 VARCHAR(32) DEFAULT NULL;
-- Phase 2: Batch backfill to avoid long-running row locks
DO $$
DECLARE
batch_size INT := 1000;
updated INT;
BEGIN
LOOP
UPDATE orders
SET status_v2 = status::VARCHAR -- map from old column
WHERE status_v2 IS NULL
LIMIT batch_size;
GET DIAGNOSTICS updated = ROW_COUNT;
EXIT WHEN updated = 0;
PERFORM pg_sleep(0.05); -- rate limit: 50ms interval
END LOOP;
END $$;
-- Progress check: count remaining unbackfilled rows
SELECT COUNT(*) AS remaining
FROM orders
WHERE status_v2 IS NULL;
-- Phase 3: Tighten constraint, then drop old column
-- Verify no NULLs first (otherwise NOT NULL will error)
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
-- After new code is deployed reading/writing status_v2:
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status;
-- Flyway SQL migration file: V20240315_01__add_order_status_v2.sql
-- (filename format: V{version}__{description}.sql)
-- up migration
ALTER TABLE orders ADD COLUMN IF NOT EXISTS status_v2 VARCHAR(32);
-- Liquibase YAML format (db/changelog/2024/03/add_status_v2.yaml)
-- databaseChangeLog:
-- - changeSet:
-- id: "2024031501"
-- author: "dev"
-- changes:
-- - addColumn:
-- tableName: orders
-- columns:
-- - column:
-- name: status_v2
-- type: VARCHAR(32)
-- rollback:
-- - dropColumn:
-- tableName: orders
-- columnName: status_v2
# Alembic migration file (Python / SQLAlchemy)
# alembic/versions/2024031501_add_order_status_v2.py
from alembic import op
import sqlalchemy as sa
revision = '2024031501'
down_revision = '2024030101'
def upgrade():
op.add_column('orders',
sa.Column('status_v2', sa.String(32), nullable=True))
def downgrade():
op.drop_column('orders', 'status_v2')
Rollback pipeline
[ Trigger: error rate, constraint failure, business tripwire ]
│
▼
[ App: disable feature flag / roll back deploy ]
│
▼
[ Data: down migration or reversible DDL steps ]
│
▼
[ Verify: row counts, sampling, read-only probes / shadow reads ]
│
▼
[ Record: incident ticket, manual repair or re-run backfill? ]
Test down scripts symmetrically with up; irreversible steps (e.g. DROP without backup) must be flagged for snapshot or deferred execution with explicit approval.
-- Down migration example (symmetric with up)
-- Rollback order: restore old column, then drop new column
-- undo Phase 3: restore old column (assumes not yet dropped)
ALTER TABLE orders ADD COLUMN IF NOT EXISTS status VARCHAR(32);
-- Backfill old column (reverse mapping)
UPDATE orders SET status = status_v2 WHERE status IS NULL;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
-- Then drop new column
ALTER TABLE orders DROP COLUMN status_v2;
-- CI verification script (bash): test full up + down round trip
#!/usr/bin/env bash
set -euo pipefail
DB_URL="${TEST_DATABASE_URL:-postgres://localhost/test_db}"
echo "=== Running up migration ==="
flyway -url="$DB_URL" migrate
echo "=== Verify row count after up ==="
ROW_COUNT=$(psql "$DB_URL" -tAc "SELECT COUNT(*) FROM orders WHERE status_v2 IS NOT NULL")
echo "Rows backfilled: $ROW_COUNT"
echo "=== Running down migration ==="
flyway -url="$DB_URL" undo
echo "=== Verify rollback ==="
COLUMN_EXISTS=$(psql "$DB_URL" -tAc "SELECT COUNT(*) FROM information_schema.columns WHERE table_name='orders' AND column_name='status_v2'")
[ "$COLUMN_EXISTS" -eq 0 ] && echo "Rollback verified: column removed" || { echo "FAIL: column still exists"; exit 1; }
Locks & maintenance windows
- List expected lock types (table, metadata, row-lock escalation) and estimated hold time.
- Capture deadlock risk when long transactions, bulk UPDATE, and DDL hit the same table.
- For off-hours/maintenance runs, note who is notified and when to abort.
-- Build large table index with CONCURRENTLY to avoid blocking writes (PostgreSQL)
-- Note: cannot be run inside a transaction block
CREATE INDEX CONCURRENTLY idx_orders_status_v2
ON orders (status_v2)
WHERE status_v2 IS NOT NULL;
-- Verify index creation status (is_valid=false means concurrent build failed—clean up and rebuild)
SELECT indexname, indexdef, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'orders' AND indexname = 'idx_orders_status_v2';
-- Check for long-running transactions blocking migration (over 5 minutes)
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
AND xact_start IS NOT NULL
AND now() - xact_start > interval '5 minutes'
ORDER BY duration DESC;
ORM / migration conventions
- File naming, version or timestamp ordering, dependency chain (no cycles).
- Whether the whole migration runs in one transaction; execution order when splitting long statements across files.
- CI: dry-run migrations against a shadow DB before merge; match production engine version.
# flyway.conf — team conventions example
flyway.url=jdbc:postgresql://localhost:5432/payments
flyway.locations=filesystem:db/migrations
flyway.validateOnMigrate=true
flyway.outOfOrder=false # forbid out-of-order execution
flyway.baselineOnMigrate=false # don't auto-baseline new databases
# File naming: V{YYYYMMDD}{seq}__{description}.sql
# Example: V20240315_01__add_order_status_v2.sql
# CI dry-run (validate only, no execution)
flyway -url=jdbc:postgresql://shadow-db/payments -locations=filesystem:db/migrations validate
Indexes, defaults, NOT NULL
- Large-table indexes: CONCURRENTLY / ONLINE or build on replica then cut over (per engine).
- NOT NULL: nullable + backfill + validate, then ALTER to tighten.
- Defaults: avoid one step that locks the whole table; separate backfill from constraint enforcement in the plan.
-- PostgreSQL: safely add NOT NULL column in multiple steps
-- Anti-pattern (locks entire table for rewrite):
-- ALTER TABLE orders ADD COLUMN priority INT NOT NULL DEFAULT 0;
-- Step 1: Add nullable column (instant, no lock)
ALTER TABLE orders ADD COLUMN priority INT;
-- Step 2: Set column-level default (PG 11+ metadata operation, no lock)
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0;
-- Step 3: Backfill existing rows (in batches)
UPDATE orders SET priority = 0 WHERE priority IS NULL;
-- Step 4: Add NOT VALID constraint (validates new writes only, no history scan)
ALTER TABLE orders ADD CONSTRAINT orders_priority_notnull
CHECK (priority IS NOT NULL) NOT VALID;
-- Step 5: Asynchronously validate history rows (does not block writes)
ALTER TABLE orders VALIDATE CONSTRAINT orders_priority_notnull;
-- Step 6: Finally tighten to true NOT NULL (PG 12+: fast if check constraint passes)
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;
Observation & validation
- Row counts before/after, checksums or sampling; p95 for critical business queries.
- Slow-query and replication-lag thresholds; pause backfill or scale out when exceeded.
- Audit log: who ran which migration id and how it links to the release ticket.
-- Pre/post migration row count and sampling verification (PostgreSQL + psql)
-- Record baseline before migration
SELECT 'pre-migration' AS phase,
COUNT(*) AS total_rows,
COUNT(status) AS old_col_rows,
NOW() AS recorded_at
FROM orders;
-- Validate after migration
SELECT 'post-migration' AS phase,
COUNT(*) AS total_rows,
COUNT(status_v2) AS new_col_rows,
SUM(CASE WHEN status = status_v2 THEN 1 ELSE 0 END) AS matched,
NOW() AS recorded_at
FROM orders
TABLESAMPLE BERNOULLI(1); -- 1% sample for fast large-table check
-- Replication lag monitor (pause backfill if over 30s)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT
AS replication_lag_secs;
-- Slow query detection (monitor during migration)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- queries over 1 second
ORDER BY mean_exec_time DESC
LIMIT 10;
---
name: database-migration-plan
description: Draft zero-downtime, rollback-ready phased DB migration plans
tags: [database, migration, postgresql, zero-downtime]
---
# Expand-Contract Pattern
- Phase 1: ALTER TABLE ADD COLUMN (nullable, no default table lock)
- Phase 2: Batch backfill UPDATE ... LIMIT 1000 WHERE new_col IS NULL + rate limit
- Phase 3: Verify row counts → NOT NULL constraint → switch reads/writes → DROP old column
# Migration Tool Formats
- Flyway: V{YYYYMMDD}{seq}__{desc}.sql (validateOnMigrate=true)
- Liquibase: changeSet id + symmetric rollback block
- Alembic: upgrade() / downgrade() functions with complete revision chain
# Large Table Safety
- Indexes: CREATE INDEX CONCURRENTLY (not inside transaction block)
- NOT NULL: NOT VALID CHECK → VALIDATE → SET NOT NULL
- Batching: pg_sleep rate limit + replication lag monitor + pause on threshold
# Rollback Implementation
- down migration tested symmetrically with up; irreversible steps need snapshot approval
- CI verification: flyway migrate → row count check → flyway undo → column existence check
# Observation
- Pre/post: COUNT(*) + TABLESAMPLE 1% sampling comparison
- Slow queries: pg_stat_statements mean_exec_time > 1s alert
- Replication lag: pg_last_xact_replay_timestamp() > 30s pause backfill