Blue-green deployment

Run blue (current) and green (candidate) as equivalent stacks; flip traffic via load balancer or DNS. Good for agent-generated checklists and forward-compatible schema notes.

The skill should cover: deploy new version to green, health checks and smoke tests, switch traffic, watch metrics in the observation window, and on failure roll back to blue while keeping green for diagnosis.

Database and cache changes are often the bottleneck: document expand–contract, whether migrations run before cutover, dual-write or read-replica strategies—avoid both colors writing incompatible schemas.

Main flow (skill-flow-block)

  [ Blue: live traffic │ Green: candidate ]
        │
        ▼
  ┌─────────────┐     Image / config / secrets aligned with blue; readiness green
  │ Deploy green│
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Smoke, contract tests, read-only shadow (optional)
  │ Validate    │──── DB: migration order, dual-write window, read replicas
  └─────────────┘
        │
        ▼
  ┌─────────────┐     LB weights / DNS TTL / gateway routes; drain before raising
  │ Switch      │──── Stateful: stickiness, WebSocket, long-poll policies
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Errors, P99, business probes; miss SLO → roll back blue
  │ Observe     │──── Keep green: logs and repro, do not tear down yet
  └─────────────┘

Each runbook step should be one click or one command; rollback names owners and approval—avoid unaudited “verbal” cutbacks.

Traffic switch

Switching points ingress from blue to green—atomically or gradually—via target group changes, API gateway routes, or DNS (watch TTL and client caches).

Instant: good for stateless HTTP; pair with health checks and bad-instance removal.
Gradual: weights or regions in waves, then 100% after metrics look good—smaller blast radius.
Before switch, confirm config and secrets match across stacks; define “successful switch” numerically (errors, P99, business probes).

# AWS ALB blue-green traffic switch (AWS CLI example)

# 1. Check current target group weights
aws elbv2 describe-rules \
  --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789:listener/... \
  --query 'Rules[?Priority==`1`].Actions'

# 2. Route 10% traffic to green target group (gradual switch)
aws elbv2 modify-rule \
  --rule-arn arn:aws:elasticloadbalancing:us-east-1:...:rule/... \
  --actions '[
    {"Type":"forward","ForwardConfig":{"TargetGroups":[
      {"TargetGroupArn":"arn:...blue-tg","Weight":90},
      {"TargetGroupArn":"arn:...green-tg","Weight":10}
    ]}}
  ]'

# 3. After 5-minute observation window, full cutover (blue=0, green=100)
aws elbv2 modify-rule \
  --rule-arn arn:aws:elasticloadbalancing:us-east-1:...:rule/... \
  --actions '[
    {"Type":"forward","ForwardConfig":{"TargetGroups":[
      {"TargetGroupArn":"arn:...blue-tg","Weight":0},
      {"TargetGroupArn":"arn:...green-tg","Weight":100}
    ]}}
  ]'

# 4. Rollback (on failure: blue=100, green=0)
# Adjust Weight values and re-run modify-rule

Drain and stateful workloads

Drain: before deregistering from the LB or scaling down, stop accepting new connections and let in-flight requests and long-lived connections finish within a time limit—avoid a burst of 502s or aborted transactions.

HTTP: Connection: close, failing readiness, graceful shutdown timeout.
WebSocket / long-lived: migration at the protocol, broadcast reconnect, or stickiness on old color until drain completes.
Session stickiness: new sessions land on green; old sessions may still be served by blue during drain—document max wait and forced disconnect.

# Nginx upstream blue-green switch configuration example
# /etc/nginx/conf.d/upstream.conf

upstream app_backend {
    # Blue-green switch: toggle the commented line and run: nginx -s reload
    server blue.internal:8080;   # blue (current production)
    # server green.internal:8080;  # green (new version, uncomment to switch)

    keepalive 64;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    # Drain config: mark blue as down during drain period
    # upstream app_backend { server blue.internal:8080 down; server green.internal:8080; }

    location / {
        proxy_pass         http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
        proxy_read_timeout 120s;  # drain timeout: allow up to 120s for in-flight requests
    }
}
# Zero-downtime hot reload: sudo nginx -t && sudo nginx -s reload

Data and schema compatibility

Use expand–contract: add columns/tables (backward compatible) → dual code versions → backfill → drop old columns. If migrations run after cutover, schemas readable/writable from either color must stay compatible.

Cache: keyspace version prefixes, TTL and thundering herds; avoid divergent semantics per color on the same key.
Queues: consumer versions, retry and DLQ; breaking payload changes need dual consumers or isolated topics.

-- Database Expand-Contract migration pattern (PostgreSQL example)

-- Phase 1: Expand -- add new column (backward compatible; both blue and green can read/write)
ALTER TABLE users ADD COLUMN phone_normalized VARCHAR(20);
CREATE INDEX CONCURRENTLY idx_users_phone_normalized ON users(phone_normalized);
-- Blue (old code) ignores new column; green (new code) writes to new column

-- Phase 2: Backfill data (run after cutover while green is live)
UPDATE users
SET phone_normalized = normalize_phone(phone)
WHERE phone_normalized IS NULL
  AND phone IS NOT NULL;

-- Phase 3: Verify backfill completeness
SELECT COUNT(*) FROM users WHERE phone IS NOT NULL AND phone_normalized IS NULL;
-- Result must be 0 before proceeding to next phase

-- Phase 4: Contract -- drop old column (next release, after green is fully rolled out)
-- ALTER TABLE users DROP COLUMN phone;  -- danger: confirm blue is fully retired first

Observation, success criteria, rollback

During the window compare SLOs across colors or before/after; business probes (checkout, login) beat raw HTTP 200.
Rollback must be one-click in the runbook; keep green for config and log diff.
Post-switch retro: actual drain duration, long-connection handling, DB locks and migrations impacting cutover.

#!/bin/bash
# Smoke test script: verify core paths on green after cutover
set -euo pipefail
BASE="https://api.example.com"

echo "=== Smoke Test: Core functionality check after cutover ==="

# 1. Version check (confirm traffic is hitting the new version)
version=$(curl -sf "$BASE/api/version" | jq -r '.version')
[[ "$version" == "v1.4.2" ]] && echo "  OK  version=$version" \
  || { echo "FAIL expected v1.4.2, got $version" >&2; exit 1; }

# 2. Login flow
token=$(curl -sf -X POST "$BASE/auth/token" \
  -H 'Content-Type: application/json' \
  -d '{"username":"smoke-test@example.com","password":"'$SMOKE_TEST_PASS'"}' \
  | jq -r '.access_token')
[[ -n "$token" ]] && echo "  OK  login token acquired" || { echo "FAIL login failed" >&2; exit 1; }

# 3. Core business endpoint
curl -sf "$BASE/api/v1/products?limit=5" \
  -H "Authorization: Bearer $token" | jq -e '.items | length > 0' \
  && echo "  OK  product list endpoint" || { echo "FAIL product endpoint error" >&2; exit 1; }

# 4. Error rate check (Prometheus)
err_rate=$(curl -sf "http://prometheus:9090/api/v1/query?query=\
rate(http_requests_total{status=~\"5..\"}[5m])/rate(http_requests_total[5m])" \
  | jq -r '.data.result[0].value[1]')
echo "  Current 5xx error rate: $err_rate (threshold 0.01)"

echo "=== Smoke Test PASSED ==="

Switch notes generator

Pick entry plane and strategy to paste “traffic switch notes” into a runbook (not executable); replace console/API names and resource IDs for your stack.

Traffic entry (switch plane)

Load balancer (target group / pool weights) API gateway / Ingress (routes, canary plugins) DNS (A/AAAA/CNAME, TTL and caches)

Switch shape

One-shot (0→100%) Gradual (weights or region waves)

Drain and long-lived connections

Drain old color before switch: no new connections, wait for in-flight WebSocket / SSE / long-poll (document reconnect + stickiness) HTTP session stickiness (cookie / client IP)

Drain timeout (seconds)

Output is a checklist only; real commands depend on cloud CLIs/consoles. DNS changes stack with client and CDN caches; gateway canaries need metric sources and rollback route precedence confirmed.

---
name: blue-green-deploy
description: Blue-green dual-environment cutover, smoke test, and data compatibility strategy
version: 2.0
---
# Deploy green environment
- Image version, config, and secrets aligned with blue
- kubectl rollout status deployment/myapp-green -n production
- Readiness probes healthy: curl -sf https://green.internal/healthz/ready

# Database migration (before cutover)
- Expand: ALTER TABLE to add new columns (backward compatible)
- Verify both blue and green can read/write new schema
- Backfill data: UPDATE ... WHERE new_col IS NULL

# Traffic switch
- AWS ALB: modify-rule to ramp green-tg weight 0→10→100
- Nginx: update upstream weight then nginx -s reload
- DNS: lower TTL to 60s before switching A record

# Validation (smoke test)
- Version check: curl $BASE/api/version | jq .version
- Business flows: login → query → checkout sampling
- Error rate < 0.5%, P99 < 800ms, observe for 15 min

# Rollback
- ALB weight blue=100 green=0 (or revert Nginx upstream)
- Retain green environment for log and config diff
- Record actual drain duration and long-connection impact

All skills More skills