Release & rollback

Standardize release steps, health checks, dashboards, and rollback playbooks—covering canary vs full paths, data migrations, and comms cadence. Suited for ops-oriented agents and on-call runbooks.

The SKILL should list prerequisites (migration scripts, feature flags, dependency versions), verification steps (probes, sampled traffic, control metrics), and rollback order on failure; irreversible steps must name human approval points.

Tie to observability: name core metrics, log keywords, and alert silencing policy; document release windows and code freeze so automation never jumps to 100% unattended at the wrong time.

Pipeline (gates → deploy → verify)

  [ merge / tag / pick artifact ]
        │
        ▼
  ┌─────────────┐     Gates: CI green, image signing, config diff, change ticket
  │  Pre-gates   │──── optional: human approval, maintenance window check
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Deploy: blue/green / rolling / canary (match policy)
  │  Progressive │──── feature flags: safe default → ramp by % or allowlist
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Verify: probes, SLOs, error rate, queue lag, sample cases
  │  Observe     │──── pass: advance; fail: rollback or mitigate per playbook
  └─────────────┘

Principle: “Deploy succeeded” ≠ “business healthy.” After probes pass, observe a stability window (team-defined) before widening traffic or calling the release done.

# Release checklist skeleton (paste into PR description or runbook)
## Pre-release
- [ ] Feature flags configured: new features default OFF in prod
- [ ] DB migrations reviewed: backward compatible, tested on staging
- [ ] Dependencies updated and security scan passed
- [ ] Load test run on staging for traffic-sensitive changes
- [ ] Rollback procedure documented and tested

## Deploy
- [ ] Monitor error rate during rollout (threshold: < 0.5%)
- [ ] Monitor P99 latency (threshold: < 800ms)
- [ ] Business health probes passing (checkout, login, key APIs)
- [ ] On-call engineer notified and available

## Post-release
- [ ] Smoke test passed in production
- [ ] Metrics stable for 30 min observation window
- [ ] Release notes published and stakeholders notified
- [ ] Any temporary monitoring/alerts cleaned up

Prerequisites & gates

Database: migration order, reversibility, schema/data consistency on rollback.
Config & secrets: env vars, KMS/secret refs, per-environment diffs reviewed.
Dependencies: downstream API versions, messaging contracts, cache keyspace compatibility.
Capacity: expected QPS, pools, batch windows vs peak.

Split app deploys from “irreversible migration” deploys into independently reversible steps; the skill should say which layer to stop at when a step fails.

#!/bin/bash
# Pre-release gate checks
set -euo pipefail
echo "=== Pre-release gate checks ==="

# 1. Required environment variables
for var in DATABASE_URL REDIS_URL API_SECRET_KEY; do
  [[ -n "${!var:-}" ]] && echo "  OK  $var is set" \
    || { echo "FAIL $var is not set" >&2; exit 1; }
done

# 2. Secret access (AWS Secrets Manager)
aws secretsmanager get-secret-value \
  --secret-id prod/myapp/db-password \
  --query SecretString --output text > /dev/null \
  && echo "  OK  secrets accessible" \
  || { echo "FAIL cannot access secrets" >&2; exit 1; }

# 3. Downstream API health checks
for svc in "https://payment-svc.internal/health" "https://auth-svc.internal/health"; do
  status=$(curl -sf -o /dev/null -w "%{http_code}" "$svc")
  [[ "$status" == "200" ]] && echo "  OK  $svc" \
    || { echo "FAIL $svc returned $status" >&2; exit 1; }
done

echo "=== All pre-release gates passed ==="

Health checks & observability

Probes & synthetic

HTTP/TCP readiness & liveness thresholds
Critical-path synthetics (login, checkout, payment callbacks, …)
Queue/topic lag and DLQ counts

Metrics & logs

Error rate, latency percentiles, saturation (CPU/mem/connections)
Business KPIs (conversion, job success rate)
Trace/log keywords tied to the release for fast search

#!/bin/bash
# Post-deploy health checks
set -euo pipefail
BASE="https://api.example.com"
echo "=== Post-deploy health checks ==="

# Readiness probe
status=$(curl -sf -o /dev/null -w "%{http_code}" "$BASE/healthz/ready")
[[ "$status" == "200" ]] && echo "  OK  readiness probe" \
  || { echo "FAIL readiness probe returned $status" >&2; exit 1; }

# Liveness probe
status=$(curl -sf -o /dev/null -w "%{http_code}" "$BASE/healthz/live")
[[ "$status" == "200" ]] && echo "  OK  liveness probe" \
  || { echo "FAIL liveness probe returned $status" >&2; exit 1; }

# Error rate (Prometheus)
err_rate=$(curl -sf \
  "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~\"5..\"}[5m])/rate(http_requests_total[5m])" \
  | jq -r '.data.result[0].value[1] // "0"')
echo "  Current 5xx error rate: $err_rate (threshold 0.005)"

echo "=== Health checks PASSED ==="

Grayscale, canary & full

Provide separate checklists for grayscale/canary vs full rollout—avoid unattended jumps to 100%.

Grayscale: instance share or traffic weight, dwell time per stage, auto vs manual promotion.
Canary: compare to a control on the same dashboard; anomalies auto-revert or trip breakers.
Full: name who declares “done” and when external comms go out (if applicable).

# Nginx canary weight configuration
# Step 1: 5% to canary
split_clients "${remote_addr}" $app_variant {
    5%   canary;
    *    stable;
}

# Step 2: After 10 min, increase to 20%
split_clients "${remote_addr}" $app_variant {
    20%  canary;
    *    stable;
}
# Reload: sudo nginx -t && sudo nginx -s reload

# Argo Rollouts progressive delivery
# Deploy new image and start canary
kubectl argo rollouts set image my-rollout \
  my-container=myrepo/myapp:v1.4.2 -n production

# Check rollout status
kubectl argo rollouts status my-rollout -n production

# Manually promote to next step (if auto-pause is configured)
kubectl argo rollouts promote my-rollout -n production

# Abort and rollback on failure
kubectl argo rollouts abort my-rollout -n production

Rollback order & data impact

  [ trigger: SLO breach / fatal defect / human stop ]
        │
        ▼
  ┌─────────────┐     First mitigate: throttle, flip flags, disable batch jobs
  │  Traffic     │
  └─────────────┘
        │
        ▼
  ┌─────────────┐     App: roll back to last good artifact (or blue/green swap)
  │  Artifact    │──── Data: if migration irreversible → compensation playbook, not only “deploy old binary”
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Verify: probes + core business samples + alert recovery
  │  Post-RB     │──── Comms: internal channel + status page (if external)
  └─────────────┘

Whether app rollback needs migration rollback; if irreversible, mark approval and data-fix path.
Cache & search indexes: rebuild vs TTL expiry—avoid old code reading new data shapes.

#!/bin/bash
# Rollback execution script
set -euo pipefail
PREV_VERSION="${1:?Usage: rollback.sh <previous-version>}"
NAMESPACE="production"
DEPLOYMENT="myapp"

echo "=== Initiating rollback to $PREV_VERSION ==="

# 1. Rollback Kubernetes deployment
kubectl set image deployment/$DEPLOYMENT \
  $DEPLOYMENT=myrepo/myapp:$PREV_VERSION -n $NAMESPACE

# 2. Wait for rollback to complete
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --timeout=5m

# 3. Verify health after rollback
sleep 30
status=$(curl -sf -o /dev/null -w "%{http_code}" https://api.example.com/healthz/ready)
[[ "$status" == "200" ]] && echo "  OK  service healthy after rollback" \
  || echo "WARNING: health check returned $status - investigate"

echo "=== Rollback to $PREV_VERSION completed ==="

# Slack rollback notification template (JSON)
{
  "text": ":rotating_light: *[ROLLBACK]* Production rollback in progress",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Service:* myapp\n*Rolled back to:* v1.4.1\n*Reason:* Error rate spike (3.2% > 0.5% threshold)\n*Engineer:* @on-call\n*Status:* Rolling back..."
      }
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Runbook:* <https://wiki.example.com/rollback-runbook|Rollback Runbook>\n*Dashboard:* <https://grafana.example.com/d/release|Release Dashboard>"
      }
    }
  ]
}

Comms, windows & freeze

Internal: change summary, blast radius, on-call & escalation, expected rollback time.
External: status page timing, maintenance templates (if applicable).
Freeze: no-auto-merge / no-auto-publish windows and exception flow (hotfix).

Runbook snippet lab

Pick a strategy and check what this change touches to generate a Markdown snippet for tickets or agent context; the release name is used in the title—adjust to your repo convention.

Release strategy

This change touches (check to add checklist items)

Database migration Feature flags Static assets / CDN Queues / consumers Cache keys / invalidation Secrets / config

Release name or version (title)

Output is a skeleton—add concrete thresholds, owners, ticket links, and rollback commands. For “full” strategy, the SKILL must include an explicit human confirmation step.

---
name: release-and-rollback
description: Standardized release pipeline, health gates, rollback runbook, and communications
version: 2.0
---
# Pre-release gates
- Verify: required env vars set, secrets accessible, downstream APIs healthy
- DB migrations validated on production-like snapshot
- Feature flags: new features default OFF in production
- Checklist signed off before deploy starts

# Deploy and observe
- Use canary or rolling: never big-bang to 100% in one step
- Monitor: error rate < 0.5%, P99 < 800ms, business probes green
- Hold at each stage for at least 10 min before advancing
- Keep rollback command ready throughout observation window

# Rollback procedure
- Trigger: error rate > threshold OR P99 > threshold OR business probe failure
- Execute: kubectl rollout undo OR set image to previous version
- Notify: post to Slack/PagerDuty within 2 min of decision
- Retain failed version artifacts for post-mortem

# Post-release
- Smoke test in production after stable period
- Update status page and notify stakeholders
- File incident report if rollback was triggered
- Schedule retro within 48h of any significant incident

Back to skills More skills