Release & rollback
Standardize release steps, health checks, dashboards, and rollback playbooks—covering canary vs full paths, data migrations, and comms cadence. Suited for ops-oriented agents and on-call runbooks.
The SKILL should list prerequisites (migration scripts, feature flags, dependency versions), verification steps (probes, sampled traffic, control metrics), and rollback order on failure; irreversible steps must name human approval points.
Tie to observability: name core metrics, log keywords, and alert silencing policy; document release windows and code freeze so automation never jumps to 100% unattended at the wrong time.
Pipeline (gates → deploy → verify)
[ merge / tag / pick artifact ]
│
▼
┌─────────────┐ Gates: CI green, image signing, config diff, change ticket
│ Pre-gates │──── optional: human approval, maintenance window check
└─────────────┘
│
▼
┌─────────────┐ Deploy: blue/green / rolling / canary (match policy)
│ Progressive │──── feature flags: safe default → ramp by % or allowlist
└─────────────┘
│
▼
┌─────────────┐ Verify: probes, SLOs, error rate, queue lag, sample cases
│ Observe │──── pass: advance; fail: rollback or mitigate per playbook
└─────────────┘
Principle: “Deploy succeeded” ≠ “business healthy.” After probes pass, observe a stability window (team-defined) before widening traffic or calling the release done.
# Release checklist skeleton (paste into PR description or runbook)
## Pre-release
- [ ] Feature flags configured: new features default OFF in prod
- [ ] DB migrations reviewed: backward compatible, tested on staging
- [ ] Dependencies updated and security scan passed
- [ ] Load test run on staging for traffic-sensitive changes
- [ ] Rollback procedure documented and tested
## Deploy
- [ ] Monitor error rate during rollout (threshold: < 0.5%)
- [ ] Monitor P99 latency (threshold: < 800ms)
- [ ] Business health probes passing (checkout, login, key APIs)
- [ ] On-call engineer notified and available
## Post-release
- [ ] Smoke test passed in production
- [ ] Metrics stable for 30 min observation window
- [ ] Release notes published and stakeholders notified
- [ ] Any temporary monitoring/alerts cleaned up
Prerequisites & gates
- Database: migration order, reversibility, schema/data consistency on rollback.
- Config & secrets: env vars, KMS/secret refs, per-environment diffs reviewed.
- Dependencies: downstream API versions, messaging contracts, cache keyspace compatibility.
- Capacity: expected QPS, pools, batch windows vs peak.
Split app deploys from “irreversible migration” deploys into independently reversible steps; the skill should say which layer to stop at when a step fails.
#!/bin/bash
# Pre-release gate checks
set -euo pipefail
echo "=== Pre-release gate checks ==="
# 1. Required environment variables
for var in DATABASE_URL REDIS_URL API_SECRET_KEY; do
[[ -n "${!var:-}" ]] && echo " OK $var is set" \
|| { echo "FAIL $var is not set" >&2; exit 1; }
done
# 2. Secret access (AWS Secrets Manager)
aws secretsmanager get-secret-value \
--secret-id prod/myapp/db-password \
--query SecretString --output text > /dev/null \
&& echo " OK secrets accessible" \
|| { echo "FAIL cannot access secrets" >&2; exit 1; }
# 3. Downstream API health checks
for svc in "https://payment-svc.internal/health" "https://auth-svc.internal/health"; do
status=$(curl -sf -o /dev/null -w "%{http_code}" "$svc")
[[ "$status" == "200" ]] && echo " OK $svc" \
|| { echo "FAIL $svc returned $status" >&2; exit 1; }
done
echo "=== All pre-release gates passed ==="
Health checks & observability
Probes & synthetic
- HTTP/TCP readiness & liveness thresholds
- Critical-path synthetics (login, checkout, payment callbacks, …)
- Queue/topic lag and DLQ counts
Metrics & logs
- Error rate, latency percentiles, saturation (CPU/mem/connections)
- Business KPIs (conversion, job success rate)
- Trace/log keywords tied to the release for fast search
#!/bin/bash
# Post-deploy health checks
set -euo pipefail
BASE="https://api.example.com"
echo "=== Post-deploy health checks ==="
# Readiness probe
status=$(curl -sf -o /dev/null -w "%{http_code}" "$BASE/healthz/ready")
[[ "$status" == "200" ]] && echo " OK readiness probe" \
|| { echo "FAIL readiness probe returned $status" >&2; exit 1; }
# Liveness probe
status=$(curl -sf -o /dev/null -w "%{http_code}" "$BASE/healthz/live")
[[ "$status" == "200" ]] && echo " OK liveness probe" \
|| { echo "FAIL liveness probe returned $status" >&2; exit 1; }
# Error rate (Prometheus)
err_rate=$(curl -sf \
"http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~\"5..\"}[5m])/rate(http_requests_total[5m])" \
| jq -r '.data.result[0].value[1] // "0"')
echo " Current 5xx error rate: $err_rate (threshold 0.005)"
echo "=== Health checks PASSED ==="
Grayscale, canary & full
Provide separate checklists for grayscale/canary vs full rollout—avoid unattended jumps to 100%.
- Grayscale: instance share or traffic weight, dwell time per stage, auto vs manual promotion.
- Canary: compare to a control on the same dashboard; anomalies auto-revert or trip breakers.
- Full: name who declares “done” and when external comms go out (if applicable).
# Nginx canary weight configuration
# Step 1: 5% to canary
split_clients "${remote_addr}" $app_variant {
5% canary;
* stable;
}
# Step 2: After 10 min, increase to 20%
split_clients "${remote_addr}" $app_variant {
20% canary;
* stable;
}
# Reload: sudo nginx -t && sudo nginx -s reload
# Argo Rollouts progressive delivery
# Deploy new image and start canary
kubectl argo rollouts set image my-rollout \
my-container=myrepo/myapp:v1.4.2 -n production
# Check rollout status
kubectl argo rollouts status my-rollout -n production
# Manually promote to next step (if auto-pause is configured)
kubectl argo rollouts promote my-rollout -n production
# Abort and rollback on failure
kubectl argo rollouts abort my-rollout -n production
Rollback order & data impact
[ trigger: SLO breach / fatal defect / human stop ]
│
▼
┌─────────────┐ First mitigate: throttle, flip flags, disable batch jobs
│ Traffic │
└─────────────┘
│
▼
┌─────────────┐ App: roll back to last good artifact (or blue/green swap)
│ Artifact │──── Data: if migration irreversible → compensation playbook, not only “deploy old binary”
└─────────────┘
│
▼
┌─────────────┐ Verify: probes + core business samples + alert recovery
│ Post-RB │──── Comms: internal channel + status page (if external)
└─────────────┘
- Whether app rollback needs migration rollback; if irreversible, mark approval and data-fix path.
- Cache & search indexes: rebuild vs TTL expiry—avoid old code reading new data shapes.
#!/bin/bash
# Rollback execution script
set -euo pipefail
PREV_VERSION="${1:?Usage: rollback.sh <previous-version>}"
NAMESPACE="production"
DEPLOYMENT="myapp"
echo "=== Initiating rollback to $PREV_VERSION ==="
# 1. Rollback Kubernetes deployment
kubectl set image deployment/$DEPLOYMENT \
$DEPLOYMENT=myrepo/myapp:$PREV_VERSION -n $NAMESPACE
# 2. Wait for rollback to complete
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --timeout=5m
# 3. Verify health after rollback
sleep 30
status=$(curl -sf -o /dev/null -w "%{http_code}" https://api.example.com/healthz/ready)
[[ "$status" == "200" ]] && echo " OK service healthy after rollback" \
|| echo "WARNING: health check returned $status - investigate"
echo "=== Rollback to $PREV_VERSION completed ==="
# Slack rollback notification template (JSON)
{
"text": ":rotating_light: *[ROLLBACK]* Production rollback in progress",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Service:* myapp\n*Rolled back to:* v1.4.1\n*Reason:* Error rate spike (3.2% > 0.5% threshold)\n*Engineer:* @on-call\n*Status:* Rolling back..."
}
},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Runbook:* <https://wiki.example.com/rollback-runbook|Rollback Runbook>\n*Dashboard:* <https://grafana.example.com/d/release|Release Dashboard>"
}
}
]
}
Comms, windows & freeze
- Internal: change summary, blast radius, on-call & escalation, expected rollback time.
- External: status page timing, maintenance templates (if applicable).
- Freeze: no-auto-merge / no-auto-publish windows and exception flow (hotfix).
Runbook snippet lab
Pick a strategy and check what this change touches to generate a Markdown snippet for tickets or agent context; the release name is used in the title—adjust to your repo convention.
Output is a skeleton—add concrete thresholds, owners, ticket links, and rollback commands. For “full” strategy, the SKILL must include an explicit human confirmation step.
---
name: release-and-rollback
description: Standardized release pipeline, health gates, rollback runbook, and communications
version: 2.0
---
# Pre-release gates
- Verify: required env vars set, secrets accessible, downstream APIs healthy
- DB migrations validated on production-like snapshot
- Feature flags: new features default OFF in production
- Checklist signed off before deploy starts
# Deploy and observe
- Use canary or rolling: never big-bang to 100% in one step
- Monitor: error rate < 0.5%, P99 < 800ms, business probes green
- Hold at each stage for at least 10 min before advancing
- Keep rollback command ready throughout observation window
# Rollback procedure
- Trigger: error rate > threshold OR P99 > threshold OR business probe failure
- Execute: kubectl rollout undo OR set image to previous version
- Notify: post to Slack/PagerDuty within 2 min of decision
- Retain failed version artifacts for post-mortem
# Post-release
- Smoke test in production after stable period
- Update status page and notify stakeholders
- File incident report if rollback was triggered
- Schedule retro within 48h of any significant incident