Disaster recovery drills
Align RTO/RPO with business commitments; runbooks spell out order, validation, and abort conditions. This page shows a decision flow and generates a timeboxed checklist from a target RTO (minutes) for tabletops and live drills.
The skill should define drill scenarios (single AZ, whole region, major vendor API outage), roles, success criteria, and abort conditions so the exercise does not cause a production incident.
Technical scope includes: DNS / traffic ingress cutover, promoting read replicas or cold DB backups, secrets and config available in the DR region, and backlog handling for batch and messaging.
After the drill, capture: actual duration vs targets, single points of failure, doc and automation gaps, owners and due dates—and have the agent help turn these into trackable items.
RTO / RPO definitions and alignment
RTO (Recovery Time Objective)
Target maximum time from “declaring the event” to “critical business restored within agreed scope.” State measurement start (detection / formal declaration / first runbook step) and end (which SLO, traffic, or tenants count as recovered).
RPO (Recovery Point Objective)
Maximum acceptable data loss window, usually expressed as last consistent backup / replication position. Async replication, batch sync, and manual backfill raise real RPO; the runbook must state “as of what moment” and how reconciliation works.
- Declare RTO/RPO by layer (ingress, app, data, messaging, batch)—avoid one site-wide number hiding weak spots.
- When customers or regulators impose commitments, drill frequency and measured outcomes must trace to compliance matrices or contracts.
- Recovery paths must be validated (including periodic restore drills), not runbook-only.
#!/bin/bash
# DR drill: RTO/RPO timeline recorder
# Usage: source this script, call record_event at each key step
DRILL_START=$(date +%s)
DRILL_LOG="dr-drill-$(date +%Y%m%d-%H%M%S).log"
record_event() {
local label="$1"
local now=$(date +%s)
local elapsed=$(( now - DRILL_START ))
echo "[T+${elapsed}s] $label" | tee -a "$DRILL_LOG"
}
# Example drill sequence:
# record_event "INCIDENT DECLARED - alert fired"
# record_event "ON-CALL RESPONDED - runbook opened"
# record_event "DIAGNOSIS COMPLETE - root cause identified"
# record_event "RECOVERY ACTION STARTED"
# record_event "SERVICE RESTORED - health checks green"
# record_event "INCIDENT CLOSED"
echo "Drill log: $DRILL_LOG"
echo "Target RTO: 30 min | Target RPO: 5 min"
Runbook structure
The runbook is shared by drills and real cutovers: each step names the executor, preconditions, commands or console paths, expected output, and rollback or escalation on failure. When the agent drafts content, require links to dashboards, ticket templates, and on-call owners.
- Triage and severity: which symptoms escalate to DR, who may declare, handoff to incident management (incident ID).
- Dependency topology: ordered DNS, LB, compute, data, secrets, third-party APIs; mark read-only deps and deferrable work.
- Execution: numbered steps, estimated duration, parallelism; explicit human approval where automation is forbidden.
- Validation: probes, canaries, sampled business transactions, data checks—aligned to RTO timeboxes.
- Comms: status page, enterprise customer messaging, internal war room channels and cadence.
- Abort and rollback: when to stop cutover, how to return to the original region or last known good.
- Retrospective: actual RTO/RPO, gap analysis, improvements and due dates.
#!/bin/bash
# Disaster scenario playbooks
# --- Scenario A: Database primary failover ---
db_failover() {
echo "[DR] Starting DB failover procedure..."
# 1. Promote read replica to primary
aws rds failover-db-cluster --db-cluster-identifier myapp-cluster
# 2. Update application connection string
kubectl set env deployment/myapp DB_HOST="$NEW_PRIMARY_ENDPOINT"
# 3. Verify connectivity
kubectl exec -it deploy/myapp -- pg_isready -h "$NEW_PRIMARY_ENDPOINT"
}
# --- Scenario B: Region failover ---
region_failover() {
echo "[DR] Initiating region failover: us-east-1 -> us-west-2..."
# 1. Update Route 53 health check and failover routing
aws route53 change-resource-record-sets \
--hosted-zone-id Z123... \
--change-batch file://failover-record.json
# 2. Scale up standby region
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name myapp-us-west-2 \
--min-size 3 --desired-capacity 6
}
# --- Scenario C: Data restore from backup ---
restore_from_backup() {
echo "[DR] Restoring from latest snapshot..."
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
--db-instance-identifier myapp-db \
--query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
--output text)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier myapp-db-restored \
--db-snapshot-identifier "$SNAPSHOT_ID"
echo "RPO check: $(aws rds describe-db-snapshots \
--db-snapshot-identifier $SNAPSHOT_ID \
--query 'DBSnapshots[0].SnapshotCreateTime' --output text)"
}
DR decision and execution flow
[ Monitoring / human detects anomaly ]
│
▼
┌─────────────┐ Align: blast radius, false positive, declare or not
│ Triage │──── Out: incident owner, initial comms cadence
└─────────────┘
│
▼
┌─────────────┐ Against runbook: RTO/RPO, abort gates, approvals
│ War room │──── Log: executor per step, timestamps, command/ticket links
└─────────────┘
│
▼
┌─────────────┐ DNS / traffic / compute / data in order or parallel (predefined)
│ Execute │──── Secrets & config in target region; queue & batch plan ready
└─────────────┘
│
▼
┌─────────────┐ Probes + business samples + data reconciliation (per RPO)
│ Tech verify │──── Abort or rollback if thresholds missed—no silent “close enough”
└─────────────┘
│
▼
┌─────────────┐ Status page / customer notice / internal sync
│ External │──── Align with legal and enterprise processes if applicable
└─────────────┘
│
▼
┌─────────────┐ Gaps, SPOFs, doc & automation holes → trackable items
│ Drill retro │
└─────────────┘
As with backup skills: recovery must be repeatable; timeboxes are aids only—real approvals and third parties can stretch segments, so call that out in templates.
Drill scenarios and success criteria
- Define success criteria and resource caps separately for single AZ, whole region, and major vendor API outages.
- Comms plan: status page, enterprise notifications, war room channels; during drills avoid sending “real disaster” copy externally (use agreed test markers).
- After drills, log improvements: automation gaps, ambiguous docs, on-call blind spots—with owners and due dates.
# Post-drill retrospective template (5 Whys format)
## DR Drill Retrospective - [Date] [Scenario Name]
### Timeline Summary
| Time (T+) | Event |
|-----------|-------|
| 0s | Incident declared / drill started |
| Xs | On-call responded |
| Xs | Root cause identified |
| Xs | Recovery action started |
| Xs | Service restored |
### Metrics vs Targets
| Metric | Target | Actual | Pass/Fail |
|--------|--------|--------|-----------|
| RTO | 30 min | | |
| RPO | 5 min | | |
### 5 Whys Root Cause Analysis
- Why 1: Why did the incident occur?
- Why 2: Why did that condition exist?
- Why 3: Why was that allowed?
- Why 4: Why wasn't it caught earlier?
- Why 5: Why does the underlying gap exist?
### Action Items
- [ ] Runbook update: [what needs to change]
- [ ] Automation: [what to automate to reduce RTO]
- [ ] Monitoring: [what detection gap to close]
- [ ] Training: [who needs to practice what]
### Sign-off
- Drill owner: ___ | Next drill date: ___
RTO minute timebox checklist
Enter a target RTO (minutes) to generate cumulative deadlines and checkable steps for tabletop timing or printouts. Ratios can be rewritten per org in your runbook; this is a generic template.
Phase deadlines and generated checklist items
Phase ratios: declare 8% · assess 12% · cutover 42% · technical validation 22% · business sign-off 10% · comms wrap-up 6% (100% total). Minutes per segment are floored by ratio then remainder distributed so segments sum exactly to target RTO; very short RTOs may yield 0-minute segments—merge with neighbors in practice.
---
name: disaster-recovery
description: DR drill runbook, RTO/RPO tracking, and post-drill retrospective
version: 2.0
---
# Define recovery targets
- RTO: max acceptable downtime per service tier (e.g. Tier 1 = 15 min)
- RPO: max acceptable data loss (e.g. Tier 1 = 5 min)
- Document targets in SLA and review with stakeholders quarterly
# Runbook structure
- Trigger conditions: when to invoke this runbook
- Decision tree: how to diagnose and choose recovery path
- Step-by-step recovery commands with expected outputs
- Rollback path if recovery fails
- Owner + last-tested date
# Drill execution
- Record timeline at each step: record_event "description"
- Drill quarterly; rotate on-call engineers as participants
- Scenarios: DB failover, region failover, data restore, dependency outage
# Post-drill retrospective
- Compare actual RTO/RPO vs targets
- 5 Whys: trace gaps to root cause
- Action items: runbook updates, automation, monitoring, training
- Schedule next drill before closing the retro