Runbook authoring
Have agents chain alerts, dashboards, and CLI checks into step-by-step instructions with prerequisites, expected output, and explicit “stop and escalate” thresholds—no invented metric names or cluster context.
Open with service scope, dependencies, and maintenance windows; every command should be copy-pasteable with read-only vs mutating operations called out. Deep-link related playbooks and dashboards to reduce context switching.
When the SKILL lacks facts, emit explicit <TBD> markers instead of fabricating values; keep known false positives and fast rule-outs in their own section so on-call can skip irrelevant branches under pressure.
Scope, dependencies, audience
State which components and environments (prod/stage) the runbook covers and the boundary between data and control planes; list required permissions (read-only roles, whether break-glass accounts are allowed) and maintenance window notes.
- Audience: tier-1 on-call, domain experts, whether management must be notified.
- Links: architecture diagrams, SLO pages, related runbook index.
- Out of scope: “use runbook X instead” to prevent misuse.
Complete Runbook header template (required in every Runbook):
# Runbook: Database Connection Pool Exhausted
**Service**: payment-svc (prod, k8s/ns: payments)
**Version**: v3 — Last updated: 2024-03-20 by @sre-alice
**Owner**: SRE Team (PagerDuty: pd.link/payment-sre)
**Trigger alerts**: PaymentDB_PoolExhausted / PaymentAPI_HighErrorRate
## Audience & Permissions
- **L1 on-call**: Read-only permissions sufficient for diagnostic steps
- **L2 expert**: Requires payment-admin role for mitigation steps
- **Management notification**: Notify @eng-manager when impact >5% of users or lasts >30 min
## Dependent Services
- PostgreSQL 15 (RDS: payment-db-prod)
- Redis 7 (ElastiCache: payment-cache-prod)
- Downstream: order-svc, notification-svc
## Out of Scope
- Database disk full: use [db-disk-full runbook](./db-disk-full.md)
- Redis connection issues: use [redis-slow runbook](./redis-slow.md)
- During maintenance windows: follow [planned-maintenance](./maintenance.md) process
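A header like the one above can be checked mechanically before a runbook is merged. The following is a minimal Python sketch, not part of the original template: the function name `missing_header_fields` is hypothetical, and treating every label and section from the example as mandatory is an assumption.

```python
import re

# Labels and section headings copied from the example header above.
# Treating all of them as mandatory is an assumption of this sketch.
REQUIRED_FIELDS = ["Service", "Version", "Owner", "Trigger alerts"]
REQUIRED_SECTIONS = ["Audience & Permissions", "Dependent Services", "Out of Scope"]


def missing_header_fields(markdown: str) -> list[str]:
    """Return the required header fields and sections absent from a runbook."""
    missing = []
    for field in REQUIRED_FIELDS:
        # Matches lines such as "**Service**: payment-svc (prod, ...)"
        # and requires a non-empty value on the same line.
        pattern = rf"^\*\*{re.escape(field)}\*\*:[ \t]*\S"
        if not re.search(pattern, markdown, re.MULTILINE):
            missing.append(field)
    for section in REQUIRED_SECTIONS:
        pattern = rf"^##\s+{re.escape(section)}\s*$"
        if not re.search(pattern, markdown, re.MULTILINE):
            missing.append(section)
    return missing
```

An empty return value means the header is complete; anything else is a list of labels to fill in or mark as <TBD>.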
Triggers & pre-checks
Triggers
- Specific alert names, queries, or dashboard panels (pasteable PromQL / LogQL snippets).
- Thresholds: e.g. error rate > 1% for 5m—use numbers, not “elevated.”
- Correlation with business events (deploys, traffic shifts, campaigns) when applicable.
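To illustrate why numeric thresholds matter, here is a small Python sketch that classifies connection pool usage. The 90% and 98% defaults are borrowed from the payment-db example later in this document; the function name is hypothetical, and the sketch ignores the "for N minutes" duration, which the alert rule itself would normally enforce.

```python
def classify_pool_usage(active: float, maximum: float,
                        warning: float = 0.90, critical: float = 0.98) -> str:
    """Classify connection pool usage as 'ok', 'warning', or 'critical'.

    Thresholds are strict ("greater than"), mirroring the "> 90%" wording
    used in the trigger example. Duration is deliberately not modeled here.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    ratio = active / maximum
    if ratio > critical:
        return "critical"
    if ratio > warning:
        return "warning"
    return "ok"
```

With explicit numbers, "elevated" becomes a value that on-call can compare against a dashboard reading without interpretation.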
Pre-checks
- Confirm no planned change or known incident in flight.
- Confirm kubectl / cloud CLI context and project IDs (command examples).
- Pass read-only checks before any mutating steps.
Trigger conditions and pre-check code example:
## Trigger Conditions
# Alert: PaymentDB_PoolExhausted
# PromQL:
hikaricp_connections_active{pool="payment-db"} /
hikaricp_connections_max{pool="payment-db"} > 0.9
# Thresholds (the expression above is the warning condition; the critical rule uses > 0.98):
# connection pool usage > 90% for 2 min → warning
# connection pool usage > 98% for 1 min → critical (triggers this runbook)
## Pre-checks (read-only, ~2 minutes)
# 1. Confirm kubectl context
kubectl config current-context
# Expected output: prod-cluster
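# 2. Confirm no planned changes or rollouts in flight
kubectl rollout history deployment/payment-svc -n payments | tail -3
# Expected output: the latest revision matches the last known deploy
kubectl get events -n payments --field-selector reason=Killing --sort-by='.lastTimestamp' | tail -5
# Expected output: no recent pod kill events (recent kills suggest a deploy or restart in progress)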
# 3. Confirm alert is real (not monitoring jitter)
kubectl exec -n payments deployment/payment-svc -- curl -s http://localhost:8080/actuator/health/db
# Expected output: {"status":"UP"} (healthy) or {"status":"DOWN"} (alert confirmed; proceed to the diagnostic steps)
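The context pre-check can also be enforced in tooling instead of by eye. The sketch below is an illustration under assumptions: `confirm_context` is a hypothetical helper, and the command runner is injected so the logic can be exercised without a cluster.

```python
from typing import Callable, Sequence


def confirm_context(run: Callable[[Sequence[str]], str], expected: str) -> None:
    """Raise if the active kubectl context is not the expected one.

    `run` executes a command (given as an argv list) and returns its stdout.
    In real use it could wrap subprocess, for example:
        lambda argv: subprocess.run(argv, capture_output=True,
                                    text=True, check=True).stdout
    """
    actual = run(["kubectl", "config", "current-context"]).strip()
    if actual != expected:
        raise RuntimeError(
            f"kubectl context is {actual!r}, expected {expected!r}; stop here"
        )
```

Calling this before every mutating step turns "confirm the context" from a reminder into a hard stop.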
Diagnostic pipeline (read-only first)
[ Alert / ticket fires ]
           │
           ▼
┌────────────────────┐  Read-only: metrics / logs / traces / health endpoints
│ Narrow scope       │──── Output: affected replicas, region, dependency health
└────────────────────┘
           │
           ▼
┌────────────────────┐  Compare to “expected output” table; capture actuals + time
│ Step commands      │──── Each step: permissions, parallel OK?, expected duration
└────────────────────┘
           │
           ▼
┌────────────────────┐  Hit escalation threshold → stop self-heal, escalate with evidence
│ Decide: mitigate   │  Else → false-positive path or document new root cause
│ or escalate        │
└────────────────────┘
Every step needs “what you should see” and “if not, do next”; avoid open-ended “check the database” without concrete commands and criteria.
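One way to enforce this rule is to model each step as data and lint it. The following Python sketch is illustrative only; the `Step` fields and the `lint_step` name are assumptions, not a format the source prescribes.

```python
from dataclasses import dataclass


@dataclass
class Step:
    title: str
    command: str           # concrete, copy-pasteable command
    expected_output: str   # "what you should see"
    on_failure: str        # "if not, do next"
    mutating: bool = False


def lint_step(step: Step) -> list[str]:
    """Return the problems that make a step too vague to follow on-call."""
    problems = []
    if not step.command.strip():
        problems.append("missing command")
    if not step.expected_output.strip():
        problems.append("missing expected output")
    if not step.on_failure.strip():
        problems.append("missing on-failure action")
    return problems
```

An open-ended step such as "check the database" fails all three checks, which is exactly the kind of entry the rule is meant to catch.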
Specific diagnostic steps for DB connection pool exhaustion (each step: command + expected output + on failure):
## Diagnostic Steps (read-only, ~5 minutes)
### Step 1: Check current connection pool status
kubectl exec -n payments deployment/payment-svc -- \
curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.active
# Expected output (normal): {"name":"hikaricp.connections.active","measurements":[{"statistic":"VALUE","value":12.0}]}
# Abnormal: value >= configured maximum pool size (50 in this example; HikariCP's built-in default is 10) -> confirm pool exhaustion, proceed to step 2
# Note: read-only operation, no side effects
### Step 2: Check slow query logs (read-only)
kubectl exec -n payments \
$(kubectl get pod -n payments -l app=payment-svc -o jsonpath='{.items[0].metadata.name}') \
-- cat /var/log/app/slow-queries.log | tail -50
# Expected output (normal): no records with duration>1000ms
# Warning signs: many entries with duration>5000ms, or the same SQL repeated -> record the query, proceed to step 3
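### Step 3: Check database connection count (read-only)
# Assumes DB_URL is set inside the container; sh -c makes the container's shell expand it, not your local shell
kubectl exec -n payments deployment/payment-svc -- \
sh -c 'psql "$DB_URL" -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"'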
# Expected output:
# count | state
# -------+--------
# 45 | active
# 5 | idle
# Abnormal: total connections at or near max_connections (100 here) -> enter mitigation flow
# On failure (no permission): contact DBA or escalate to L2
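The step 1 comparison can be scripted so the decision does not depend on reading JSON by eye. This is a minimal Python sketch that parses a response shaped like the expected output shown in step 1; the function names are hypothetical, and `max_pool_size` must come from the service's real configuration.

```python
import json


def active_connections(metric_json: str) -> float:
    """Extract the VALUE measurement from an actuator metrics response."""
    payload = json.loads(metric_json)
    for measurement in payload["measurements"]:
        if measurement["statistic"] == "VALUE":
            return float(measurement["value"])
    raise ValueError("no VALUE measurement in response")


def pool_exhausted(metric_json: str, max_pool_size: int) -> bool:
    """True when active connections have reached the configured maximum."""
    return active_connections(metric_json) >= max_pool_size
```

A `True` result corresponds to the "confirm pool exhaustion, proceed to step 2" branch above.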
Mitigate, rollback, feature flags
Order mitigations by risk: try scaling out, throttling, or degrading reads before restarting or shifting traffic. Rollback order: feature flags → app version → DB migrations (only if they are reversible).
## Mitigation Steps (ordered by risk, low to high)
### Option A: Temporarily increase connection pool (low risk, requires payment-admin)
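# Prerequisite: confirm DB max_connections allows more connections
# Note: kubectl set env changes the pod template and triggers a rolling restart of the deployment
kubectl set env -n payments deployment/payment-svc HIKARI_MAX_POOL_SIZE=80
kubectl rollout status deployment/payment-svc -n payments
# Verification: after the rollout completes (allow ~30s), re-run step 1, expect active < 70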
### Option B: Terminate long-idle connections (medium risk)
# Assumes DB_URL is set inside the container; the quoted heredoc keeps your local shell from expanding anything
kubectl exec -i -n payments deployment/payment-svc -- sh -c 'psql "$DB_URL"' <<'SQL'
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes';
SQL
# Expected: pg_terminate_backend column shows t (true) for each terminated connection
# Warning: this closes existing pooled connections; verify subsequent requests are normal
### Option C: Rollback application version (high risk, affects all in-flight requests)
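# 1. Disable feature flag first (if applicable); note this creates a new deployment revision
kubectl set env -n payments deployment/payment-svc FEATURE_NEW_QUERY_OPTIMIZER=false
# Wait 60s, observe error rate
# 2. If no improvement, rollback version
# A plain "rollout undo" would only revert the most recent change (the flag above), so pick the revision explicitly
kubectl rollout history deployment/payment-svc -n payments
# Set GOOD_REVISION to the last known-good revision number from the history output
kubectl rollout undo deployment/payment-svc -n payments --to-revision="$GOOD_REVISION"
kubectl rollout status deployment/payment-svc -n payments
# Expected: deployment "payment-svc" successfully rolled out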
# Rollback verification (same commands as trigger section)
kubectl exec -n payments deployment/payment-svc -- curl -s http://localhost:8080/actuator/health/db
# Expected: {"status":"UP"}
- Before each mutating command, re-confirm context (cluster, namespace, release name).
- After rollback, re-run the same checks as the trigger section with expected healthy ranges.
- If rollback is forbidden (compliance, data migration), document alternate mitigation and escalation criteria.
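The risk ordering and the "rollback forbidden" case can be expressed as a small planning function. The sketch below is an assumption-laden illustration: the `Mitigation` type, the three risk labels, and the `allowed` flag are hypothetical names chosen to mirror Options A to C above.

```python
from dataclasses import dataclass

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Mitigation:
    name: str
    risk: str             # "low", "medium", or "high"
    allowed: bool = True  # False when e.g. compliance forbids a rollback


def plan(mitigations: list[Mitigation]) -> list[str]:
    """Return the names of permitted mitigations, lowest risk first."""
    usable = [m for m in mitigations if m.allowed]
    return [m.name for m in sorted(usable, key=lambda m: RISK_ORDER[m.risk])]
```

If the plan comes back empty or every remaining option has been tried, that is a signal to follow the documented escalation criteria instead of improvising.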
Escalation (L1 → L2 → management)
List hard criteria for L2: e.g. user impact > N, data corruption risk, cannot mitigate within 30 minutes, needs cross-team permissions—plus on-call directory or PagerDuty policy links.
- L1: execute this runbook to decision points; collect timeline and command excerpts.
- L2: domain experts; may involve hotfix or infra change.
- Management: external comms, customer commitments, emergency resources; who can approve downtime.
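Hard criteria are easiest to audit when they are written as explicit conditions. The following Python sketch encodes the example criteria listed above; the function name is hypothetical, the impact threshold (the "N" in the text) is deliberately a required argument, and the 30-minute default comes from the example wording.

```python
def l2_escalation_reasons(
    impact_pct: float,
    minutes_unmitigated: float,
    data_corruption_risk: bool,
    needs_cross_team_access: bool,
    impact_threshold_pct: float,
    time_limit_min: float = 30.0,
) -> list[str]:
    """Return the hard criteria that are met; an empty list means stay at L1."""
    reasons = []
    if impact_pct > impact_threshold_pct:
        reasons.append(f"user impact {impact_pct}% exceeds {impact_threshold_pct}%")
    if data_corruption_risk:
        reasons.append("data corruption risk")
    if minutes_unmitigated > time_limit_min:
        reasons.append(f"not mitigated within {time_limit_min:g} minutes")
    if needs_cross_team_access:
        reasons.append("needs cross-team permissions")
    return reasons
```

The returned reasons double as the evidence summary that L1 hands over when paging L2.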
False positives & fast rule-outs
Dedicated section for known false positives (noisy metrics, dependency maintenance windows, sampling issues) with 2–3 fastest exclusion commands; branch off the main diagnostic tree to reduce noise.
- If confirmed false positive: suggested silence duration, labels, follow-up ticket (tune threshold or alert rule).
- Cross-reference the trigger section so thresholds do not contradict.
Aftermath: monitoring & runbook updates
After resolution, check whether alert thresholds, dashboards, or unclear runbook steps need updates; link to a postmortem template when applicable.
- Escalation path: L1 → L2 → management criteria (shorten to “same as § Escalation” if identical).
- Rollback: flags, deploy version, DB patch order (echo mitigation section).
- Documentation debt: owner, due date, verification (cross-link tickets).
Runbook validity test strategy (periodic drills):
# Runbook Validity Checklist (run quarterly)
## Drill Methods
1. Inject fault in staging environment (chaos engineering)
# The endpoint below is illustrative; substitute your own fault-injection tool or endpoint
kubectl exec -n payments-staging deployment/payment-svc -- \
curl -X POST http://localhost:8080/actuator/chaos/pool-exhaust
# Then follow runbook steps, record actual time taken
2. Dry-run review
- New team member reads runbook independently and attempts to narrate steps
- Record sticking points -> update documentation
## Validity Criteria
- [ ] All commands executable on current version (no deprecated APIs)
- [ ] Each step "expected output" matches actual environment
- [ ] New on-call member (no background) can complete diagnosis in 10 min
- [ ] Escalation conditions match current SLA definitions
- [ ] External links (dashboard, PD policy, related runbooks) all accessible
## Revision Log
| Date | Author | Change | Version |
|------|--------|--------|---------|
| 2024-03-20 | @sre-alice | Added connection pool diagnostic step 3 | v3 |
| 2024-01-15 | @sre-bob | Updated escalation threshold | v2 |
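A revision log in this shape can be used to flag runbooks that have gone too long without a review. The Python sketch below assumes that every drill or review leaves a dated row in the log and that "quarterly" means roughly 92 days; both are assumptions, and the function names are hypothetical.

```python
import re
from datetime import date, timedelta

# Matches the leading date cell of a table row, e.g. "| 2024-03-20 | ...".
ROW_DATE = re.compile(r"^\|\s*(\d{4}-\d{2}-\d{2})\s*\|", re.MULTILINE)


def latest_revision(revision_log: str) -> date:
    """Return the most recent date found in a Markdown revision log table."""
    dates = [date.fromisoformat(d) for d in ROW_DATE.findall(revision_log)]
    if not dates:
        raise ValueError("no dated rows found in revision log")
    return max(dates)


def review_overdue(revision_log: str, today: date, max_age_days: int = 92) -> bool:
    """True when the newest revision is older than the allowed review interval."""
    return today - latest_revision(revision_log) > timedelta(days=max_age_days)
```

Passing `today` explicitly keeps the check deterministic, which makes it straightforward to run in CI against every runbook in a repo.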
Runbook outline lab
Set a title and optional service name, select sections, and generate Markdown you can paste into a wiki or repo—then replace placeholders with real alerts and commands.
Mark unknowns with <TBD>; agents must replace them with real links, queries, and thresholds from the repo—never leave fabricated values.
---
name: runbook-authoring
description: Generate ops runbook skeletons from alerts and architecture context
model: claude-sonnet-4-5
---
# Required Runbook sections
sections:
- Header: service/version/owner/trigger alerts
- Scope: environment/audience permissions/dependencies/out of scope
- Trigger conditions: PromQL/thresholds/business correlation
- Pre-checks: read-only commands to confirm context
- Diagnostic steps: command + expected output + on failure
- Mitigate & rollback: ordered by risk (low to high)
- Escalation path: L1→L2→management hard criteria
- False positives & fast rule-outs
- Aftermath: test records and documentation debt
# Step format convention
step_format: |
### Step N: [description] ([read-only | mutating])
Command: `kubectl ...`
Expected output: [specific content]
On failure: [next step or escalation criteria]
# Forbidden
forbidden:
- Do not fabricate metric names, cluster names, or config values
- Do not guess when information is insufficient; output a <TBD> placeholder instead
- Do not omit permission notes or expected output
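Output produced under these rules can be linted before a human reviews it. The following Python sketch is one possible check, not part of the skill definition: it reports unresolved <TBD> markers and step headings that lack the read-only or mutating label from the step format convention. The function name and message wording are assumptions.

```python
import re

# Mirrors the heading line of the step format convention above.
STEP_HEADING = re.compile(r"^### Step \d+: .+ \((read-only|mutating)\)\s*$")


def lint_generated_runbook(markdown: str) -> list[str]:
    """Return findings for unresolved placeholders and unlabeled steps."""
    findings = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        if "<TBD>" in line:
            findings.append(f"line {number}: unresolved <TBD>")
        if line.startswith("### Step ") and not STEP_HEADING.match(line):
            findings.append(
                f"line {number}: step heading lacks (read-only) or (mutating)"
            )
    return findings
```

Unresolved markers are expected in a first draft; the point of the check is that none of them reach on-call unnoticed.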