Runbook authoring

Have agents chain alerts, dashboards, and CLI checks into step-by-step instructions with prerequisites, expected output, and explicit “stop and escalate” thresholds, and with no invented metric names or cluster context.

Open with service scope, dependencies, and maintenance windows; every command should be copy-pasteable with read-only vs mutating operations called out. Deep-link related playbooks and dashboards to reduce context switching.

When the SKILL lacks facts, emit explicit <TBD> markers instead of fabricating values; keep known false positives and fast rule-outs in their own section so on-call can skip irrelevant branches under pressure.

Scope, dependencies, audience

State which components and environments (prod/stage) the runbook covers and the boundary between data and control planes; list required permissions (read-only roles, whether break-glass accounts are allowed) and maintenance window notes.

Audience: tier-1 on-call, domain experts, whether management must be notified.
Links: architecture diagrams, SLO pages, related runbook index.
Out of scope: “use runbook X instead” to prevent misuse.

Complete Runbook header template (required in every Runbook):

# Runbook: Database Connection Pool Exhausted
**Service**: payment-svc (prod, k8s/ns: payments)
**Version**: v3 — Last updated: 2024-03-20 by @sre-alice
**Owner**: SRE Team (PagerDuty: pd.link/payment-sre)
**Trigger alerts**: PaymentDB_PoolExhausted / PaymentAPI_HighErrorRate

## Audience & Permissions
- **L1 on-call**: Read-only permissions sufficient for diagnostic steps
- **L2 expert**: Requires payment-admin role for mitigation steps
- **Management notification**: Notify @eng-manager when impact >5% of users or lasts >30 min

## Dependent Services
- PostgreSQL 15 (RDS: payment-db-prod)
- Redis 7 (ElastiCache: payment-cache-prod)
- Downstream: order-svc, notification-svc

## Out of Scope
- Database disk full: use [db-disk-full runbook](./db-disk-full.md)
- Redis connection issues: use [redis-slow runbook](./redis-slow.md)
- During maintenance windows: follow [planned-maintenance](./maintenance.md) process

Triggers & pre-checks

Triggers

Specific alert names, queries, or dashboard panels (pasteable PromQL / LogQL snippets).
Thresholds: e.g. error rate > 1% for 5m—use numbers, not “elevated.”
Correlation with business events (deploys, traffic shifts, campaigns) when applicable.

Pre-checks

Confirm no planned change or known incident in flight.
Confirm kubectl / cloud CLI context and project IDs (command examples).
Pass read-only checks before any mutating steps.

Trigger conditions and pre-check code example:

## Trigger Conditions
# Alert: PaymentDB_PoolExhausted
# PromQL:
hikaricp_connections_active{pool="payment-db"} /
hikaricp_connections_max{pool="payment-db"} > 0.9

# Threshold: connection pool usage > 90% for 2 min → warning
#            connection pool usage > 98% for 1 min → critical (triggers this runbook)

## Pre-checks (read-only, ~2 minutes)
# 1. Confirm kubectl context
kubectl config current-context
# Expected output: prod-cluster

# 2. Confirm no recent pod kills (a sign of a deploy or other planned change in flight)
kubectl get events -n payments --field-selector reason=Killing --sort-by='.lastTimestamp' | tail -5
# Expected output: no recent Killing events for payment-svc pods

# 3. Confirm alert is real (not monitoring jitter)
kubectl exec -n payments deployment/payment-svc -- \
  curl -s http://localhost:8080/actuator/health/db
# Expected output: {"status":"UP"} or {"status":"DOWN"}; if DOWN, the alert is real, so proceed to the diagnostic steps
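
The warning and critical thresholds in the trigger example can also be checked by hand from two raw numbers. The following is a minimal sketch, not part of the original template: the function name `classify_pool_usage` is hypothetical, and it only applies the 90% and 98% usage levels; the "for 2 min" and "for 1 min" duration conditions are assumed to stay in the alert rule.

```shell
# classify_pool_usage ACTIVE MAX
# Prints ok, warning, or critical using the 90% / 98% usage levels from the
# trigger section. Duration conditions are intentionally not modelled here.
classify_pool_usage() {
  local active="$1" max="$2"
  awk -v a="$active" -v m="$max" 'BEGIN {
    if (m <= 0) { print "invalid"; exit 1 }
    ratio = a / m
    if (ratio > 0.98) { print "critical" }
    else if (ratio > 0.90) { print "warning" }
    else { print "ok" }
  }'
}
```

For example, `classify_pool_usage 46 50` reports `warning`, which tells on-call that the pool is under pressure but has not yet reached the level that triggers this runbook.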

Diagnostic pipeline (read-only first)

  [ Alert / ticket fires ]
        │
        ▼
  ┌─────────────┐     Read-only: metrics / logs / traces / health endpoints
  │ Narrow scope │──── Output: affected replicas, region, dependency health
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Compare to “expected output” table; capture actuals + time
  │ Step commands │──── Each step: permissions, parallel OK?, expected duration
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Hit escalation threshold → stop self-heal, escalate with evidence
  │ Decide: mitigate │     Else → false-positive path or document new root cause
  │ or escalate      │
  └─────────────┘

Every step needs “what you should see” and “if not, do next”; avoid open-ended “check the database” without concrete commands and criteria.
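
One way to make “what you should see” and “if not, do next” mechanical is a small wrapper that runs a read-only command and compares its output with an expected pattern. This is a sketch under assumptions: `run_step` and its argument order are hypothetical, and the expected output is given as an extended regular expression.

```shell
# run_step DESCRIPTION EXPECTED_REGEX ON_FAILURE COMMAND [ARGS...]
# Runs a read-only command, prints its output, and checks it against an
# extended regular expression that describes the healthy output.
run_step() {
  local description="$1" expected="$2" on_failure="$3"
  shift 3
  local output
  output="$("$@" 2>&1)"
  printf '== %s ==\n%s\n' "$description" "$output"
  if printf '%s\n' "$output" | grep -Eq -- "$expected"; then
    echo "RESULT: matches expected output"
    return 0
  fi
  echo "RESULT: unexpected output. Next: $on_failure"
  return 1
}
```

A call such as `run_step "DB health" '"status":"UP"' "proceed to Step 1" kubectl exec -n payments deployment/payment-svc -- curl -s http://localhost:8080/actuator/health/db` keeps the command, the criterion, and the next action on one line that can be pasted into the incident timeline.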

Specific diagnostic steps for DB connection pool exhaustion (each step: command + expected output + on failure):

## Diagnostic Steps (read-only, ~5 minutes)

### Step 1: Check current connection pool status (read-only)
kubectl exec -n payments deployment/payment-svc -- \
  curl -s http://localhost:8080/actuator/metrics/hikaricp.connections.active

# Expected output (normal): {"name":"hikaricp.connections.active","measurements":[{"statistic":"VALUE","value":12.0}]}
# On failure: value >= the configured maximum pool size (50 in this example) -> confirm pool exhaustion, proceed to step 2
# Note: read-only operation, no side effects

### Step 2: Check slow query logs (read-only)
# tail runs inside the container, so only the last 50 lines are transferred
kubectl exec -n payments deployment/payment-svc -- \
  tail -n 50 /var/log/app/slow-queries.log

# Expected output (normal): no records with duration>1000ms
# Warning signs: many records with duration>5000ms, or the same SQL statement repeating -> record the query, proceed to step 3

### Step 3: Check database connection count (read-only)
# sh -c makes $DB_URL expand inside the container, not in your local shell
kubectl exec -n payments deployment/payment-svc -- \
  sh -c 'psql "$DB_URL" -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"'

# Expected output:
#  count | state
# -------+--------
#     45 | active
#      5 | idle
# Abnormal: total connections (sum of all states) >= max_connections (100) -> enter mitigation flow
# On failure (no permission): contact DBA or escalate to L2

Mitigate, rollback, feature flags

Order mitigations by risk: scale out, throttle, degrade reads before restart or traffic shift. Rollback order: feature flags → app version → whether DB migrations are reversible.

## Mitigation Steps (ordered by risk, low to high)

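### Option A: Temporarily increase connection pool (low risk, mutating, requires payment-admin)
# Prerequisite: confirm DB max_connections allows more connections (new pool size times replica count)
# Note: changing the pod template triggers a rolling restart of payment-svc pods
kubectl set env -n payments deployment/payment-svc HIKARI_MAX_POOL_SIZE=80
kubectl rollout status -n payments deployment/payment-svc
# Verification: after the rollout completes, wait 30s, re-run step 1, expect active < 70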

### Option B: Terminate connections idle for more than 10 minutes (medium risk, mutating)
# The SQL is passed on stdin; sh -c makes $DB_URL expand inside the container
kubectl exec -i -n payments deployment/payment-svc -- \
  sh -c 'psql "$DB_URL"' <<'SQL'
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes';
SQL
# Expected: each returned row shows t (true) in the pg_terminate_backend column
# Warning: this closes existing pooled connections; confirm that subsequent requests succeed

### Option C: Roll back application version (high risk, mutating, affects all in-flight requests)
# 1. Disable feature flag first (if applicable); changing env also restarts pods
kubectl set env -n payments deployment/payment-svc FEATURE_NEW_QUERY_OPTIMIZER=false
# Wait 60s, observe error rate

# 2. If no improvement, roll back the version
# Each `kubectl set env` above created a new revision, so a plain undo only reverts the env change;
# pick the last known-good revision from the history instead
kubectl rollout history deployment/payment-svc -n payments
kubectl rollout undo deployment/payment-svc -n payments --to-revision=<TBD>
kubectl rollout status deployment/payment-svc -n payments
# Expected: deployment "payment-svc" successfully rolled out

# Rollback verification (same commands as trigger section)
kubectl exec -n payments deployment/payment-svc -- \
  curl -s http://localhost:8080/actuator/health/db
# Expected: {"status":"UP"}

Before each mutating command, re-confirm context (cluster, namespace, release name).
After rollback, re-run the same checks as the trigger section with expected healthy ranges.
If rollback is forbidden (compliance, data migration), document alternate mitigation and escalation criteria.
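
Re-confirming context before a mutating command can be enforced rather than remembered. The wrapper below is a minimal sketch: `guard_mutating` is a hypothetical helper, it checks only the kubectl context (namespace and release name still need a manual look), and it relies on `kubectl config current-context`, the same command used in the pre-checks.

```shell
# guard_mutating EXPECTED_CONTEXT COMMAND [ARGS...]
# Runs COMMAND only when the active kubectl context equals EXPECTED_CONTEXT.
guard_mutating() {
  local expected="$1"
  shift
  local actual
  actual="$(kubectl config current-context 2>/dev/null)" || {
    echo "ABORT: could not read kubectl context" >&2
    return 2
  }
  if [ "$actual" != "$expected" ]; then
    echo "ABORT: context is '$actual', expected '$expected'" >&2
    return 1
  fi
  # Log the exact command to stderr so it can be copied into the timeline.
  echo "MUTATING on $actual: $*" >&2
  "$@"
}
```

For example, `guard_mutating prod-cluster kubectl rollout undo deployment/payment-svc -n payments` refuses to run if the shell is still pointed at a staging cluster.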

Escalation (L1 → L2 → management)

List hard criteria for L2: e.g. user impact > N, data corruption risk, cannot mitigate within 30 minutes, needs cross-team permissions—plus on-call directory or PagerDuty policy links.

L1: execute this runbook to decision points; collect timeline and command excerpts.
L2: domain experts; may involve hotfix or infra change.
Management: external comms, customer commitments, emergency resources; who can approve downtime.
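
Hard criteria are easier to apply under pressure when they are written as a check rather than prose. The sketch below is illustrative only: `needs_l2` is a hypothetical helper, the 30-minute limit comes from the criteria above, and the user-impact limit is passed as an argument because the text leaves N unspecified.

```shell
# needs_l2 IMPACT_PCT IMPACT_LIMIT_PCT MINUTES_ELAPSED DATA_RISK(yes|no)
# Prints one line per hard criterion that is met.
# Returns 0 when at least one is met (escalate), 1 otherwise (stay at L1).
needs_l2() {
  local impact="$1" limit="$2" minutes="$3" data_risk="$4" status=1
  if awk -v i="$impact" -v l="$limit" 'BEGIN { exit !(i > l) }'; then
    echo "user impact ${impact}% exceeds ${limit}%"
    status=0
  fi
  if [ "$minutes" -gt 30 ]; then
    echo "not mitigated within 30 minutes (${minutes} min elapsed)"
    status=0
  fi
  if [ "$data_risk" = "yes" ]; then
    echo "data corruption risk"
    status=0
  fi
  return "$status"
}
```

The printed reasons double as the evidence summary that L1 hands to L2 along with the timeline and command excerpts.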

False positives & fast rule-outs

Keep a dedicated section for known false positives (noisy metrics, dependency maintenance windows, sampling issues) with the 2–3 fastest exclusion commands; branching them off the main diagnostic tree reduces noise.

If confirmed false positive: suggested silence duration, labels, follow-up ticket (tune threshold or alert rule).
Cross-reference the trigger section so thresholds do not contradict.

Aftermath: monitoring & runbook updates

After resolution, check whether alert thresholds, dashboards, or unclear runbook steps need updates; link to a postmortem template when applicable.

Escalation path: L1 → L2 → management criteria (shorten to “same as § Escalation” if identical).
Rollback: flags, deploy version, DB patch order (echo mitigation section).
Documentation debt: owner, due date, verification (cross-link tickets).

Runbook validity test strategy (periodic drills):

# Runbook Validity Checklist (run quarterly)

## Drill Methods
1. Inject fault in staging environment (chaos engineering)
   # Assumes payment-svc exposes a custom chaos endpoint; stock Spring Boot Actuator has none
   kubectl exec -n payments-staging deployment/payment-svc -- \
     curl -X POST http://localhost:8080/actuator/chaos/pool-exhaust
   # Then follow runbook steps, record actual time taken

2. Dry-run review
   - New team member reads runbook independently and attempts to narrate steps
   - Record sticking points -> update documentation

## Validity Criteria
- [ ] All commands executable on current version (no deprecated APIs)
- [ ] Each step "expected output" matches actual environment
- [ ] New on-call member (no background) can complete diagnosis in 10 min
- [ ] Escalation conditions match current SLA definitions
- [ ] External links (dashboard, PD policy, related runbooks) all accessible

## Revision Log
| Date | Author | Change | Version |
|------|--------|--------|---------|
| 2024-03-20 | @sre-alice | Added connection pool diagnostic step 3 | v3 |
| 2024-01-15 | @sre-bob | Updated escalation threshold | v2 |
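
Part of the link check in the validity criteria can be automated for links that point at sibling runbooks in the same repo. This is a rough sketch with a hypothetical helper name, `broken_relative_links`; it only handles relative targets of the form `./file` without spaces and does not test dashboards or PagerDuty URLs.

```shell
# broken_relative_links FILE
# Prints relative Markdown link targets (./path) that do not exist next to FILE.
# Returns 1 when any target is missing, 0 otherwise.
broken_relative_links() {
  local file="$1" dir target status=0
  dir="$(dirname "$file")"
  # Extract "](./target)" fragments, then strip the surrounding "](" and ")".
  for target in $(grep -o '](\./[^)]*)' "$file" | sed 's/^](//; s/)$//'); do
    if [ ! -e "$dir/$target" ]; then
      echo "broken link: $target"
      status=1
    fi
  done
  return "$status"
}
```

Running it during the quarterly drill catches renamed or deleted runbooks such as `./db-disk-full.md` before on-call hits a dead link.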

Runbook outline lab

Set a title and optional service name, select sections, and generate Markdown you can paste into a wiki or repo—then replace placeholders with real alerts and commands.

Runbook title

Service / scope (optional, one line)

Sections to include: Scope & dependencies; Triggers & pre-checks; Diagnostic steps; Mitigate & rollback; Escalation; False-positive rule-outs; Post-incident updates

Mark unknowns with <TBD>; agents must replace them with real links, queries, and thresholds from the repo—never leave fabricated values.
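
Whether any placeholder is still unresolved is easy to check before a runbook is published. A minimal sketch, assuming the runbook is a single Markdown file; `check_no_tbd` is a hypothetical helper name.

```shell
# check_no_tbd FILE
# Lists lines that still contain a <TBD> marker (with line numbers, on stderr)
# and returns 1 if any remain, 0 if none do, 2 if the file cannot be read.
check_no_tbd() {
  local file="$1" count
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  count="$(grep -c '<TBD>' "$file" || true)"
  if [ "$count" -gt 0 ]; then
    grep -n '<TBD>' "$file" >&2
    echo "$count line(s) still contain <TBD>"
    return 1
  fi
  echo "no <TBD> markers remain"
}
```

Wired into a pre-merge check, this keeps placeholders visible in drafts while preventing them from reaching the on-call copy.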

---
name: runbook-authoring
description: Generate ops runbook skeletons from alerts and architecture context
model: claude-sonnet-4-5
---

# Required Runbook sections
sections:
  - Header: service/version/owner/trigger alerts
  - Scope: environment/audience permissions/dependencies/out of scope
  - Trigger conditions: PromQL/thresholds/business correlation
  - Pre-checks: read-only commands to confirm context
  - Diagnostic steps: command + expected output + on failure
  - Mitigate & rollback: ordered by risk (low to high)
  - Escalation path: L1→L2→management hard criteria
  - False positives & fast rule-outs
  - Aftermath: test records and documentation debt

# Step format convention
step_format: |
  ### Step N: [description] ([read-only | mutating])
  Command: `kubectl ...`
  Expected output: [specific content]
  On failure: [next step or escalation criteria]

# Forbidden
forbidden:
  - Do not fabricate metric names, cluster names, or config values
  - Do not guess when information is insufficient; output a <TBD> placeholder instead
  - Do not omit permission notes or expected output
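
The step format convention can be spot-checked on generated output. The following sketch uses a hypothetical helper, `untagged_steps`, and assumes steps are written as `### Step N:` headings as in the convention above; it flags any step heading that lacks a read-only or mutating tag.

```shell
# untagged_steps FILE
# Prints "### Step N:" headings that lack a (read-only ...) or (mutating ...) tag.
# Returns 1 when any are found, 0 when every step is tagged, 2 on unreadable file.
untagged_steps() {
  local file="$1"
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  # The second grep succeeds only when it prints at least one untagged heading.
  if grep -E '^### Step [0-9]+:' "$file" | grep -Ev '\((read-only|mutating)[^)]*\)'; then
    return 1
  fi
  return 0
}
```

A heading such as `### Step 2: Check slow query logs (read-only)` passes, while one without a tag is printed so the author can add it.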
