Incident postmortem

Guide agents to fill postmortems with timeline, impact metrics, root-cause analysis, and trackable actions—blameless culture, five whys for systemic factors, and acceptance criteria per action item.

The summary should include start/recovery time, affected users or transaction share, and whether data was corrupted; redact sensitive details while keeping enough for learning.

Separate direct triggers from systemic factors (monitoring gaps, release process, capacity planning); embed five whys as a verifiable chain, not hollow questions.

Every action needs tracking: owner, due date, verification method; link tickets or epics so follow-ups do not vanish.

Blameless principles

Systems & process focus

Describe which mechanisms failed: late alerts, missing runbooks, approval bottlenecks. Do not reduce root cause to someone being “careless”; for human actions, cite training, tooling, or checklist gaps.

Psychological safety & ownership

Encourage full timelines and honest notes; owners drive action items, not blame lists. External vs internal narratives can be tiered with appropriate redaction.

Detection: why didn’t we alert or auto-mitigate earlier—signals, thresholds, on-call routing.
Response: was escalation smooth; were delays from missing info or permissions.
Comms: internal/external timing and tone; avoid accusatory language.

Complete incident postmortem document template (Markdown, compatible with Confluence / Notion):

# Incident Postmortem: [Service] [Short Description]
**Incident ID**: INC-2024-0315
**Severity**: P1
**Time window**: 2024-03-15 14:02 UTC — 2024-03-15 16:48 UTC (2h 46min)
**Author**: @on-call-engineer
**Review meeting**: 2024-03-17 10:00 UTC

## Summary
Payment service error rate spiked to 12% at 14:02, lasting 2h 46min,
affecting approximately 8% of checkout requests.
No data corruption; all failed requests returned clear error messages to users.

## Severity Definition
| Level | Criteria |
|-------|----------|
| P0 | Core service fully down, >50% of users affected, no workaround |
| P1 | Core path severely degraded, >20% of users, SLO violated |
| P2 | Partial feature impaired, workaround exists, SLO not violated |
| P3 | Experience degraded, no user-visible errors |

## Timeline (UTC)
| Time  | Event |
|-------|-------|
| 13:45 | Deployed payment-svc v2.4.1 to production (rolling release) |
| 14:02 | Error rate alert triggered (threshold >1% for 5 min) |
| 14:08 | On-call acknowledged, began investigation |
| 14:25 | Confirmed correlation with new version, decided to rollback |
| 14:35 | Rollback complete, error rate began declining |
| 14:48 | Error rate returned to normal, alert resolved |

## Impact Assessment
- Affected users: ~8% of checkout requests failed (~2,400 transactions)
- SLI impact: Success rate dropped from 99.95% to 88%, violating 99.9% SLO
- Data integrity: No corruption; idempotency design prevented duplicate charges

## Root Cause Analysis (see Five Whys below)
**Direct cause**: DB connection pool misconfiguration in v2.4.1 caused pool exhaustion

## Action Items
See SMART format below

If compliance investigations are required, separate them from the blameless learning review: learning doc vs investigation channel—do not merge into one external narrative.

Five whys

Start from an observable phenomenon; each “why” should map to checkable evidence (logs, config, change records, monitoring captures). If multiple causes branch, record branches or converge the main line first and add a “contributing factors” section.

Real incident example: payment service connection pool exhausted, full five-step walkthrough:

## Five Whys Root Cause Analysis

**Phenomenon (starting point)**
Payment API error rate spiked to 12% at 14:02; logs show "connection pool exhausted"

**Why 1**: Why was the connection pool exhausted?
→ DB connection pool max_size was incorrectly changed from 50 to 5
  Evidence: `kubectl get configmap payment-db-config -o yaml | grep max_pool_size` → output: 5
  Change record: v2.4.1 PR #892 modified db_config.yaml

**Why 2**: Why was the misconfiguration not caught?
→ PR review passed, but CI had no range validation for connection pool parameters
  Evidence: CI pipeline logs show the config validation step skipped pool parameter validation

**Why 3**: Why did CI not validate the connection pool parameter?
→ The config validation script only checked required fields, not valid numeric ranges
  Evidence: scripts/validate-config.sh checks max_pool_size is_set only, not >= 10

**Why 4**: Why did the validation script not check numeric ranges?
→ The script was originally written only for required fields and was never updated
  Evidence: git log scripts/validate-config.sh shows last modification was 18 months ago

**Why 5**: Why was there no staging load-test gate for config changes?
→ Connection pool config was classified as "low-risk change", exempted from staging validation
  Evidence: Release policy document marks db_config.yaml changes as "minor"

**Systemic contributing factors**:
1. No automated range validation for configuration parameters
2. Change risk classification rules too coarse-grained
3. Connection pool exhaustion alert threshold (>1% for 5 min) too conservative—should detect earlier

Avoid circular reasoning: each level should move toward design or process, not tautology.
By the fifth why, aim for actionable levers: monitoring, SLOs, release gates, capacity models, docs, training.
Distinguish from the direct trigger: the trigger is often the event; the chain exposes contributing factors and failed defenses.

Postmortem flow

  [ Detect: alert / customer report / internal find ]
        │
        ▼
  ┌─────────────┐     Stabilize: throttle, rollback, flags, isolate bad instances
  │ Stabilize    │──── Record: timestamps, operator, commands or ticket links
  └─────────────┘
        │
        ▼
  ┌─────────────┐     UTC + local TZ; key decisions, escalations, external comms
  │ Timeline     │──── Align changes, deploys, config, dependency events
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Impact: user/txn %, SLIs, data integrity
  │ Impact       │──── Blameless phrasing: failure modes and detection gaps
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Direct cause + five whys → systemic contributors
  │ Root cause   │──── Separate “trigger” vs “why we didn’t catch/recover sooner”
  └─────────────┘
        │
        ▼
  ┌─────────────┐     owner / due / acceptance / ticket links
  │ Actions      │──── Prevent recurrence + shorten detect/recover (optional priority)
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Internal share; redact sensitive bits; optional public summary
  │ Share learnings │
  └─────────────┘

Action items & tracking

Tie each action to “how we know it’s done”: merged PR, dashboard update, runbook revision ID, drill sign-off. When drafting, write “TBD + suggested role” instead of dangling sentences without an owner.

SMART format action items example (for the incident above):

## Action Items (SMART Format)

### Short-term (within 1 week)
| # | Action | Owner | Due | Acceptance Criteria | Ticket |
|---|--------|-------|-----|---------------------|--------|
| 1 | Add range validation (>=10) for max_pool_size to CI | @backend-dev | 2024-03-22 | CI returns non-zero exit code for max_pool_size < 10 | #1456 |
| 2 | Update release policy: db_config changes upgraded to "medium-risk", require staging validation | @release-eng | 2024-03-22 | Policy doc PR merged, release captain confirms awareness | #1457 |

### Medium-term (within 1 month)
| # | Action | Owner | Due | Acceptance Criteria | Ticket |
|---|--------|-------|-----|---------------------|--------|
| 3 | Add connection pool utilization alert (>80% for 2 min -> warning) | @sre-team | 2024-04-15 | Grafana dashboard visible, test alert fires and notifies on-call | #1460 |
| 4 | Add JSON Schema validation rules for all DB config parameters | @platform | 2024-04-15 | Schema file PR merged, CI integration passing | #1461 |

### Open Questions (require further discussion)
- Should we introduce canary release for fast detection of connection pool exhaustion? (discuss at next arch review)
- Is staging environment configured with sufficient connection load testing? (@infra to evaluate)

Postmortem meeting facilitation guide:

# Postmortem Meeting Facilitation Guide

## Roles
- Facilitator: controls pace, keeps discussion focused on systems not individuals
- Scribe: records timeline, root causes, action items in real time
- Technical Lead: provides technical detail, does not drive conclusions
- All participants: focus on the issue, not the person; encourage honest sharing

## Meeting Structure (60 min example)
- 0-5 min:  Facilitator sets "blameless" tone and meeting objectives
- 5-15 min: Collaboratively confirm the timeline (supplement only, no debate)
- 15-30 min: Impact assessment and data confirmation
- 30-45 min: Walk through Five Whys root cause chain (facilitator controls depth)
- 45-58 min: Draft action items; confirm owner and due date for each
- 58-60 min: Wrap up; agree on postmortem document publish time

## Phrases to Avoid (facilitator intervenes)
- "It's your fault" / "You shouldn't have" -> reframe as "How can this process/tool improve?"
- "Always like this" / "Never" -> stick to facts of this incident, don't generalize
- "It's simple, just..." -> avoid hindsight bias

Cross-link epics/tickets; meeting notes reference the same IDs.
Set review dates: short-term fixes vs structural improvements.
Keep an “open questions” section so verbal agreements are not lost.

Five-whys draft builder

Enter the phenomenon and up to five whys to produce a Markdown snippet for the postmortem—handy right after a war room, then validate each layer against evidence.

Phenomenon or direct trigger (starting point)

Why 1 Why 2 Why 3 Why 4 Why 5

Output

Empty lines are skipped; validate each layer with logs and change records—no invented chains. Keep wording blameless: describe mechanism failure, not moral judgment of individuals.

---
name: incident-postmortem
description: Draft blameless postmortems from timelines and logs
model: claude-sonnet-4-5
---

# Required postmortem sections
sections:
  - Summary (time window, % users affected, data corruption status)
  - Severity level (P0/P1/P2/P3 definition and current determination)
  - Timeline (UTC, include deploy/config change events)
  - Impact assessment (SLI data, user/transaction percentages)
  - Root cause analysis (direct cause + five whys + systemic factors)
  - Action items (SMART: owner / due / acceptance criteria / ticket)
  - Open questions

# Blameless writing rules
blameless_rules:
  - Describe mechanism failures, not individual fault
  - Human actions: analyze training/tooling/checklist gaps instead
  - Action item owners drive improvements, not blame

# Five Whys requirements
five_whys:
  - Each why includes verifiable evidence (logs/config/change record)
  - Fifth why leads to actionable levers (monitoring/process/docs)
  - Avoid circular reasoning and tautologies

All skills More skills