Incident postmortem
Guide agents to fill postmortems with timeline, impact metrics, root-cause analysis, and trackable actions—blameless culture, five whys for systemic factors, and acceptance criteria per action item.
The summary should include start/recovery time, affected users or transaction share, and whether data was corrupted; redact sensitive details while keeping enough for learning.
Separate direct triggers from systemic factors (monitoring gaps, release process, capacity planning); embed five whys as a verifiable chain, not hollow questions.
Every action needs tracking: owner, due date, verification method; link tickets or epics so follow-ups do not vanish.
Blameless principles
Systems & process focus
Describe which mechanisms failed: late alerts, missing runbooks, approval bottlenecks. Do not reduce root cause to someone being “careless”; for human actions, cite training, tooling, or checklist gaps.
Psychological safety & ownership
Encourage full timelines and honest notes; owners drive action items, not blame lists. External vs internal narratives can be tiered with appropriate redaction.
- Detection: why didn’t we alert or auto-mitigate earlier—signals, thresholds, on-call routing.
- Response: was escalation smooth; were delays from missing info or permissions.
- Comms: internal/external timing and tone; avoid accusatory language.
Complete incident postmortem document template (Markdown, compatible with Confluence / Notion):
# Incident Postmortem: [Service] [Short Description]
**Incident ID**: INC-2024-0315
**Severity**: P1
**Time window**: 2024-03-15 14:02 UTC — 2024-03-15 16:48 UTC (2h 46min)
**Author**: @on-call-engineer
**Review meeting**: 2024-03-17 10:00 UTC
## Summary
Payment service error rate spiked to 12% at 14:02, lasting 2h 46min,
affecting approximately 8% of checkout requests.
No data corruption; all failed requests returned clear error messages to users.
## Severity Definition
| Level | Criteria |
|-------|----------|
| P0 | Core service fully down, >50% of users affected, no workaround |
| P1 | Core path severely degraded, >20% of users, SLO violated |
| P2 | Partial feature impaired, workaround exists, SLO not violated |
| P3 | Experience degraded, no user-visible errors |
## Timeline (UTC)
| Time | Event |
|-------|-------|
| 13:45 | Deployed payment-svc v2.4.1 to production (rolling release) |
| 14:02 | Error rate alert triggered (threshold >1% for 5 min) |
| 14:08 | On-call acknowledged, began investigation |
| 14:25 | Confirmed correlation with new version, decided to rollback |
| 14:35 | Rollback complete, error rate began declining |
| 14:48 | Error rate returned to normal, alert resolved |
## Impact Assessment
- Affected users: ~8% of checkout requests failed (~2,400 transactions)
- SLI impact: Success rate dropped from 99.95% to 88%, violating 99.9% SLO
- Data integrity: No corruption; idempotency design prevented duplicate charges
## Root Cause Analysis (see Five Whys below)
**Direct cause**: DB connection pool misconfiguration in v2.4.1 caused pool exhaustion
## Action Items
See SMART format below
If compliance investigations are required, separate them from the blameless learning review: learning doc vs investigation channel—do not merge into one external narrative.
Five whys
Start from an observable phenomenon; each “why” should map to checkable evidence (logs, config, change records, monitoring captures). If multiple causes branch, record branches or converge the main line first and add a “contributing factors” section.
Real incident example: payment service connection pool exhausted, full five-step walkthrough:
## Five Whys Root Cause Analysis
**Phenomenon (starting point)**
Payment API error rate spiked to 12% at 14:02; logs show "connection pool exhausted"
**Why 1**: Why was the connection pool exhausted?
→ DB connection pool max_size was incorrectly changed from 50 to 5
Evidence: `kubectl get configmap payment-db-config -o yaml | grep max_pool_size` → output: 5
Change record: v2.4.1 PR #892 modified db_config.yaml
**Why 2**: Why was the misconfiguration not caught?
→ PR review passed, but CI had no range validation for connection pool parameters
Evidence: CI pipeline logs show the config validation step skipped pool parameter validation
**Why 3**: Why did CI not validate the connection pool parameter?
→ The config validation script only checked required fields, not valid numeric ranges
Evidence: scripts/validate-config.sh checks max_pool_size is_set only, not >= 10
**Why 4**: Why did the validation script not check numeric ranges?
→ The script was originally written only for required fields and was never updated
Evidence: git log scripts/validate-config.sh shows last modification was 18 months ago
**Why 5**: Why was there no staging load-test gate for config changes?
→ Connection pool config was classified as "low-risk change", exempted from staging validation
Evidence: Release policy document marks db_config.yaml changes as "minor"
**Systemic contributing factors**:
1. No automated range validation for configuration parameters
2. Change risk classification rules too coarse-grained
3. Connection pool exhaustion alert threshold (>1% for 5 min) too conservative—should detect earlier
- Avoid circular reasoning: each level should move toward design or process, not tautology.
- By the fifth why, aim for actionable levers: monitoring, SLOs, release gates, capacity models, docs, training.
- Distinguish from the direct trigger: the trigger is often the event; the chain exposes contributing factors and failed defenses.
Postmortem flow
[ Detect: alert / customer report / internal find ]
│
▼
┌─────────────┐ Stabilize: throttle, rollback, flags, isolate bad instances
│ Stabilize │──── Record: timestamps, operator, commands or ticket links
└─────────────┘
│
▼
┌─────────────┐ UTC + local TZ; key decisions, escalations, external comms
│ Timeline │──── Align changes, deploys, config, dependency events
└─────────────┘
│
▼
┌─────────────┐ Impact: user/txn %, SLIs, data integrity
│ Impact │──── Blameless phrasing: failure modes and detection gaps
└─────────────┘
│
▼
┌─────────────┐ Direct cause + five whys → systemic contributors
│ Root cause │──── Separate “trigger” vs “why we didn’t catch/recover sooner”
└─────────────┘
│
▼
┌─────────────┐ owner / due / acceptance / ticket links
│ Actions │──── Prevent recurrence + shorten detect/recover (optional priority)
└─────────────┘
│
▼
┌─────────────┐ Internal share; redact sensitive bits; optional public summary
│ Share learnings │
└─────────────┘
Action items & tracking
Tie each action to “how we know it’s done”: merged PR, dashboard update, runbook revision ID, drill sign-off. When drafting, write “TBD + suggested role” instead of dangling sentences without an owner.
SMART format action items example (for the incident above):
## Action Items (SMART Format)
### Short-term (within 1 week)
| # | Action | Owner | Due | Acceptance Criteria | Ticket |
|---|--------|-------|-----|---------------------|--------|
| 1 | Add range validation (>=10) for max_pool_size to CI | @backend-dev | 2024-03-22 | CI returns non-zero exit code for max_pool_size < 10 | #1456 |
| 2 | Update release policy: db_config changes upgraded to "medium-risk", require staging validation | @release-eng | 2024-03-22 | Policy doc PR merged, release captain confirms awareness | #1457 |
### Medium-term (within 1 month)
| # | Action | Owner | Due | Acceptance Criteria | Ticket |
|---|--------|-------|-----|---------------------|--------|
| 3 | Add connection pool utilization alert (>80% for 2 min -> warning) | @sre-team | 2024-04-15 | Grafana dashboard visible, test alert fires and notifies on-call | #1460 |
| 4 | Add JSON Schema validation rules for all DB config parameters | @platform | 2024-04-15 | Schema file PR merged, CI integration passing | #1461 |
### Open Questions (require further discussion)
- Should we introduce canary release for fast detection of connection pool exhaustion? (discuss at next arch review)
- Is staging environment configured with sufficient connection load testing? (@infra to evaluate)
Postmortem meeting facilitation guide:
# Postmortem Meeting Facilitation Guide
## Roles
- Facilitator: controls pace, keeps discussion focused on systems not individuals
- Scribe: records timeline, root causes, action items in real time
- Technical Lead: provides technical detail, does not drive conclusions
- All participants: focus on the issue, not the person; encourage honest sharing
## Meeting Structure (60 min example)
- 0-5 min: Facilitator sets "blameless" tone and meeting objectives
- 5-15 min: Collaboratively confirm the timeline (supplement only, no debate)
- 15-30 min: Impact assessment and data confirmation
- 30-45 min: Walk through Five Whys root cause chain (facilitator controls depth)
- 45-58 min: Draft action items; confirm owner and due date for each
- 58-60 min: Wrap up; agree on postmortem document publish time
## Phrases to Avoid (facilitator intervenes)
- "It's your fault" / "You shouldn't have" -> reframe as "How can this process/tool improve?"
- "Always like this" / "Never" -> stick to facts of this incident, don't generalize
- "It's simple, just..." -> avoid hindsight bias
- Cross-link epics/tickets; meeting notes reference the same IDs.
- Set review dates: short-term fixes vs structural improvements.
- Keep an “open questions” section so verbal agreements are not lost.
Five-whys draft builder
Enter the phenomenon and up to five whys to produce a Markdown snippet for the postmortem—handy right after a war room, then validate each layer against evidence.
Empty lines are skipped; validate each layer with logs and change records—no invented chains. Keep wording blameless: describe mechanism failure, not moral judgment of individuals.
---
name: incident-postmortem
description: Draft blameless postmortems from timelines and logs
model: claude-sonnet-4-5
---
# Required postmortem sections
sections:
- Summary (time window, % users affected, data corruption status)
- Severity level (P0/P1/P2/P3 definition and current determination)
- Timeline (UTC, include deploy/config change events)
- Impact assessment (SLI data, user/transaction percentages)
- Root cause analysis (direct cause + five whys + systemic factors)
- Action items (SMART: owner / due / acceptance criteria / ticket)
- Open questions
# Blameless writing rules
blameless_rules:
- Describe mechanism failures, not individual fault
- Human actions: analyze training/tooling/checklist gaps instead
- Action item owners drive improvements, not blame
# Five Whys requirements
five_whys:
- Each why includes verifiable evidence (logs/config/change record)
- Fifth why leads to actionable levers (monitoring/process/docs)
- Avoid circular reasoning and tautologies