Case topic · not in the main 20

Triage & incidents

Single-page topic · Maps to Observability & incidents

Overview: under alert and customer pressure, drive agents through a fixed loop (gather evidence → hypothesize → verify → communicate) and have them draft status-page and stakeholder updates. Separate steps the agent may execute automatically from escalation-only actions; rollbacks, deletes, and scale changes require explicit confirmation. Timelines, impact, and root-cause placeholders should export into postmortem templates. Set a fixed cadence for internal and external comms; do not promise a root cause before it is known; preserve log snippets, config changes, and deploy evidence for audit.
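The split between auto-executable steps and confirmation-gated actions can be sketched as a small guard. This is a minimal illustration, not a prescribed implementation: the `Action` type, the `can_run` function, and the action kind strings are hypothetical names; only the three gated categories (rollbacks, deletes, scale changes) come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

# Action kinds that must never run without explicit human confirmation.
# The three categories come from the overview; the string names are assumed.
CONFIRM_REQUIRED = {"rollback", "delete", "scale"}


@dataclass
class Action:
    kind: str    # e.g. "fetch_logs", "rollback" (hypothetical identifiers)
    target: str  # service or resource the action applies to


def can_run(action: Action, confirmed_by: Optional[str] = None) -> bool:
    """Return True if the agent may execute the action right now."""
    if action.kind in CONFIRM_REQUIRED:
        # Gated actions need a non-empty confirmer identity.
        return bool(confirmed_by and confirmed_by.strip())
    # Everything else (read-only evidence gathering, drafting) runs freely.
    return True
```

Recording who confirmed, rather than a bare boolean, also leaves a trace that can be kept with the audit evidence.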

Relationship to the case hub: aligns with the category Observability & incidents, especially the cases "Incident RCA and impact triage", "Rollback, scale-out, and change drafts", and "On-call handoff and summary". This page is a single-topic overview; open those case pages for full playbooks.

Implementation notes (single page)

Severity and SLA

Encode P1/P2 definitions and response SLAs in the skill so that severity is not debated during on-call; recovery prioritizes stopping the bleeding before root-cause analysis, with steps ordered by their dependencies.
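One way to encode this is a severity table plus an ordering rule, sketched below. The SLA durations are placeholders (the source does not state any numbers), and `RESPONSE_SLA`, `response_deadline`, `order_recovery`, and the `phase` field are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Placeholder response SLAs; substitute the values your organization agreed on.
RESPONSE_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}


def response_deadline(severity: str, alerted_at: datetime) -> datetime:
    """Latest time by which a responder must acknowledge the alert."""
    return alerted_at + RESPONSE_SLA[severity]


def order_recovery(steps: list) -> list:
    """Put mitigation ("stop the bleeding") steps ahead of root-cause work.

    sorted() is stable, so steps keep their listed order within each phase;
    this sketch assumes that listed order already reflects dependencies.
    """
    return sorted(steps, key=lambda s: 0 if s["phase"] == "mitigate" else 1)
```

For example, `response_deadline("P1", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))` yields 12:15 UTC under the placeholder table.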

Evidence and postmortems

Outputs should paste cleanly into incident docs; in external-facing text, label every unconfirmed item as a hypothesis.
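The labeling rule can be enforced at render time rather than left to the author of each update. The following is a minimal sketch; `Finding`, `render_external`, and the exact label wording are assumptions, not part of the source.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    text: str
    confirmed: bool  # True only once the finding has been verified


def render_external(findings: list) -> str:
    """Render findings as plain-text bullets for a status page or stakeholder note.

    Unconfirmed findings are always prefixed so that they cannot be read
    as an established root cause.
    """
    lines = []
    for f in findings:
        prefix = "" if f.confirmed else "Hypothesis (unconfirmed): "
        lines.append(f"- {prefix}{f.text}")
    return "\n".join(lines)
```

Because the output is plain bulleted text, it can be pasted into an incident doc without reformatting.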

Trigger and exit (example)

Trigger: P1 alert or error rate breach. Output: timeline draft and suggested next steps. Exit: service restored or escalated.
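The trigger and exit conditions above reduce to two predicates, sketched here. The function names and the default error-rate threshold are hypothetical; the source specifies only "P1 alert or error rate breach" and "service restored or escalated".

```python
def should_trigger(severity: str, error_rate: float, threshold: float = 0.05) -> bool:
    """Start the triage flow on a P1 alert or an error-rate breach.

    The 0.05 default is a placeholder; use the threshold from your alerting rules.
    """
    return severity == "P1" or error_rate > threshold


def should_exit(service_restored: bool, escalated: bool) -> bool:
    """Stop the flow once service is restored or the incident has been escalated."""
    return service_restored or escalated
```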
