Alerting & on-call
Help Agents align alert policy with severity, blast radius, and customer-facing SLOs, and produce actionable on-call steps: confirm, mitigate, communicate, postmortem.
A SKILL should separate “page immediately” from “ticket-level” events, and document routing, silencing windows, deduplication, and dependency suppression so that alert storms do not burn out responders.
Bind a runbook to each critical alert: dependency checks, recent deploys, mitigations (scale, degrade, failover), rollback or feature-flag location, and when to escalate to tier-2 or leadership.
When an Agent assists, emphasize factual logging (timeline, blast radius, actions tried, and current state) so that handoffs and post-incident reports draw on the same record.
- Alerts must be explainable against user or business metrics; forbid “CPU high” without service and tenant context.
- Playbooks should define single-responder authority limits and when a war room is required.
- Pair with logging and tracing skills: embed typical queries and dashboard link placeholders in runbooks.
On-call response path
[ Alert fires / page escalates ]
       │
       ▼
┌─────────────┐  Ack: who is on it, next update ETA
│   Triage    │──── Compare to SLO / customer impact / change window
└─────────────┘
       │
       ▼
┌─────────────┐  Runbook: deps, signature queries, recent deploys/config
│  Mitigate   │──── Mitigate first: throttle, degrade, scale, rollback, flags
└─────────────┘
       │
       ▼
┌─────────────┐  Status: internal channel + external status template if needed
│ Communicate │──── Escalate on threshold or missing permissions: tier-2 / mgmt / vendor
└─────────────┘
       │
       ▼
┌─────────────┐  Stable: clear alert, log timeline, open follow-ups
│   Recover   │──── Cross-check root-cause hypotheses with metrics, logs, traces
└─────────────┘
Keep one primary thread per incident: merge duplicate alerts into it; on handoff, pin current hypothesis, actions taken, and next steps in the ticket or channel topic.
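The "one primary thread per incident" rule can be sketched as a small merge step. This is a minimal illustration, not a prescribed implementation: the `Incident` class, the `merge_alerts` function, and the choice of `service` and `env` as the merge key are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """One primary thread per incident; duplicates are folded in, not re-opened."""
    key: tuple
    alert_count: int = 0
    sources: set = field(default_factory=set)
    hypothesis: str = ""
    actions_taken: list = field(default_factory=list)
    next_steps: list = field(default_factory=list)

    def handoff_note(self) -> str:
        """Text to pin in the ticket or channel topic at handoff."""
        return (
            f"Hypothesis: {self.hypothesis or 'none yet'} | "
            f"Actions: {'; '.join(self.actions_taken) or 'none'} | "
            f"Next: {'; '.join(self.next_steps) or 'none'}"
        )


def merge_alerts(alerts, key_labels=("service", "env")):
    """Fold alerts that share the key labels into a single incident each.

    The key labels are a hypothetical choice; pick whatever identifies
    "the same incident" in your environment.
    """
    incidents = {}
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in key_labels)
        if key not in incidents:
            incidents[key] = Incident(key=key)
        incidents[key].alert_count += 1
        incidents[key].sources.add(alert.get("alertname", "unknown"))
    return incidents
```

Two alerts for the same service and environment end up in one incident with `alert_count == 2`, and `handoff_note()` produces the line to pin when the shift changes.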
Alert routing and noise reduction
Routing decides who gets notified, on which channel, and at what priority. Split routes by service ownership, tenant tier, environment (prod/stage), and customer contracts. Pagers and chat bots should target the current on-call rotation, not a static individual.
- Grouping and suppression: collapse repeated alerts from the same source into counts; suppress child services when upstream is failing.
- Silences and maintenance: planned changes use time-bounded silences with reason; no “silent forever” without a tracked ticket.
- Escalation ladder: an alert left unacknowledged after N minutes, a severity bump, or a customer-visible outage each trigger escalation to tier-2 or a phone call.
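The silence rule in the list above (time-bounded, with a reason, never open-ended without a ticket) can be checked mechanically before a silence is created. The sketch below is an assumption-laden illustration: the `Silence` fields and the `silence_problems` helper are hypothetical names, not part of any monitoring tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Silence:
    matcher: str                    # e.g. "team=payments" (free-form here)
    starts_at: datetime
    ends_at: Optional[datetime]     # None models an open-ended silence
    reason: str = ""
    ticket: str = ""


def silence_problems(silence: Silence) -> list:
    """Return policy violations for a proposed silence (empty list means OK)."""
    problems = []
    if not silence.reason.strip():
        problems.append("missing reason")
    if silence.ends_at is None:
        if not silence.ticket.strip():
            problems.append("open-ended silence without a tracked ticket")
    elif silence.ends_at <= silence.starts_at:
        problems.append("end time is not after start time")
    return problems


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    planned = Silence("team=payments", start, start + timedelta(hours=2),
                      reason="planned database migration")
    print(silence_problems(planned))
```

A stricter policy could also cap the maximum duration; that limit is left out because the text above does not name one.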
Complete Prometheus alert rules YAML example:
# prometheus/alerts/payment-svc.yml
groups:
  - name: payment-svc-slo
    interval: 30s
    rules:
      # P1: SLO violation - high error rate
      - alert: PaymentHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-svc"}[5m]))
            > 0.01
        for: 5m
        labels:
          severity: "P1"
          team: "payments"
          slo: "availability"
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} exceeds SLO"
          description: "Error rate has been above 1% for 5 minutes. Current: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki/runbooks/payment-svc/high-error-rate"
          dashboard_url: "https://grafana/d/payment-golden-signals"
      # P2: latency alert
      - alert: PaymentHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="payment-svc"}[5m]))
            by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: "P2"
          team: "payments"
        annotations:
          summary: "Payment p99 latency {{ $value | humanizeDuration }} exceeds 500ms"
          runbook_url: "https://wiki/runbooks/payment-svc/high-latency"
      # Infrastructure alert (disk)
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_free_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: "P2"
          team: "infra"
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"
          # Placeholder path following the pattern above; every alert needs a runbook_url
          runbook_url: "https://wiki/runbooks/infra/disk-space-low"
Alertmanager tiered routing and noise-reduction configuration:
# alertmanager.yml
# Note: match / match_re (and source_match / target_match) still work but are
# deprecated in newer Alertmanager releases in favor of `matchers`.
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'team']   # one notification group per alertname + team pair
  group_wait: 30s                   # wait 30s to batch the first alerts of a new group
  group_interval: 5m                # wait 5m before notifying about new alerts added to an existing group
  repeat_interval: 4h               # re-send still-firing alerts every 4h
  receiver: 'default-webhook'
  routes:
    # P0/P1 → PagerDuty immediate page
    # No `continue` here: paged alerts stop at this route and do not reach the Slack routes below.
    - match_re:
        severity: "P0|P1"
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 30m          # P0/P1 reminder every 30min
    # Payments team → payments Slack channel
    - match:
        team: payments
      receiver: 'slack-payments'
      continue: true                # also continue matching the sibling routes below
    # Infrastructure → infra team
    - match:
        team: infra
      receiver: 'slack-infra'
# Inhibition rules (noise reduction: suppress downstream alerts when upstream fails)
inhibit_rules:
  - source_match:
      severity: P0
      team: infra
    target_match:
      team: payments                # when infra is P0, suppress payments team alerts
    equal: ['env', 'region']        # only when both alerts carry the same env and region
receivers:
  # Alertmanager does not expand ${VAR} itself: render this file at deploy time
  # (or use routing_key_file / api_url_file) before loading it.
  - name: 'default-webhook'         # placeholder: add webhook_configs for the catch-all target
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_INTEGRATION_KEY}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: '{{ if eq .CommonLabels.severity "P0" }}critical{{ else }}error{{ end }}'
  - name: 'slack-payments'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#oncall-payments'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        # Double quotes so that \n becomes a real line break
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.runbook_url }}\n{{ end }}"
  - name: 'slack-infra'             # placeholder: add slack_configs for the infra channel
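To reason about what the inhibition rule in the configuration above suppresses, it can help to model the matching logic by hand. The following is a simplified sketch of inhibition as commonly described for Alertmanager (exact-match labels only, no regex matchers); `matches` and `is_inhibited` are hypothetical helper names, and the real implementation covers more cases.

```python
def matches(labels: dict, matchers: dict) -> bool:
    """True when every matcher label is present with exactly that value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Return True if some firing source alert mutes the target alert.

    A missing label is treated like an empty value, so two alerts that both
    lack an `equal` label are considered equal on it.
    """
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if not matches(source, rule["source_match"]):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


# The rule from the configuration, expressed as a plain dict
RULE = {
    "source_match": {"severity": "P0", "team": "infra"},
    "target_match": {"team": "payments"},
    "equal": ["env", "region"],
}

if __name__ == "__main__":
    infra_down = {"alertname": "NodeDown", "severity": "P0", "team": "infra",
                  "env": "prod", "region": "eu"}
    payments = {"alertname": "PaymentHighLatency", "severity": "P2",
                "team": "payments", "env": "prod", "region": "eu"}
    print(is_inhibited(payments, [infra_down, payments], RULE))
```

Note how broad the rule is: any infra P0 in the same `env` and `region` mutes every payments alert, including ones unrelated to the infra failure.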
SLO and burn rate
User-facing SLOs (availability, latency, errors) should drive alerting. Watch error budget burn as well as instantaneous thresholds: if budget is consumed too fast in a short window, the SLO will be breached by the end of the long window if the trend continues.
- Common pattern: multi-window burn alerts (e.g. 1h + 6h), with the short window catching spikes and the long window catching sustained degradation; page-level alerts name the SLO and the remaining budget share.
- When a SKILL describes an alert, require the SLI, SLO target, and “first chart or query to open”.
- Non-SLO infra alerts (disk, certificates) still need business context or service-tree nodes for correct routing.
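The burn-rate arithmetic behind the multi-window pattern is short enough to write out. In this sketch the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The default thresholds of 14.4 and 6 are values commonly quoted for a 30-day SLO window and are assumptions here, as are all function names.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def burn_alerts(err_1h: float, err_6h: float, slo_target: float,
                fast_threshold: float = 14.4, slow_threshold: float = 6.0) -> list:
    """Return the burn alerts that fire for the given window error ratios."""
    fired = []
    if burn_rate(err_1h, slo_target) >= fast_threshold:
        fired.append("fast-burn (1h)")      # spike
    if burn_rate(err_6h, slo_target) >= slow_threshold:
        fired.append("slow-burn (6h)")      # sustained degradation
    return fired


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: float) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("no error budget: check total_requests and slo_target")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99% SLO: 20% errors over the last hour, 3% over the last six hours
    print(burn_alerts(err_1h=0.20, err_6h=0.03, slo_target=0.99))
    print(round(budget_remaining(1_000_000, 250, 0.999), 2))
```

With a 99% SLO, a 20% error ratio over one hour is a burn rate of about 20, which crosses the fast threshold, while 3% over six hours (burn rate about 3) does not cross the slow one.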
On-call escalation policy configuration example (PagerDuty / OpsGenie equivalent):
# PagerDuty escalation policy (OpsGenie escalation policies follow the same shape)
# Descriptive YAML only, not an importable file: configure the real policy in the PagerDuty UI or Terraform.
# Here escalation_delay_in_minutes means minutes since the page at which a level is notified.
# In the PagerDuty API and Terraform provider the same-named field instead counts minutes
# before escalating away from a rule, so translate the values accordingly.
escalation_policy:
  name: "Payment Service On-call"
  num_loops: 2 # repeat the whole ladder up to 2 times while the incident stays unacknowledged
  rules:
    # Level 1: first responder (immediate notification)
    - escalation_delay_in_minutes: 0
      targets:
        - type: schedule
          id: "payment-oncall-schedule" # current on-call engineer
    # Level 2: unacknowledged after 15min → escalate to L2 expert
    - escalation_delay_in_minutes: 15
      targets:
        - type: user
          id: "payment-tech-lead"
        - type: schedule
          id: "payment-backup-schedule"
    # Level 3: still unacknowledged after 30min → notify management
    - escalation_delay_in_minutes: 30
      targets:
        - type: user
          id: "engineering-manager"
# Alertmanager has no escalation ladder of its own: repeat_interval only re-sends a firing alert,
# so use it together with the PagerDuty escalation policy above.
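The ladder above can be read as a lookup from "minutes unacknowledged" to "which level is now responsible". The sketch below mirrors the three levels from the YAML description; `LADDER` and `current_level` are hypothetical names, and the delays are interpreted as minutes since the page, which is this document's convention and not necessarily how a given paging tool stores them.

```python
# (minutes since the page, targets notified at that point)
LADDER = [
    (0, ["schedule:payment-oncall-schedule"]),
    (15, ["user:payment-tech-lead", "schedule:payment-backup-schedule"]),
    (30, ["user:engineering-manager"]),
]


def current_level(minutes_unacked: float, ladder=LADDER):
    """Return (1-based level, targets) for the highest level reached so far."""
    if minutes_unacked < 0:
        raise ValueError("minutes_unacked must be non-negative")
    level, targets = 1, ladder[0][1]
    for index, (delay, level_targets) in enumerate(ladder, start=1):
        if minutes_unacked >= delay:
            level, targets = index, level_targets
    return level, targets


if __name__ == "__main__":
    for minutes in (0, 10, 15, 45):
        print(minutes, current_level(minutes))
```

An Agent drafting a status update can use the same lookup to state who should currently hold the page and when the next escalation is due.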
Alert detail page
The alert detail page in monitoring or incident tooling is the first screen on-call sees: gather everything needed for a decision into one view so the responder does not have to hop between tabs.
- Title and summary: one line for “what broke vs baseline”; rule name, labels (env, region, tenant).
- Linked panels: SLO/SLI trends, related dashboards, deploy/config timeline for the last 24h, links to similar past incidents.
- Runbook entry: inline or one-click open for this alert type; placeholder to copy on-call summary into chat or tickets.
Runbooks and escalation
Each critical alert binds a runbook: dependency checks, mitigations (scale, degrade, traffic shift), rollback or flag locations, and when to escalate to tier-2 or leadership.
- Clarify “single responder can execute” vs “must pull others”; document escalation (e.g. paid-tenant impact, write failures > 5 minutes).
- Record facts (timeline, blast radius, actions tried, current state) for handoff and executive summaries.
One-line on-call summary
For channel topics, ticket titles, or status-page drafts: fill in the fields to produce a readable one-line summary of the current state.
Format: [severity] service: symptom; current action (append blast radius if provided).
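The format above maps directly onto a small formatter. This is a sketch under one assumption: the source does not say how the blast radius is appended, so the parenthesised "blast radius: ..." suffix is a choice made for the example, as is the function name `oncall_summary`.

```python
def oncall_summary(severity: str, service: str, symptom: str,
                   action: str, blast_radius: str = "") -> str:
    """Build '[severity] service: symptom; current action' with optional blast radius."""
    line = f"[{severity}] {service}: {symptom}; {action}"
    if blast_radius.strip():
        # Suffix style is an assumption; adapt to your channel conventions
        line += f" (blast radius: {blast_radius.strip()})"
    return line


if __name__ == "__main__":
    print(oncall_summary(
        "P1", "payment-svc", "5xx rate at 3%",
        "rolling back latest deploy", "EU tenants",
    ))
```

The resulting line is short enough to paste into a channel topic or ticket title unchanged.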
---
name: alerting-oncall
description: Alert severity and on-call runbook templates
model: claude-sonnet-4-5
---
# Alert rule requirements
alert_rule_requirements:
  - every alert must include runbook_url
  - severity label: P0/P1 (page) | P2/P3 (ticket-level)
  - set a "for" duration so transient spikes do not fire
  - alert message must explain: what broke + relative to what baseline
# Severity definitions
severity_definition:
  P0: core service fully unavailable; immediate page; respond within 15min
  P1: core path severely degraded; PagerDuty page; respond within 1h
  P2: non-core or has workaround; Slack notification; handle on business day
  P3: experience degraded; ticket logged; address in next iteration
# Noise reduction strategy
noise_reduction:
  - group_by: [alertname, team] (merge same combinations)
  - inhibit_rules: suppress downstream alerts when upstream is P0
  - maintenance_silences: must link a ticket; no permanent silences
# Response steps
response_steps:
  1. Assess severity and whether SLO / burn rate is breached
  2. Acknowledge alert; update status (who is on it, next update ETA)
  3. Execute runbook: check recent deploys and dependencies
  4. Log timeline and decide escalation path
  5. After recovery: clear alert, open postmortem ticket
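The alert rule requirements in the skill body lend themselves to an automated check. The sketch below lints a rule that has already been parsed into a plain dictionary (for example from a Prometheus rules file); `lint_rule` and `notification_class` are hypothetical helpers, and the check covers only the requirements that can be verified mechanically.

```python
PAGE_SEVERITIES = {"P0", "P1"}
TICKET_SEVERITIES = {"P2", "P3"}


def lint_rule(rule: dict) -> list:
    """Return requirement violations for one alert rule (empty list means OK)."""
    problems = []
    severity = rule.get("labels", {}).get("severity")
    if severity not in PAGE_SEVERITIES | TICKET_SEVERITIES:
        problems.append("severity label must be one of P0, P1, P2, P3")
    if not rule.get("annotations", {}).get("runbook_url"):
        problems.append("missing runbook_url annotation")
    if not rule.get("for"):
        problems.append("missing for duration")
    return problems


def notification_class(severity: str) -> str:
    """Map a severity to 'page' (P0/P1) or 'ticket' (P2/P3)."""
    if severity in PAGE_SEVERITIES:
        return "page"
    if severity in TICKET_SEVERITIES:
        return "ticket"
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    # Hypothetical rule, used only to exercise the linter
    rule = {
        "alert": "QueueBacklog",
        "expr": "queue_depth > 1000",
        "labels": {"severity": "P2", "team": "payments"},
        "annotations": {"summary": "Queue backlog above 1000"},
    }
    print(lint_rule(rule))
```

Running such a check in CI keeps rules without a runbook or a `for` duration from reaching production. Whether the message explains "what broke vs baseline" still needs human review.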