Alerting & on-call

Help Agents align alert policy with severity, blast radius, and customer-facing SLOs, and produce actionable on-call steps: confirm, mitigate, communicate, postmortem.

A SKILL should separate “page immediately” from “ticket-level” events and document routing, silencing windows, deduplication, and dependency suppression, so that alert storms do not burn out responders.

Bind a runbook to each critical alert: dependency checks, recent deploys, mitigations (scale, degrade, failover), rollback or feature-flag location, and when to escalate to tier-2 or leadership.

When an Agent assists, emphasize factual logging: timeline, blast radius, actions tried, and current state—for handoff and post-incident reports.

  • Alerts must be explainable against user or business metrics; forbid “CPU high” without service and tenant context.
  • Playbooks should define single-responder authority limits and when a war room is required.
  • Pair with logging and tracing skills: embed typical queries and dashboard link placeholders in runbooks.

On-call response path

  [ Alert fires / page escalates ]
        │
        ▼
  ┌─────────────┐     Ack: who is on it, next update ETA
  │ Triage       │──── Compare to SLO / customer impact / change window
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Runbook: deps, signature queries, recent deploys/config
  │ Mitigate     │──── Mitigate first: throttle, degrade, scale, rollback, flags
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Status: internal channel + external status template if needed
  │ Communicate  │──── Escalate on threshold or missing permissions: tier-2 / mgmt / vendor
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Stable: clear alert, log timeline, open follow-ups
  │ Recover      │──── Cross-check root-cause hypotheses with metrics, logs, traces
  └─────────────┘

Keep one primary thread per incident: merge duplicate alerts into it; on handoff, pin current hypothesis, actions taken, and next steps in the ticket or channel topic.

Alert routing and noise reduction

Routing decides who gets notified, on which channel, and at what priority. Split routes by service ownership, tenant tier, environment (prod/stage), and customer contracts; pagers and chat bots should target the current on-call rotation, not a static individual.

  • Grouping and suppression: collapse repeated alerts from the same source into counts; suppress child services when upstream is failing.
  • Silences and maintenance: planned changes use time-bounded silences with reason; no “silent forever” without a tracked ticket.
  • Escalation ladder: unacknowledged after N minutes, severity bump, or customer-visible outage triggers tier-2 or phone.
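The grouping and suppression bullets above can be sketched in a few lines of Python. This is a minimal illustration, not how Alertmanager implements it: the Alert fields, the UPSTREAMS dependency map, and the service names db-primary and checkout-svc are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str      # alert rule name, e.g. "PaymentHighErrorRate"
    service: str   # service that emitted the alert
    severity: str  # "P0".."P3"


# Hypothetical dependency map: service -> upstream services it depends on.
UPSTREAMS = {"payment-svc": {"db-primary"}, "checkout-svc": {"payment-svc"}}


def collapse(alerts):
    """Collapse repeated identical alerts into (alert, count) pairs, first-seen order."""
    return list(Counter(alerts).items())


def suppress_children(alerts):
    """Drop alerts whose direct upstream service is itself alerting."""
    failing = {a.service for a in alerts}
    return [a for a in alerts if not (UPSTREAMS.get(a.service, set()) & failing)]
```

Running suppress_children before collapse leaves only the upstream alert to notify on, with duplicates of it shown as a count.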

Complete Prometheus alert rules YAML example:

# prometheus/alerts/payment-svc.yml
groups:
  - name: payment-svc-slo
    interval: 30s
    rules:
      # P1: SLO violation - high error rate
      - alert: PaymentHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="payment-svc"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: "P1"
          team: "payments"
          slo: "availability"
        annotations:
          summary: "Payment API error rate {{ $value | humanizePercentage }} exceeds SLO"
          description: "Error rate has been above 1% for 5 minutes. Current: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki/runbooks/payment-svc/high-error-rate"
          dashboard_url: "https://grafana/d/payment-golden-signals"

      # P2: latency alert
      - alert: PaymentHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="payment-svc"}[5m]))
            by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: "P2"
          team: "payments"
        annotations:
          summary: "Payment p99 latency {{ $value | humanizeDuration }} exceeds 500ms"
          runbook_url: "https://wiki/runbooks/payment-svc/high-latency"

      # Infrastructure alert (disk)
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_free_bytes{mountpoint="/"}
          / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: "P2"
          team: "infra"
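        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"
          runbook_url: "https://wiki/runbooks/infra/disk-space-low"   # placeholder path; point at your own runbook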

Alertmanager tiered routing and noise-reduction configuration:

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'team']   # group by team; merge same alert combinations
  group_wait: 30s                    # wait 30s to aggregate alerts in the same group
  group_interval: 5m                 # wait 5min before notifying about new alerts added to an already-notified group
  repeat_interval: 4h               # repeat ongoing alerts every 4h
  receiver: 'default-webhook'

  routes:
    # P0/P1 → PagerDuty immediate page
    - match_re:
        severity: "P0|P1"
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 30m          # P0/P1 reminder every 30min
      continue: true                 # keep matching so the owning team's channel is also notified

    # Payments team → payments Slack channel
    - match:
        team: payments
      receiver: 'slack-payments'

    # Infrastructure → infra team
    - match:
        team: infra
      receiver: 'slack-infra'

# Inhibition rules (noise reduction: suppress downstream alerts when upstream fails)
inhibit_rules:
  - source_match:
      severity: P0
      team: infra
    target_match:
      team: payments          # when infra is P0, suppress payments team alerts
    equal: ['env', 'region']   # suppress only within the same env and region; labels missing on both alerts count as equal

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
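      - routing_key: '${PAGERDUTY_INTEGRATION_KEY}'   # placeholder: Alertmanager does not expand env vars; render at deploy time or use routing_key_file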
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: '{{ if eq .CommonLabels.severity "P0" }}critical{{ else }}error{{ end }}'

  - name: 'slack-payments'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'   # placeholder: substitute at deploy time or use api_url_file
        channel: '#oncall-payments'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
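        # double quotes so that YAML turns \n into a real newline
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.runbook_url }}\n{{ end }}"

  # 'default-webhook' and 'slack-infra' are referenced by the routes above and must also be defined here.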

SLO and burn rate

User-facing SLOs (availability, latency, errors) should drive alerting. Watch error budget burn as well as instantaneous thresholds: a burn rate that is too high over a short window means the budget will run out before the SLO period ends if the trend continues.

  • Common pattern: multi-window (e.g. 1h + 6h) burn alerts for spikes vs sustained degradation; page-level alerts name the SLO and remaining budget share.
  • When a SKILL describes an alert, require the SLI, SLO target, and “first chart or query to open”.
  • Non-SLO infra alerts (disk, certificates) still need business context or service-tree nodes for correct routing.
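The burn-rate idea above can be made concrete with a small calculation. The sketch below assumes a 99.9% availability SLO and uses 14.4 (1h window) and 6.0 (6h window) as illustrative thresholds; these numbers and the function names are assumptions, not values from this document.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is burning."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def evaluate(err_1h, err_6h, slo_target, fast=14.4, slow=6.0):
    """Flag a spike (short window) and sustained degradation (long window) separately."""
    return {
        "spike": burn_rate(err_1h, slo_target) >= fast,
        "sustained": burn_rate(err_6h, slo_target) >= slow,
    }


def hours_to_exhaustion(rate, period_hours=30 * 24, budget_remaining=1.0):
    """Hours until the remaining budget share is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * period_hours / rate
```

A burn rate of 1 spends the budget exactly over the SLO period; a rate of 10 on a 30-day period would exhaust a full budget in about 72 hours, which is the kind of figure a page-level alert can state alongside the SLO name.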

On-call escalation policy configuration example (PagerDuty / OpsGenie equivalent):

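# PagerDuty escalation policy (OpsGenie escalation policies are equivalent)
# Illustrative YAML only, not an importable file: configure the real policy in the PagerDuty UI or Terraform.
# Here each delay is the time since the incident triggered before that level is notified;
# PagerDuty's own escalation_delay_in_minutes field instead counts how long a rule waits before escalating onward.

escalation_policy:
  name: "Payment Service On-call"
  num_loops: 2              # repeat the whole ladder up to 2 more times if nobody acknowledges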

  rules:
    # Level 1: first responder (immediate notification)
    - escalation_delay_in_minutes: 0
      targets:
        - type: schedule
          id: "payment-oncall-schedule"  # current on-call engineer

    # Level 2: unacknowledged after 15min → escalate to L2 expert
    - escalation_delay_in_minutes: 15
      targets:
        - type: user
          id: "payment-tech-lead"
        - type: schedule
          id: "payment-backup-schedule"

    # Level 3: still unacknowledged after 30min → notify management
    - escalation_delay_in_minutes: 30
      targets:
        - type: user
          id: "engineering-manager"

# Alertmanager has no native escalation ladder: repeat_interval only re-sends notifications for alerts that are still firing.
# Use it together with the PagerDuty escalation policy above.
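To check a ladder like the one above before configuring it, a short sketch can list who should have been notified after a given number of unacknowledged minutes. It assumes each delay is measured from the moment the incident triggered, which is one reading of the comments in the example; LADDER and notified_targets are hypothetical names.

```python
# (minutes since trigger, targets notified at that level), taken from the example policy
LADDER = [
    (0, ["payment-oncall-schedule"]),
    (15, ["payment-tech-lead", "payment-backup-schedule"]),
    (30, ["engineering-manager"]),
]


def notified_targets(minutes_unacked):
    """Return every target that has been notified after this many unacknowledged minutes."""
    targets = []
    for delay, level_targets in LADDER:
        if minutes_unacked >= delay:
            targets.extend(level_targets)
    return targets
```

For example, notified_targets(20) lists the on-call schedule plus both level-2 targets, but not the engineering manager.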

Alert detail page

The alert detail page in monitoring or incident tooling is the first screen an on-call responder sees: put everything needed for a decision in one view to cut tab hopping.

  • Title and summary: one line for “what broke vs baseline”; rule name, labels (env, region, tenant).
  • Linked panels: SLO/SLI trends, related dashboards, deploy/config timeline for the last 24h, links to similar past incidents.
  • Runbook entry: inline or one-click open for this alert type; placeholder to copy on-call summary into chat or tickets.

Runbooks and escalation

Each critical alert binds a runbook: dependency checks, mitigations (scale, degrade, traffic shift), rollback or flag locations, and when to escalate to tier-2 or leadership.

  • Clarify “single responder can execute” vs “must pull others”; document escalation (e.g. paid-tenant impact, write failures > 5 minutes).
  • Record facts: timeline, blast radius, actions tried, current state—for handoff and executive summaries.
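One way to keep those facts in a consistent shape is a small record that an Agent appends to as it works. This is a sketch under assumptions: the IncidentLog class, its field names, and the note layout are hypothetical and not tied to any incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    service: str
    blast_radius: str
    current_state: str = "investigating"
    timeline: list = field(default_factory=list)  # (timestamp, action, result) tuples

    def record(self, action, result, at=None):
        """Append one factual entry; defaults to the current UTC time."""
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, action, result))

    def handoff_note(self):
        """Render current state plus the timeline as plain text for a ticket or channel."""
        lines = [f"{self.service}: {self.current_state} (blast radius: {self.blast_radius})"]
        for at, action, result in self.timeline:
            lines.append(f"{at:%H:%M}Z {action} -> {result}")
        return "\n".join(lines)
```

The note deliberately holds only what was observed and what was tried, so the next responder or a post-incident report can reuse it unchanged.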

One-line on-call summary

For channel topics, ticket titles, or status-page drafts: fill the fields to generate a readable English summary of current state.

Format: [severity] service: symptom; current action (append blast radius if provided).
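The format above maps directly onto a small helper. This is a minimal sketch; the function name and parameter names are hypothetical.

```python
def oncall_summary(severity, service, symptom, action, blast_radius=None):
    """Build '[severity] service: symptom; current action (blast radius)'."""
    summary = f"[{severity}] {service}: {symptom}; {action}"
    if blast_radius:
        # blast radius is optional and appended in parentheses when provided
        summary += f" ({blast_radius})"
    return summary


print(oncall_summary("P1", "payment-svc", "5xx rate 3% vs 1% SLO",
                     "rolling back latest deploy", "EU tenants"))
```

The printed line is "[P1] payment-svc: 5xx rate 3% vs 1% SLO; rolling back latest deploy (EU tenants)", short enough for a channel topic or ticket title.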

---
name: alerting-oncall
description: Alert severity and on-call runbook templates
model: claude-sonnet-4-5
---

# Alert rule requirements
alert_rule_requirements:
  - every alert must include runbook_url
  - severity label: P0/P1 (page) | P2/P3 (ticket-level)
  - configure a for: duration on every rule (avoid transient noise)
  - alert message must explain: what broke + relative to what baseline

# Severity definitions
severity_definition:
  P0: core service fully unavailable; immediate page; respond within 15min
  P1: core path severely degraded; respond within 1h; PagerDuty
  P2: non-core or has workaround; Slack notification; handle on business day
  P3: experience degraded; ticket logged; address in next iteration

# Noise reduction strategy
noise_reduction:
  - group_by: [alertname, team] (merge same combinations)
  - inhibit_rules: suppress downstream alerts when upstream is P0
  - maintenance_silences: must link a ticket; no permanent silences

# Response steps
response_steps:
  1. Assess severity and whether SLO / burn rate is breached
  2. Acknowledge alert; update status (who is on it, next update ETA)
  3. Execute runbook: check recent deploys and dependencies
  4. Log timeline and decide escalation path
  5. After recovery: clear alert, open postmortem ticket
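The alert rule requirements in this template can be checked mechanically. The sketch below lints a rule that has already been loaded into a Python dict shaped like a Prometheus alerting rule; the lint_rule function and its messages are hypothetical, and it covers only the three requirements that can be read off the rule itself.

```python
PAGE_LEVELS = {"P0", "P1"}
TICKET_LEVELS = {"P2", "P3"}


def lint_rule(rule):
    """Return a list of problems found in one alert rule dict (empty list means OK)."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    if not rule.get("annotations", {}).get("runbook_url"):
        problems.append(f"{name}: missing runbook_url annotation")
    if rule.get("labels", {}).get("severity") not in PAGE_LEVELS | TICKET_LEVELS:
        problems.append(f"{name}: severity label must be one of P0, P1, P2, P3")
    if not rule.get("for"):
        problems.append(f"{name}: missing for duration")
    return problems
```

Whether the alert message explains what broke relative to a baseline still needs human or Agent review; a check like this only catches the structural omissions.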
