Monitoring & dashboards

Guide agents from user journeys to SLIs, SLOs, and error budgets, then on to RED/USE-style metrics, log correlation, and Grafana-as-code panel templates.

A SKILL should require every alert to carry a runbook link, a severity, and silencing rules. Avoid “charts only, no thresholds” dashboards and duplicate alert storms. If multi-window burn rate alerts are used, document where each parameter comes from.

Metric naming follows Prometheus conventions (unit suffixes such as _seconds and _bytes, _total for counters). Histogram buckets should align with the thresholds used in SLO queries, and high-cardinality labels must be forbidden explicitly.
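The naming rules above could be linted in CI before a metric ships. The following is a minimal sketch under stated assumptions: the function name `lint_metric`, the unit suffix list, and the forbidden-label set are hypothetical choices for illustration and are not part of any Prometheus tooling.

```python
import re

# Hypothetical rule set (assumption); adjust to the team's own conventions.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_ratio")
FORBIDDEN_LABELS = {"user_id", "email", "ip_address"}
# Character set Prometheus accepts for metric names.
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def lint_metric(name: str, kind: str, labels: list[str]) -> list[str]:
    """Return a list of human-readable violations (empty list means OK)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append(f"{name}: invalid characters in metric name")
    if kind == "counter" and not name.endswith("_total"):
        problems.append(f"{name}: counters should end in _total")
    if kind in ("gauge", "histogram") and not name.endswith(UNIT_SUFFIXES):
        problems.append(f"{name}: missing unit suffix")
    for label in labels:
        if label in FORBIDDEN_LABELS:
            problems.append(f"{name}: label '{label}' is high-cardinality")
    return problems
```

A call such as `lint_metric("http_requests", "counter", ["code"])` reports the missing `_total` suffix, while `http_requests_total` passes. Returning a list rather than raising lets a CI job print every violation in one run.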

Correlate distributed traces with logs via a shared trace id, state the default sampling rate and its trade-offs, and document dashboard variables and multi-environment datasource conventions.
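Trace-to-log correlation only works if every log line carries the trace id. The sketch below shows one way to do that with the Python standard library alone. The context variable `current_trace_id`, the logger name, and the trace id value are assumptions for illustration; a real service would take the id from its tracing library or the incoming request headers.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable (assumption): a real service would set it
# from the incoming trace context at the start of each request.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    # Naive JSON template: fine for a demo, but it does not escape quotes
    # inside the message.
    handler.setFormatter(logging.Formatter(
        '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", "msg": "%(message)s"}'
    ))
    logger = logging.getLogger("payment-svc-demo")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # example value only
log.info("charge accepted")
entry = json.loads(buffer.getvalue())
```

With the id present as a structured field, a dashboard link can filter logs by `trace_id` and jump to the matching trace, which is the join the SKILL should document.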

Observability design flow

  [ User journey / critical path ]
        │
        ▼
  ┌──────────────┐     Candidate SLIs: availability, latency percentiles, correctness, throughput
  │ SLI → SLO    │──── Error budget window and policy (quarterly / rolling)
  └──────────────┘
        │
        ▼
  ┌──────────────┐     RED / USE mapping + log / trace correlation fields
  │  Metrics     │──── Naming, buckets, label allowlists and cardinality review
  └──────────────┘
        │
        ▼
  ┌──────────────┐     By role: on-call / product / leadership; variables and env
  │ Grafana      │──── JSON export or Terraform / repo path
  └──────────────┘
        │
        ▼
  ┌──────────────┐     Thresholds, routing, silences, runbooks; burn rate params documented
  │  Alerts      │──── Budget exhausted → release freeze or link to release skill
  └──────────────┘

RED and USE metrics

For request-driven microservices prioritize RED; for hosts, queues, and storage capacity add USE. When a SKILL outputs metrics, map each to SLO queries and avoid duplicate charts.

RED: rate, errors, duration

Rate: requests or jobs per second; align with capacity planning, rate limits, and autoscaling.
Errors: HTTP 5xx, business failure codes, timeouts; numerator and denominator must match the SLO definition.
Duration: p50 / p95 / p99; histogram buckets shared with the alert PromQL.

USE: utilization, saturation, errors

Utilization: CPU, memory, pool occupancy; distinguish instance vs cluster.
Saturation: queue depth, wait time, thread pool backlog—“how full” signals.
Errors: device / driver / disk IO failures (complements RED errors).

Golden Signals standard dashboard layout (Grafana JSON structure example):

// Grafana dashboard JSON snippet: Golden Signals layout with an availability SLO panel
// (annotation only: remove these two comment lines before importing, since JSON has no comment syntax)
{
  "title": "payment-svc — Golden Signals",
  "uid": "payment-golden-signals-v1",
  "tags": ["payment", "slo", "oncall"],
  "templating": {
    "list": [
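      {
        "name": "env",
        "type": "custom",
        "query": "prod,staging",
        "options": [
          {"text": "prod", "value": "prod", "selected": true},
          {"text": "staging", "value": "staging", "selected": false}
        ],
        "current": {"text": "prod", "value": "prod"}
      }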
    ]
  },
  "panels": [
    {
      "title": "Availability SLO (99.9%)",
      "type": "stat",
      "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job='payment-svc',env='$env',code!~'5..'}[28d])) / sum(rate(http_requests_total{job='payment-svc',env='$env'}[28d])) * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 99.5, "color": "yellow"},
              {"value": 99.9, "color": "green"}
            ]
          },
          "unit": "percent"
        }
      },
      "options": {"reduceOptions": {"calcs": ["lastNotNull"]}}
    },
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "gridPos": {"x": 6, "y": 0, "w": 9, "h": 4},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job='payment-svc',env='$env'}[5m])) by (code)",
        "legendFormat": "HTTP {{code}}"
      }]
    },
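    {
      "title": "Latency p99 vs SLO (200ms)",
      "type": "timeseries",
      "gridPos": {"x": 15, "y": 0, "w": 9, "h": 4},
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job='payment-svc',env='$env'}[5m])) by (le))",
        "legendFormat": "p99 latency"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "custom": {"thresholdsStyle": {"mode": "line"}},
          "thresholds": {"steps": [{"value": null, "color": "green"}, {"value": 0.2, "color": "red"}]}
        }
      }
    }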
  ]
}
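When panels are generated as code, a small layout check can catch overlapping or overflowing `gridPos` values before the JSON is imported. The sketch below relies on Grafana's 24-column grid; the function name `layout_problems` is a hypothetical helper, and the panel list simply repeats the titles and positions from the snippet above.

```python
GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid


def layout_problems(panels: list[dict]) -> list[str]:
    """Report panels that overflow the grid or overlap each other."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: exceeds {GRID_COLUMNS} columns")
    for i, a in enumerate(panels):
        for b in panels[i + 1:]:
            ga, gb = a["gridPos"], b["gridPos"]
            overlap_x = ga["x"] < gb["x"] + gb["w"] and gb["x"] < ga["x"] + ga["w"]
            overlap_y = ga["y"] < gb["y"] + gb["h"] and gb["y"] < ga["y"] + ga["h"]
            if overlap_x and overlap_y:
                problems.append(f"{a['title']} overlaps {b['title']}")
    return problems


panels = [
    {"title": "Availability SLO (99.9%)", "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4}},
    {"title": "Request Rate (RPS)", "gridPos": {"x": 6, "y": 0, "w": 9, "h": 4}},
    {"title": "Latency p99 vs SLO (200ms)", "gridPos": {"x": 15, "y": 0, "w": 9, "h": 4}},
]
```

The three panels fill one row exactly (6 + 9 + 9 = 24), so `layout_problems(panels)` returns an empty list.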

SLI, SLO, and error budget

The SKILL body should spell out the measurement window, the compliance definition, and whether multi-window burn rate alerts follow Google SRE conventions.

SLI candidates and measurement window

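Derive SLI candidates from the journey: successful request ratio, latency percentiles, data freshness, and so on. Define the aggregation window (5m / 1h) and what counts as “good”. Histogram le buckets must include each SLO threshold as a boundary; otherwise histogram_quantile interpolates inside a wider bucket and can misreport whether the threshold was met.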

# Availability SLO: Prometheus query (28-day rolling window)
# Definition: non-5xx requests / total requests (4xx client errors do not count as failures)
sum(rate(http_requests_total{job="payment-svc",code!~"5.."}[28d]))
/
sum(rate(http_requests_total{job="payment-svc"}[28d]))

# Equivalent Recording Rule (reduces query load)
# "by (job)" keeps the job label so later queries can select {job="payment-svc"}
# prometheus rules:
- record: job:slo_availability:ratio_rate28d
  expr: |
    sum by (job) (rate(http_requests_total{job="payment-svc",code!~"5.."}[28d]))
    / sum by (job) (rate(http_requests_total{job="payment-svc"}[28d]))

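# Latency SLO: p99 < 200ms (histogram query)
# The comparison acts as a filter: a sample is returned only while p99 is under 0.2 s.
# The estimate is only reliable if the histogram has a bucket boundary at le="0.2".
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="payment-svc"}[5m]))
  by (le)
) < 0.2  # 200ms SLO threshold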
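The effect of bucket boundaries on a latency SLO can be shown with a simplified re-implementation of the linear interpolation that histogram_quantile performs on classic histograms. This is an illustrative sketch, not Prometheus code: `estimate_quantile` and `bucket_covers` are hypothetical helpers, the +Inf bucket is left out, and the sample data assumes 1000 requests that all took between 0.1 s and 0.2 s.

```python
def bucket_covers(threshold: float, upper_bounds: list[float]) -> bool:
    """True if the SLO threshold is itself a bucket boundary."""
    return any(abs(bound - threshold) < 1e-9 for bound in upper_bounds)


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Linear interpolation over (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound; the first bucket is assumed
    to start at 0, as with classic Prometheus histograms.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


# Same traffic, two bucket layouts: every request lands in (0.1 s, 0.2 s].
coarse = [(0.1, 0), (0.5, 1000), (1.0, 1000)]   # no boundary at 0.2
aligned = [(0.1, 0), (0.2, 1000), (0.5, 1000)]  # boundary at the SLO

p99_coarse = estimate_quantile(0.99, coarse)    # about 0.496 s
p99_aligned = estimate_quantile(0.99, aligned)  # about 0.199 s
```

With the coarse layout the estimated p99 sits near 0.5 s and the 200 ms SLO appears violated, even though no request exceeded 0.2 s. Adding a boundary at the threshold keeps the estimate below it.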

Error budget and exemptions

Budget burn ties to release cadence; document freeze conditions, emergency exemption approval, and post-hoc accounting. Dashboards should show remaining budget and window burn trend, not only instantaneous SLI.

# Error budget calculation (99.9% SLO, 28-day window)
# Total allowed error minutes = 28 * 24 * 60 * (1 - 0.999) = 40.32 min
# Consumed budget query (minutes of the 40.32 used so far):
(1 - job:slo_availability:ratio_rate28d{job="payment-svc"}) * 28 * 24 * 60

# Remaining budget query (minutes; a negative result means the budget is overspent):
(job:slo_availability:ratio_rate28d{job="payment-svc"} - 0.999) * 28 * 24 * 60

# Error budget burn percentage (for dashboard display)
(1 - job:slo_availability:ratio_rate28d{job="payment-svc"})
/ (1 - 0.999)
* 100  # Result > 100 means budget is exceeded
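The budget arithmetic in the queries above can be checked offline. The sketch below reproduces the same formulas in Python; the function names are hypothetical, and the 99.95% measured availability is an assumed example value rather than real data.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total minutes of allowed failure in the window."""
    return window_days * 24 * 60 * (1 - slo)


def budget_consumed_percent(availability: float, slo: float) -> float:
    """Share of the error budget already spent; above 100 means overspent."""
    return (1 - availability) / (1 - slo) * 100


def budget_remaining_minutes(availability: float, slo: float, window_days: int) -> float:
    """Minutes of budget left; negative means the budget is overspent."""
    return (availability - slo) * window_days * 24 * 60


total = error_budget_minutes(0.999, 28)                  # about 40.32
consumed = budget_consumed_percent(0.9995, 0.999)        # about 50
remaining = budget_remaining_minutes(0.9995, 0.999, 28)  # about 20.16
```

Expressing the budget in minutes treats traffic as evenly spread over the window. For a request-based SLI with uneven traffic, the percentage form is the more faithful number to put on a dashboard.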

Multi-window burn rate alerts

Short + long windows (e.g. 1h + 6h) reduce false positives; the SKILL must list windows, threshold multipliers, and data source (recording rule vs direct query). Do not copy unexplained magic numbers.

# Burn Rate multi-window alert (Google SRE Workbook approach)
# Alert condition: both the 1h AND the 6h burn rate exceed their thresholds (reduces false positives)
# Threshold source: SRE Workbook table for a 30-day window
#   14.4 = 2% of the budget spent in 1h; 6 = 5% of the budget spent in 6h
groups:
  - name: slo.payment-svc
    rules:
    - alert: PaymentSLOBurnRateHigh
      expr: |
        (
          # Short window (1h): error ratio is more than 14.4x the rate the SLO allows
          sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="payment-svc"}[1h]))
          / (1 - 0.999) > 14.4
        ) and (
          # Long window (6h) burn rate > 6x (prevents false positives from brief spikes)
          sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[6h]))
          / sum(rate(http_requests_total{job="payment-svc"}[6h]))
          / (1 - 0.999) > 6
        )
      for: 2m
      labels:
        severity: critical
        slo: availability
      annotations:
        summary: "Payment SLO burn rate critical: 1h burn rate above 14.4x and 6h burn rate above 6x"
        runbook_url: "https://wiki/runbooks/payment-slo-burn"
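To avoid unexplained magic numbers, the multipliers can be derived rather than copied. A burn rate threshold is the fraction of budget you are willing to lose in the alert window, scaled by how many alert windows fit in the SLO window. The sketch below uses hypothetical function names; the 2%-in-1h and 5%-in-6h inputs are the ones the Google SRE Workbook uses for a 30-day window, and the 28-day figures are computed here for comparison.

```python
def burn_rate_threshold(budget_fraction: float,
                        slo_window_hours: float,
                        alert_window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent
    within one alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def hours_to_exhaustion(burn_rate: float, slo_window_hours: float) -> float:
    """How long a full budget lasts at a constant burn rate."""
    return slo_window_hours / burn_rate


THIRTY_DAYS = 30 * 24
TWENTY_EIGHT_DAYS = 28 * 24

fast_30d = burn_rate_threshold(0.02, THIRTY_DAYS, 1)        # about 14.4
slow_30d = burn_rate_threshold(0.05, THIRTY_DAYS, 6)        # about 6.0
fast_28d = burn_rate_threshold(0.02, TWENTY_EIGHT_DAYS, 1)  # about 13.44
slow_28d = burn_rate_threshold(0.05, TWENTY_EIGHT_DAYS, 6)  # about 5.6
```

A SKILL that keeps the 14.4 and 6 multipliers with a 28-day window should say so explicitly, since the same budget fractions give 13.44 and 5.6 there. At a sustained burn rate of 14.4, a 30-day budget is gone in about 50 hours.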

Align with alerting & on-call: every SLO-related alert carries a runbook link, a severity, and silencing rules; where a runbook is not yet written, embed a dashboard link as a placeholder.

Alerting, dashboard layers, and cardinality

Dashboards: layer by role (on-call, product, leadership) to avoid single-screen overload.
Import/export: Grafana JSON or Terraform provider paths.
Post-incident: release freeze when budget is exhausted, or flow linked to a release skill.
High-cardinality dimensions (e.g. raw user_id) are forbidden unless explicitly approved in the SKILL.
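A cardinality review can start from a worst-case estimate: the number of time series for a metric is bounded by the product of the distinct values of each label. The sketch below is a rough upper bound under assumed label counts; `worst_case_series` and all the numbers are hypothetical and only illustrate why a raw user_id label is forbidden.

```python
from math import prod


def worst_case_series(label_values: dict[str, int],
                      series_per_label_set: int = 1) -> int:
    """Upper bound on time series for one metric.

    label_values maps each label name to its number of distinct values.
    series_per_label_set covers histograms, which emit one series per
    bucket plus _sum and _count for every label combination.
    """
    return prod(label_values.values()) * series_per_label_set


# Assumed label counts for illustration only.
safe_labels = {"code": 8, "method": 5, "env": 2}
risky_labels = {**safe_labels, "user_id": 100_000}

counter_series = worst_case_series(safe_labels)                              # 80
histogram_series = worst_case_series(safe_labels, series_per_label_set=12)   # 960
exploded_series = worst_case_series(risky_labels)                            # 8_000_000
```

The real count is usually lower because not every combination occurs, but the bound is enough to reject a label before it reaches production.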

Panel title slug export

Generate stable slugs from titles inside .dash-slug-panel blocks (prefer data-dash-slug, else derive from Latin characters in the title) for IaC, Grafana uid prefixes, or SKILL appendices.

If a title has no Latin segment, maintain machine-readable ids in HTML data-dash-slug; paste the output into repo docs as a team convention table.
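The slug rule described above (prefer data-dash-slug, else derive from the Latin characters in the title) can be sketched as a small function. The name `panel_slug`, the lowercase hyphenated output format, and the choice to raise on titles without a Latin segment are assumptions for illustration; the page's own export may format slugs differently.

```python
import re
from typing import Optional


def panel_slug(title: str, data_dash_slug: Optional[str] = None) -> str:
    """Return a stable slug for a panel title.

    An explicit data-dash-slug value wins. Otherwise the slug is built
    from the runs of ASCII letters and digits in the title.
    """
    if data_dash_slug:
        return data_dash_slug
    words = re.findall(r"[A-Za-z0-9]+", title)
    if not words:
        raise ValueError(
            f"no Latin segment in title {title!r}; set data-dash-slug instead"
        )
    return "-".join(word.lower() for word in words)


latency_slug = panel_slug("Latency p99 vs SLO (200ms)")
rate_slug = panel_slug("Request Rate (RPS)")
explicit_slug = panel_slug("延迟", data_dash_slug="latency-p99")
```

Here `latency_slug` is `latency-p99-vs-slo-200ms` and `rate_slug` is `request-rate-rps`. Failing loudly on a title with no Latin segment keeps empty slugs out of Grafana uid prefixes.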

---
name: monitoring-dashboards
description: SLI/SLO, alerting, and Grafana dashboard design
model: claude-sonnet-4-5
---

# Steps
steps:
  1. Map user journey to SLIs (availability/latency/correctness)
  2. Set SLO target values and error budget window (28-day rolling)
  3. Design Prometheus metric naming (unit suffix / _total counters)
  4. Define histogram buckets (cover SLO thresholds, avoid extrapolation error)
  5. Configure Recording Rules (reduce query complexity)
  6. Design Grafana panels (Golden Signals: Rate/Errors/Duration/Saturation)
  7. Configure multi-window burn rate alerts (1h+6h combination)
  8. Set up alert routing with runbook links

# Metric naming conventions
naming_convention:
  counter: http_requests_total (_total suffix)
  gauge: process_memory_bytes (unit suffix)
  histogram: http_request_duration_seconds (seconds as unit)
  forbidden_labels: [user_id, email, ip_address]  # high-cardinality forbidden
