Monitoring & dashboards
Guide Agents from user journeys to SLIs, SLOs and error budgets, and to RED/USE-style metrics, log correlation, and Grafana-as-code panel templates.
A SKILL should require alerts to include runbook links, severity, and silencing rules; avoid “charts only, no thresholds” or duplicate alert storms; if multi-window burn rate is used, document parameter sources.
Metric naming follows Prometheus conventions (unit suffix, _total counters); histogram buckets align with SLO queries; forbid high-cardinality tags explicitly.
Correlate distributed traces with logs via trace ID; state the default sampling trade-offs; define dashboard variables and multi-environment datasource conventions.
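As a rough illustration of trace-to-log correlation, the sketch below attaches a trace ID to every structured log line using only the Python standard library. The names `trace_id_var`, `TraceIdFilter`, `JsonFormatter`, and `build_logger` are hypothetical; a real service would usually take the ID from its tracing library's context rather than set it by hand.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto each log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so logs can be joined on trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    """Return a logger that writes trace-correlated JSON lines to stream."""
    logger = logging.getLogger("payment-svc")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger(out)
    trace_id_var.set("abc123")  # assumed to come from the incoming request
    log.info("charge accepted")
    print(out.getvalue().strip())
```

With the same field name on both sides, a log panel can filter by the trace ID copied from a trace view.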
Observability design flow
[ User journey / critical path ]
│
▼
┌─────────────┐ Candidate SLIs: availability, latency percentiles, correctness, throughput
│ SLI → SLO │──── Error budget window and policy (quarterly / rolling)
└─────────────┘
│
▼
┌─────────────┐ RED / USE mapping + log / trace correlation fields
│ Metrics │──── Naming, buckets, label allowlists and cardinality review
└─────────────┘
│
▼
┌─────────────┐ By role: on-call / product / leadership; variables and env
│ Grafana │──── JSON export or Terraform / repo path
└─────────────┘
│
▼
┌─────────────┐ Thresholds, routing, silences, runbooks; burn rate params documented
│ Alerts │──── Budget exhausted → release freeze or link to release skill
└─────────────┘
RED and USE metrics
For request-driven microservices prioritize RED; for hosts, queues, and storage capacity add USE. When a SKILL outputs metrics, map each to SLO queries and avoid duplicate charts.
RED: rate, errors, duration
- Rate: requests or jobs per second; align with capacity, rate limits, autoscale.
- Errors: HTTP 5xx, business failure codes, timeouts; numerator/denominator match the SLO.
- Duration: p50 / p95 / p99; histogram buckets shared with alert PromQL.
USE: utilization, saturation, errors
- Utilization: CPU, memory, pool occupancy; distinguish instance vs cluster.
- Saturation: queue depth, wait time, thread pool backlog—“how full” signals.
- Errors: device / driver / disk IO failures (complements RED errors).
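To make the RED definitions concrete, the following sketch derives rate and error ratio from two scrapes of cumulative counters, which is roughly what a `rate()` based query does. The `CounterSample` shape and the function name are assumptions for illustration, and counter resets are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One scrape of cumulative counters (hypothetical shape)."""
    timestamp_s: float     # scrape time in seconds
    requests_total: float  # cumulative request count
    errors_total: float    # cumulative failed-request count


def red_rate_and_errors(first, last):
    """Return (requests per second, error ratio) between two scrapes.

    Simplification: assumes the counters did not reset in between.
    """
    elapsed = last.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    requests = last.requests_total - first.requests_total
    errors = last.errors_total - first.errors_total
    rate = requests / elapsed
    error_ratio = errors / requests if requests > 0 else 0.0
    return rate, error_ratio


if __name__ == "__main__":
    start = CounterSample(0.0, 1000.0, 10.0)
    end = CounterSample(300.0, 4000.0, 40.0)
    print(red_rate_and_errors(start, end))
```

Using the same numerator and denominator here as in the SLO query keeps the dashboard and the alert from disagreeing.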
Golden Signals standard dashboard layout (Grafana JSON structure example):
// Grafana Dashboard JSON snippet: Golden Signals layout
// Full configuration including SLO panel (availability)
{
  "title": "payment-svc — Golden Signals",
  "uid": "payment-golden-signals-v1",
  "tags": ["payment", "slo", "oncall"],
  "templating": {
    "list": [
      {
        "name": "env",
        "type": "custom",
        "options": [{"value": "prod"}, {"value": "staging"}],
        "current": {"value": "prod"}
      }
    ]
  },
  "panels": [
    {
      "title": "Availability SLO (99.9%)",
      "type": "stat",
      "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job='payment-svc',env='$env',code!~'5..'}[28d])) / sum(rate(http_requests_total{job='payment-svc',env='$env'}[28d])) * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"value": 0, "color": "red"},
              {"value": 99.5, "color": "yellow"},
              {"value": 99.9, "color": "green"}
            ]
          },
          "unit": "percent"
        }
      },
      "options": {"reduceOptions": {"calcs": ["lastNotNull"]}}
    },
    {
      "title": "Request Rate (RPS)",
      "type": "timeseries",
      "gridPos": {"x": 6, "y": 0, "w": 9, "h": 4},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job='payment-svc',env='$env'}[5m])) by (code)",
        "legendFormat": "HTTP {{code}}"
      }]
    },
    {
      "title": "Latency p99 vs SLO (200ms)",
      "type": "timeseries",
      "gridPos": {"x": 15, "y": 0, "w": 9, "h": 4},
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job='payment-svc',env='$env'}[5m])) by (le))",
        "legendFormat": "p99 latency"
      }]
    }
  ]
}
SLI, SLO, and error budget
For each topic below, the SKILL body should spell out the measurement window, the compliance definition, and whether multi-window burn rate follows Google SRE conventions.
SLI candidates and measurement window
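Map from the journey: successful request ratio, latency percentiles, data freshness, etc.; define the aggregation window (5m / 1h) and what counts as “good”. Histogram le buckets must include a boundary at each SLO threshold (e.g. le="0.2" for a 200 ms target): histogram_quantile interpolates linearly inside a bucket, so a threshold that falls mid-bucket yields an estimate, not a measurement.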
# Availability SLO: Prometheus query (28-day rolling window)
# Definition: non-5xx requests / total requests
# (4xx client errors count as good, matching the dashboard panel and the burn rate alert)
sum(rate(http_requests_total{job="payment-svc",code!~"5.."}[28d]))
/
sum(rate(http_requests_total{job="payment-svc"}[28d]))
# Equivalent Recording Rule (reduces query load)
# "sum by (job)" keeps the job label so the recorded series can be selected with {job="payment-svc"}
# prometheus rules:
- record: job:slo_availability:ratio_rate28d
  expr: |
    sum by (job) (rate(http_requests_total{job="payment-svc",code!~"5.."}[28d]))
    / sum by (job) (rate(http_requests_total{job="payment-svc"}[28d]))
# Latency SLO: p99 < 200ms (histogram query)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="payment-svc"}[5m]))
  by (le)
) < 0.2 # 200ms SLO threshold
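The effect of bucket placement on the latency SLO can be sketched with a simplified re-implementation of linear bucket interpolation, the approach `histogram_quantile` takes for classic histograms. The function below is an approximation written for illustration, not Prometheus code, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets by linear interpolation.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with at least two entries and ending in the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in +Inf: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Same hypothetical traffic, recorded with and without a 0.2 s boundary.
coarse = [(0.1, 900), (0.5, 995), (math.inf, 1000)]
fine = [(0.1, 900), (0.2, 992), (0.5, 995), (math.inf, 1000)]

if __name__ == "__main__":
    print(round(histogram_quantile(0.99, coarse), 3))
    print(round(histogram_quantile(0.99, fine), 3))
```

In this sample the coarse layout reports a p99 near 0.48 s and appears to breach the 200 ms SLO, while the layout with a boundary at 0.2 reports a value just under 0.2 s for the same requests.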
Error budget and exemptions
Budget burn ties to release cadence; document freeze conditions, emergency exemption approval, and post-hoc accounting. Dashboards should show remaining budget and window burn trend, not only instantaneous SLI.
# Error budget calculation (99.9% SLO, 28-day window)
# Total allowed error minutes = 28 * 24 * 60 * (1 - 0.999) = 40.32 min
# Consumed budget query (assumes the recorded series carries the job label):
(1 - job:slo_availability:ratio_rate28d{job="payment-svc"}) * 28 * 24 * 60
# Result in minutes: error budget consumed so far; remaining budget = 40.32 - consumed
# Error budget burn percentage (for dashboard display)
(1 - job:slo_availability:ratio_rate28d{job="payment-svc"})
  / (1 - 0.999)
  * 100 # Result > 100 means budget is exceeded
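The budget arithmetic above can be checked offline with a short sketch. The function names are hypothetical, and the 99.95% availability in the usage example is an invented input, not a measured value.

```python
def error_budget_minutes(slo_target, window_days):
    """Total minutes of full outage the SLO tolerates over the window."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_status(availability, slo_target, window_days):
    """Return allowed, consumed, and remaining budget plus burn percentage."""
    window_minutes = window_days * 24 * 60
    allowed = error_budget_minutes(slo_target, window_days)
    consumed = (1 - availability) * window_minutes
    return {
        "allowed_min": allowed,
        "consumed_min": consumed,
        "remaining_min": allowed - consumed,
        "burn_percent": consumed / allowed * 100,
    }


if __name__ == "__main__":
    # 99.9% SLO over a 28-day window, as in the queries above.
    print(round(error_budget_minutes(0.999, 28), 2))
    # Hypothetical measured availability of 99.95%.
    status = budget_status(0.9995, 0.999, 28)
    print({key: round(value, 2) for key, value in status.items()})
```

A dashboard that shows remaining minutes next to the burn percentage makes it easier to tell whether a release freeze is approaching.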
Multi-window burn rate alerts
Short + long windows (e.g. 1h + 6h) reduce false positives; the SKILL must list windows, threshold multipliers, and data source (recording rule vs direct query). Do not copy unexplained magic numbers.
# Burn rate multi-window alert (adapted from the Google SRE Workbook)
# Alert condition: both the 1h AND 6h window burn rates exceed their thresholds (reduces false positives)
# Multiplier source: SRE Workbook values for a 30-day window:
#   14.4 = 2% of the budget in 1h (0.02 * 720h / 1h); 6 = 5% of the budget in 6h (0.05 * 720h / 6h)
# For the 28-day window used above, the equivalent multipliers are 13.44 and 5.6.
groups:
  - name: slo.payment-svc
    rules:
      - alert: PaymentSLOBurnRateHigh
        expr: |
          (
            # Short window (1h): error ratio is more than 14.4x what the SLO allows
            sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="payment-svc"}[1h]))
            / (1 - 0.999) > 14.4
          ) and (
            # Long window (6h): burn rate > 6x (prevents false positives from brief spikes)
            sum(rate(http_requests_total{job="payment-svc",code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="payment-svc"}[6h]))
            / (1 - 0.999) > 6
          )
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Payment SLO burn rate critical: 1h burn rate > 14.4x and 6h burn rate > 6x"
          runbook_url: "https://wiki/runbooks/payment-slo-burn"
Align with alerting & on-call: every SLO-related alert binds a runbook, severity, and silencing rules; embed dashboard links as runbook placeholders.
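One way to document where burn rate multipliers come from is to derive them from the share of budget an alert is allowed to spend. The sketch below assumes the usual relation, burn rate = budget fraction × SLO window / alert window; the function names are hypothetical.

```python
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_days):
    """Burn rate that spends budget_fraction of the budget in alert_window_hours."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def hours_to_exhaust_budget(burn_rate, slo_window_days):
    """Hours until the whole budget is gone at a constant burn rate."""
    return slo_window_days * 24 / burn_rate


if __name__ == "__main__":
    # 30-day window: 2% of budget in 1h, and 5% of budget in 6h.
    print(round(burn_rate_threshold(0.02, 1, 30), 2))
    print(round(burn_rate_threshold(0.05, 6, 30), 2))
    # The same budget fractions for a 28-day window.
    print(round(burn_rate_threshold(0.02, 1, 28), 2))
    print(round(burn_rate_threshold(0.05, 6, 28), 2))
    print(round(hours_to_exhaust_budget(14.4, 30), 1))
```

Recording the budget fraction and window next to each threshold lets a reviewer recompute the multiplier when the SLO window changes.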
Alerting, dashboard layers, and cardinality
- Dashboards: layer by role (on-call, product, leadership) to avoid single-screen overload.
- Import/export: Grafana JSON or Terraform provider paths.
- Post-incident: release freeze when budget is exhausted, or flow linked to a release skill.
- High-cardinality dimensions (e.g. raw user_id) are forbidden unless explicitly approved in the SKILL.
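A cardinality review can start from a sample of series label sets. The sketch below counts distinct values per label and flags labels above a limit; the input shape, the function names, and the limit of 2 in the usage example are assumptions chosen only to keep the example small.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label across an iterable of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def labels_over_limit(series, limit):
    """Return the sorted names of labels with more than limit distinct values."""
    counts = label_cardinality(series)
    return sorted(name for name, count in counts.items() if count > limit)


if __name__ == "__main__":
    sample = [
        {"job": "payment-svc", "code": "200", "user_id": "u1"},
        {"job": "payment-svc", "code": "500", "user_id": "u2"},
        {"job": "payment-svc", "code": "200", "user_id": "u3"},
    ]
    print(label_cardinality(sample))
    print(labels_over_limit(sample, limit=2))
```

In practice the limit would be set per team and the sample drawn from the metrics backend; both are left open here.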
Panel title slug export
Generate stable slugs from titles inside .dash-slug-panel blocks (prefer data-dash-slug, else derive from Latin characters in the title) for IaC, Grafana uid prefixes, or SKILL appendices.
If a title has no Latin segment, maintain machine-readable ids in HTML data-dash-slug; paste the output into repo docs as a team convention table.
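The slug rule described above might look like the following sketch. `panel_slug` is a hypothetical helper: the `explicit_slug` argument stands in for a `data-dash-slug` value, and runs of ASCII letters and digits are treated as the Latin segments of the title.

```python
import re


def panel_slug(title, explicit_slug=None):
    """Return a stable slug for a panel title.

    Prefers an explicit slug; otherwise joins the lowercase ASCII
    letter and digit runs of the title with hyphens.
    """
    if explicit_slug:
        return explicit_slug
    words = re.findall(r"[A-Za-z0-9]+", title)
    if not words:
        raise ValueError("title has no Latin segment; supply an explicit slug")
    return "-".join(word.lower() for word in words)


if __name__ == "__main__":
    print(panel_slug("Latency p99 vs SLO (200ms)"))
    print(panel_slug("Request Rate (RPS)"))
    print(panel_slug("延迟", explicit_slug="latency-p99"))
```

Raising an error for titles without a Latin segment forces the explicit id to be maintained instead of silently emitting an empty slug.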
---
name: monitoring-dashboards
description: SLI/SLO, alerting, and Grafana dashboard design
model: claude-sonnet-4-5
---
# Steps
steps:
1. Map user journey to SLIs (availability/latency/correctness)
2. Set SLO target values and error budget window (28-day rolling)
3. Design Prometheus metric naming (unit suffix / _total counters)
4. Define histogram buckets (cover SLO thresholds, avoid extrapolation error)
5. Configure Recording Rules (reduce query complexity)
6. Design Grafana panels (Golden Signals: latency/traffic/errors/saturation)
7. Configure multi-window burn rate alerts (1h+6h combination)
8. Set up alert routing with runbook links
# Metric naming conventions
naming_convention:
counter: http_requests_total (_total suffix)
gauge: process_memory_bytes (unit suffix)
histogram: http_request_duration_seconds (seconds as unit)
forbidden_labels: [user_id, email, ip_address] # high-cardinality forbidden
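The naming conventions above could be enforced with a small lint step in review or CI. This is a minimal sketch: `lint_metric` is a hypothetical helper, the unit suffix list is an assumed allowlist, and the forbidden labels are taken from the convention block.

```python
# Assumed allowlist of unit suffixes; extend to match the team's units.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_ratio")
# Labels forbidden by the convention because of high cardinality.
FORBIDDEN_LABELS = {"user_id", "email", "ip_address"}


def lint_metric(name, metric_type, labels):
    """Return a list of convention violations for one metric definition."""
    problems = []
    if metric_type == "counter":
        if not name.endswith("_total"):
            problems.append("counter name should end with _total")
    elif metric_type in ("gauge", "histogram"):
        if not name.endswith(UNIT_SUFFIXES):
            problems.append("name should end with a unit suffix")
    for label in sorted(set(labels) & FORBIDDEN_LABELS):
        problems.append(f"forbidden high-cardinality label: {label}")
    return problems


if __name__ == "__main__":
    print(lint_metric("http_requests_total", "counter", ["code"]))
    print(lint_metric("http_requests", "counter", ["code", "user_id"]))
    print(lint_metric("http_request_duration", "histogram", ["le"]))
```

An empty list means the metric passes; anything else can fail the check or be routed to an explicit approval, as the SKILL requires for high-cardinality labels.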