Canary release

Route a small slice of traffic to the new version, then step up or auto-rollback on metrics; pairs well with feature flags, service mesh, or Ingress weights—have the agent produce staged policies.

The skill should define initial percentage, minimum observation time per stage, promotion criteria (errors, timeouts, conversion), and veto metrics (e.g. payment failure spike); and how canary instances or pods are labeled so logs and traces can be compared to baseline.

Versus blue-green: canary stresses gradual rollout and programmable gates; the agent should call out long-lived connections, cache skew, and uneven geo traffic. Roll back by zeroing weight or flipping flags before full redeploy.

  • Before widening traffic, ensure monitoring and alerts tag or bucket the new version.
  • Human gates: who approves full rollout and which dashboards justify it.

Canary pipeline (deploy → split → observe)

  [ Build / deploy canary replicas ]
        │
        ▼
  ┌─────────────┐     Initial weight: 1%–5% (or stable user buckets)
  │ Split traffic│──── Baseline vs canary: same routing, version labels
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Window: min observation + SLO comparison per stage
  │ Metric gates │──── Promote: step up; veto: auto-rollback or zero weight
  └─────────────┘
        │
        ▼
  ┌─────────────┐     Before full: human approval + final dashboard capture
  │ Full or roll │──── Runbook: who acts, how to verify recovery
  └─────────────┘

At each stage end, compare like-for-like metrics: by version, route, tenant—avoid treating global noise as canary signal.

Traffic splitting and implementation

Split at L7 (Ingress / API gateway weights), data plane (mesh VirtualService), or in-app (stable buckets from user id hash). The skill should name a single control plane so stacked weights do not produce surprising blends.

Common patterns

  • Ingress / Gateway: weight or header routing to canary Service
  • Mesh: subset + percentage or match rules
  • Feature flags: cohort-based logic, combinable with deployed canary

Consistency

  • Sticky sessions: same user on same version unless you explicitly reshuffle
  • Long-lived / WebSocket: after promotion, old connections may still hit old code—plan for it
  • CDN / edge: static asset versions aligned with API contracts
# Nginx canary traffic splitting configuration example

# Method 1: Percentage-based split (random sampling)
upstream app_stable {
    server stable.internal:8080;
}
upstream app_canary {
    server canary.internal:8080;
}

split_clients "${remote_addr}${http_user_agent}" $app_variant {
    5%    canary;   # 5% goes to canary
    *     stable;   # 95% stays on stable
}

server {
    location / {
        proxy_pass http://app_$app_variant;
    }
}

# Method 2: Header-based routing (for targeted testing)
server {
    location / {
        if ($http_x_canary = "true") {
            proxy_pass http://app_canary;
            break;
        }
        proxy_pass http://app_stable;
    }
}

Metric gates and rollback

For each stage define promotion criteria and immediate rollback. Promotion is often relative to baseline (error uplift, P99 delta caps); rollback hits hard limits (payment failure rate, fatal log volume, saturation).

  • Relative: canary bucket error rate ≤ baseline + ε, or experiment conversion ≥ control − δ.
  • Absolute: 5xx share, timeouts, queue depth, breaker trips below thresholds.
  • Sample size: low-traffic services need longer windows or larger initial % to avoid variance traps.
  • Automation: analysis job or pipeline reads metrics APIs; on failure call APIs to zero weight or trigger Argo rollback.

Define veto metrics with the business; when tech SLOs and business KPIs both fail, default to protecting revenue and trust.

# Prometheus query: canary vs baseline error rate comparison
# Canary error rate
rate(http_requests_total{env="canary",status=~"5.."}[5m])
  / rate(http_requests_total{env="canary"}[5m])

# Baseline (stable) error rate
rate(http_requests_total{env="stable",status=~"5.."}[5m])
  / rate(http_requests_total{env="stable"}[5m])

# Gate check: canary error rate must not exceed 2x the baseline
# If canary_err_rate > baseline_err_rate * 2: trigger rollback
# Argo Rollouts canary configuration example
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5      # 5% to canary
        - pause: {duration: 10m}
        - setWeight: 20     # 20% to canary
        - pause: {duration: 10m}
        - setWeight: 50     # 50% to canary
        - pause: {duration: 10m}
        # Full rollout after all analysis passes
      analysis:
        templates:
          - templateName: error-rate-check
        startingStep: 1
        args:
          - name: service-name
            value: my-app-canary

Stage percentage calculator

From start %, target %, number of stages, and curve type, produce cumulative canary % and per-stage delta (% of total traffic) for runbooks or promotion scripts.

Parameters

Linear interpolates evenly on the percentage axis; exponential interpolates on a log scale, suited to climbing quickly from small %.

Stage Cumulative canary % Delta this stage %

Values round to one decimal; if start ≥ target you get an adjustment prompt. Exponential needs start ≥ 0.01% or we fall back to linear. Paste into sheets or scripts; map to mesh/Ingress vendor weights separately.

---
name: canary-release
description: Canary release: traffic splitting, metric gates, and automated rollback
version: 2.0
---
# Deploy canary
- Deploy canary version alongside stable (same cluster or separate namespace)
- Configure traffic split: start at 1-5% (internal users first)
- Confirm canary health checks pass before routing production traffic

# Traffic split stages
- Stage 1: 1-5% (internal / beta users)
- Stage 2: 10-20% (stable for 10 min, observe metrics)
- Stage 3: 50% (hold for at least 10 min)
- Stage 4: 100% (full promotion)

# Metric gates (check before advancing each stage)
- Error rate: canary < baseline * 2
- P99 latency: canary < baseline * 1.5
- Business events: checkout / login success rate within normal range

# Rollback conditions
- Any gate threshold exceeded → immediate rollback to stable
- Argo Rollouts: kubectl argo rollouts abort <rollout-name>
- Nginx: update split_clients weight back to 0% for canary

All skills More skills