Logging & distributed tracing

Guide Agents to design structured logs, propagate trace/span context, and correlate requests, errors, and business identifiers across services to shorten time to root cause.

A SKILL should define log fields (time, level, service name, trace_id, span_id, redacted user or tenant id), sampling policy, and sensitive-data filters—avoid logging secrets or full PII.

For tracing, document gateway or sidecar injection, async and message propagation, and how log platforms and APM join on trace_id.

For incidents, Agents can narrow by time and error fingerprint first, then pull upstream/downstream spans along the trace, then align with code paths and config change timelines.
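
The first two triage steps above can be sketched in plain Python. This is a minimal illustration, assuming log records are already loaded as dicts in the structured format this guide describes; the function names and the sample `LOGS` data are hypothetical, and the third step (aligning with code paths and config changes) is not covered.

```python
from collections import defaultdict
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2024-03-15T14:02:33.421Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def traces_for_fingerprint(records, fingerprint, start, end):
    """Step 1: narrow by time window and error fingerprint; return matching trace ids."""
    hits = set()
    for rec in records:
        fp = (rec.get("error") or {}).get("stack_fingerprint")
        in_window = start <= parse_ts(rec["timestamp"]) <= end
        if fp == fingerprint and in_window and rec.get("trace_id"):
            hits.add(rec["trace_id"])
    return hits


def records_by_trace(records, trace_ids):
    """Step 2: pull every record, from any service, that shares those trace ids."""
    grouped = defaultdict(list)
    for rec in records:
        if rec.get("trace_id") in trace_ids:
            grouped[rec["trace_id"]].append(rec)
    for recs in grouped.values():
        recs.sort(key=lambda r: parse_ts(r["timestamp"]))  # oldest first
    return dict(grouped)


# Hypothetical sample records: two services share one trace, a third record is unrelated.
LOGS = [
    {"timestamp": "2024-03-15T14:02:33.421Z", "service": "payment-svc",
     "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
     "error": {"stack_fingerprint": "sha256:a3b4c5"}},
    {"timestamp": "2024-03-15T14:02:33.100Z", "service": "order-svc",
     "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"timestamp": "2024-03-15T14:05:00.000Z", "service": "order-svc",
     "trace_id": "0af7651916cd43dd8448eb211c80319c"},
]

window = (parse_ts("2024-03-15T14:00:00Z"), parse_ts("2024-03-15T14:03:00Z"))
suspects = traces_for_fingerprint(LOGS, "sha256:a3b4c5", *window)
timeline = records_by_trace(LOGS, suspects)
```

In a real deployment the same two steps would usually be queries against the log index and trace backend rather than in-memory loops.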

  • Hard requirements: critical paths carry request_id / trace_id; error logs include actionable detail, not only stack summaries.
  • Align with metrics: high-cardinality labels and full debug logs are anti-patterns the SKILL should forbid or throttle.
  • Link to on-call playbooks: typical queries (e.g. service + status + trace_id).

Distributed context flow

  [ Ingress: gateway / LB / sidecar ]
                    │
                    ▼
         [ Extract or generate trace context (W3C traceparent / vendor headers) ]
                    │
                    ▼
    [ In service: create root or child span; bind async boundaries (task / queue) ]
                    │
           ┌────────┴────────┐
           ▼                 ▼
  [ Structured log: same record carries trace_id + span_id + business keys ]   [ Outbound: inject downstream headers or message attrs ]
           │                 │
           └────────┬────────┘
                    ▼
         [ Store: log index + trace backend; join on trace_id ]

On review, check three things: context is created or inherited at entry, it continues across async boundaries and messages, and log fields use the same id semantics as spans (avoid a random “request id” that never joins traces).

Correlation: trace, span, and request id

  • Trace ID: identifies the full distributed request across services; in W3C Trace Context it is 32 hex chars. One trace has many spans.
  • Span ID: identifies one unit of work in the trace (16 hex chars in W3C Trace Context); parent/child spans form a tree with time bounds.
  • Correlation / request id: a business or gateway key that may coexist with the trace id, for support tickets, idempotency, or legacy alignment. The SKILL should define the mapping (e.g. the gateway generates the request id and also writes traceparent or a log field) so that one name never carries two meanings.
  • Queues and batch: carry the W3C carrier or an equivalent in message headers or the envelope; the consumer span takes the producer span as its parent and keeps the same trace id.
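
The queue case in the last bullet can be sketched without any tracing library. This is a simplified illustration, assuming an in-memory list stands in for the broker; `publish`, `consume`, and the span dict shape are hypothetical names, and a real service would normally let its tracing SDK do the inject and extract.

```python
import secrets


def publish(queue, body, traceparent):
    """Producer side: carry the W3C traceparent in the message headers."""
    queue.append({"headers": {"traceparent": traceparent}, "body": body})


def consume(queue):
    """Consumer side: start a span that keeps the trace id and parents on the producer span."""
    msg = queue.pop(0)
    _version, trace_id, producer_span_id, _flags = msg["headers"]["traceparent"].split("-")
    span = {
        "trace_id": trace_id,                # same trace as the producer
        "span_id": secrets.token_hex(8),     # new 16-hex id for the consumer's unit of work
        "parent_span_id": producer_span_id,  # links the consumer span under the producer span
    }
    return span, msg["body"]


queue = []
publish(queue, {"order": 42}, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
consumer_span, payload = consume(queue)
```

The point of the sketch is the invariant: the trace id is copied unchanged, while the span id is always freshly generated on the receiving side.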

Structured log JSON format standard (required fields and examples):

// Structured log JSON format standard
// Required fields (consistent across all services)
{
  "timestamp": "2024-03-15T14:02:33.421Z",   // ISO 8601 UTC
  "level": "ERROR",                            // ERROR|WARN|INFO|DEBUG
  "service": {
    "name": "payment-svc",
    "version": "2.4.1",
    "instance": "payment-svc-7d9f8b-xk2p9"   // pod/instance id
  },
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Database connection pool exhausted",

  // Required for error-level logs
  "error": {
    "type": "PoolExhaustedException",
    "message": "Connection pool exhausted after 30000ms",
    "stack_fingerprint": "sha256:a3b4c5..."  // full stack in trace backend; only fingerprint here
  },

  // Business context (redacted)
  "request": {
    "id": "req-8f4a2d1c",                    // request_id (not user id)
    "method": "POST",
    "path": "/api/v1/payments"               // no query params (may contain sensitive data)
  },

  // Optional: frequently accessed fields
  "user_id_hash": "sha256:f8a3...",          // hashed, not plaintext
  "tenant_id": "tenant-123"                  // for multi-tenant routing
}

// Log level guidelines:
// ERROR: production issues requiring immediate attention (service unavailable, data loss risk)
// WARN:  potential problems but service still running (degraded, retry succeeded, near threshold)
// INFO:  key business events (request completed, state changed, start/stop)
// DEBUG: detailed troubleshooting info (dev/staging only; off by default in production)
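
One way to emit records in this shape is a custom formatter for Python's standard `logging` module. The sketch below is an assumption-laden illustration, not a reference implementation: `JsonFormatter` and `LEVEL_NAMES` are hypothetical names, trace and span ids are expected to arrive as extra attributes on the record, and the fingerprint is simply a hash of the exception type plus frame names.

```python
import hashlib
import json
import logging
import traceback
from datetime import datetime, timezone

# Map Python's level names onto the ERROR|WARN|INFO|DEBUG set used in the format above.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "ERROR"}


class JsonFormatter(logging.Formatter):
    def __init__(self, service_name, service_version):
        super().__init__()
        self.service = {"name": service_name, "version": service_version}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            # Assumption: callers attach ids via `extra=` or a logging filter.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, tb = record.exc_info
            # Fingerprint only: hash the exception type and frame names, not the full stack.
            frames = "|".join(f"{f.filename}:{f.name}" for f in traceback.extract_tb(tb))
            digest = hashlib.sha256(f"{exc_type.__name__}|{frames}".encode()).hexdigest()
            entry["error"] = {
                "type": exc_type.__name__,
                "message": str(exc),
                "stack_fingerprint": f"sha256:{digest}",
            }
        return json.dumps(entry)


# Usage: format one record by hand (a handler would normally do this).
record = logging.LogRecord("payment", logging.WARNING, "example.py", 0,
                           "pool at %d%% capacity", (90,), None)
record.trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
record.span_id = "00f067aa0ba902b7"
line = JsonFormatter("payment-svc", "2.4.1").format(record)
print(line)
```

The exception message is passed through unchanged here; a production formatter would also need to redact it before writing.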

OpenTelemetry integration code

Node.js instrumentation (using @opentelemetry/sdk-node):

// Node.js OpenTelemetry initialization (run first at app entry)
// instrumentation.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-svc',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
  }),
  traceExporter: new OTLPTraceExporter({
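    // The `url` option is used as-is, so read the signal-specific variable that holds the full path
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://otel-collector:4318/v1/traces',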
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
    }),
  ],
});
sdk.start();

// Manually create a Span and pass trace context
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

async function processPayment(req, paymentData) {
  const tracer = trace.getTracer('payment-svc');
  const span = tracer.startSpan('payment.process', {
    attributes: {
      'payment.amount': paymentData.amount,
      'payment.currency': paymentData.currency,
      'http.method': req.method,
    }
  });

  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      const result = await chargeCard(paymentData);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

Python OpenTelemetry instrumentation:

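# Python OpenTelemetry integration (FastAPI example)
# requirements: opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
#               opentelemetry-instrumentation-fastapi fastapi structlog

import os

import structlog
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from pydantic import BaseModel


class PaymentRequest(BaseModel):
    # Hypothetical request model for this example; the real schema is not shown
    amount: float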

# Initialize Tracer
provider = TracerProvider(
    resource=Resource.create({
        "service.name": "payment-svc",
        "service.version": os.getenv("APP_VERSION", "unknown"),
    })
)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
))
trace.set_tracer_provider(provider)

# Auto-instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Structured log + trace_id correlation
def get_logger():
    span = trace.get_current_span()
    ctx = span.get_span_context()
    return structlog.get_logger().bind(
        trace_id=format(ctx.trace_id, '032x') if ctx.is_valid else None,
        span_id=format(ctx.span_id, '016x') if ctx.is_valid else None,
    )

@app.post("/payments")
async def create_payment(payment: PaymentRequest):
    tracer = trace.get_tracer("payment-svc")
    with tracer.start_as_current_span("payment.create") as span:
        # Bind the logger inside the span so span_id refers to payment.create,
        # not to the enclosing server span
        log = get_logger()
        span.set_attribute("payment.amount", payment.amount)
        log.info("payment.processing", amount=payment.amount)
        # ... business logic

Trace ID propagation in HTTP requests (traceparent injection):

// Propagating trace context to downstream HTTP calls (Node.js fetch)
import { propagation, context } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

// Outbound: inject traceparent header automatically
async function callDownstreamService(path: string, body: unknown) {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };
  // Inject current trace context into HTTP headers
  propagation.inject(context.active(), headers);
  // After injection, headers contain:
  // traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

  return fetch(`http://order-svc${path}`, {
    method: 'POST',
    headers,
    body: JSON.stringify(body),
  });
}

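// Ingress: extract trace context from request headers (usually handled by auto-instrumentation).
// Note: the default getter reads plain-object carriers such as Node's IncomingMessage headers;
// a Fetch API Headers instance would need a custom TextMapGetter passed as the third argument.
function extractTraceContext(req: Request) {
  const ctx = propagation.extract(context.active(), req.headers);
  // Run downstream work inside context.with(ctx, ...) so new spans become children of the remote parent
  return ctx;
}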

  • Naming: use low-cardinality, searchable static fragments (e.g. HTTP GET /api/orders, db.query); do not embed user ids or full URLs in span names to avoid bloating the trace UI and storage.
  • Async: explicitly pass context across asyncio, thread pools, and callbacks; avoid fire-and-forget tasks that drop the parent and create orphan spans.
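
The async bullet can be illustrated with the standard library alone. The sketch below assumes a `contextvars.ContextVar` stands in for the tracing SDK's current-context storage (OpenTelemetry's Python API is built on the same mechanism); `current_trace_id`, `blocking_work`, and `handle_request` are hypothetical names.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the SDK's "current span" storage.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def blocking_work():
    """Runs in a worker thread; sees the trace id only if the context was passed along."""
    return current_trace_id.get()


async def async_child():
    """Runs as an asyncio task; tasks copy the creator's context when they are created."""
    return current_trace_id.get()


async def handle_request(trace_id):
    current_trace_id.set(trace_id)
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # run_in_executor does not copy context variables by itself, so on a
        # standard CPython build the worker typically sees the default (None).
        implicit = await loop.run_in_executor(pool, blocking_work)
        # Explicit hand-off: snapshot the current context and run the work inside it.
        ctx = contextvars.copy_context()
        explicit = await loop.run_in_executor(pool, ctx.run, blocking_work)
    from_task = await asyncio.create_task(async_child())
    return implicit, explicit, from_task


implicit, explicit, from_task = asyncio.run(
    handle_request("4bf92f3577b34da6a3ce929d0e0e4736")
)
```

`asyncio.to_thread` performs the same context copy automatically, which is often the simpler choice when a dedicated executor is not required.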

Log aggregation config and safety boundaries

  • Required or strongly enforced: timestamp, level, service.name (or equivalent), trace_id, span_id when available, message or event code.
  • Errors: add error.type, error.message (redacted), optional error.stack_fingerprint; never log full secrets or tokens.
  • Sampling: enable full debug logging only in staging or for short-lived incidents; in production, filter by level, always keep errors, and apply tail or ratio sampling to the rest. Document the policy in the SKILL.
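
A ratio-based version of that sampling policy might look like the sketch below. It is an illustration under assumptions: `keep_log` is a hypothetical helper, the decision is derived from the last 16 hex characters of the trace id so that every record of a trace gets the same verdict, and tail sampling (deciding after the trace completes) is out of scope.

```python
def keep_log(level, trace_id, ratio=0.1, debug_enabled=False):
    """Decide whether to ship a log record.

    ERROR is always kept; DEBUG is dropped unless explicitly enabled (staging or
    an incident); everything else is sampled deterministically by trace id.
    """
    if level == "ERROR":
        return True
    if level == "DEBUG" and not debug_enabled:
        return False
    # Map the low 64 bits of the trace id onto [0, 1) and compare with the ratio.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < ratio


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
decisions = {
    "error_with_zero_ratio": keep_log("ERROR", tid, ratio=0.0),
    "debug_in_production": keep_log("DEBUG", tid, ratio=1.0),
    "info_at_half": keep_log("INFO", tid, ratio=0.5),
    "info_at_full": keep_log("INFO", tid, ratio=1.0),
}
```

Because the verdict depends only on the trace id, services that apply the same ratio keep or drop the same traces, which keeps cross-service log correlation intact.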

Loki log aggregation indexing strategy configuration:

# Loki indexing strategy (promtail config)
# Key principle: keep label cardinality low (high cardinality degrades Loki performance sharply)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse JSON structured logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
            service: service.name
      # Only promote low-cardinality fields to Loki labels (used for indexing)
      - labels:
          level:        # ERROR/WARN/INFO/DEBUG (low cardinality)
          service:      # service name (low cardinality)
          # Note: do NOT set trace_id as a label! (high cardinality, use |= for filtering)
      # Filter sensitive fields
      # Redact sensitive values (the replace stage substitutes the captured group)
      - replace:
          expression: '"password"\s*:\s*"([^"]*)"'
          replace: '[REDACTED]'

# Query example: correlate logs by trace_id
# {service="payment-svc"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

Trace id / traceparent validation

A trace id is valid under W3C Trace Context only if it is exactly 32 hex characters and not all zeros. A full traceparent header has the form 00-<trace>-<parent-span>-<flags>: the version and flags are 2 hex characters each, and the parent id (third segment) must be 16 hex characters and not all zeros.
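
The validation rules above can be sketched in a few lines of Python. The function names are hypothetical, and two details go beyond the text here: W3C Trace Context requires lowercase hex, and it reserves version `ff` as invalid. The sketch also assumes the four-field layout of version 00 for every version.

```python
import re

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def is_valid_trace_id(value):
    """32 lowercase hex characters, not all zeros."""
    return bool(TRACE_ID_RE.fullmatch(value)) and value != "0" * 32


def is_valid_traceparent(header):
    """version-traceid-parentid-flags with the length, charset, and all-zero checks."""
    match = TRACEPARENT_RE.fullmatch(header)
    if not match:
        return False
    version, trace_id, parent_id, _flags = match.groups()
    return version != "ff" and trace_id != "0" * 32 and parent_id != "0" * 16


checks = {
    "good": is_valid_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"),
    "zero_trace": is_valid_traceparent("00-" + "0" * 32 + "-00f067aa0ba902b7-01"),
    "zero_parent": is_valid_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-" + "0" * 16 + "-01"),
    "short_trace": is_valid_trace_id("4bf92f35"),
}
```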

---
name: logging-tracing
description: Structured logging and distributed tracing context specification
model: claude-sonnet-4-5
---

# Required structured log fields
required_fields:
  - timestamp (ISO 8601 UTC)
  - level (ERROR|WARN|INFO|DEBUG)
  - service.name + service.version
  - trace_id (32-char hex, when in a trace)
  - span_id (16-char hex, when available)
  - message or event code

# Log level conventions
log_levels:
  ERROR: requires immediate attention (service unavailable / data loss risk)
  WARN:  potential issue but still running (degraded / retry succeeded / near threshold)
  INFO:  key business events (request completed / state changed / start or stop)
  DEBUG: detailed troubleshooting (dev/staging only; off by default in production)

# Security constraints
security:
  forbidden_in_logs: [password, token, credit_card, ssn, api_key]
  pii_handling: hash_or_truncate (user_id hashed; email keeps domain only)
  stack_trace: fingerprint_only (full stack in trace backend)

# Loki indexing strategy
loki_labels: [level, service]  # low-cardinality labels
loki_forbidden_labels: [trace_id, user_id, request_id]  # high-cardinality
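
The security constraints in this spec could be enforced in application code before a record is serialized. The sketch below is one possible reading, with hypothetical names (`FORBIDDEN_KEYS`, `redact`); it uses a plain SHA-256 of the user id for brevity, whereas a keyed hash (for example HMAC with a secret) would resist guessing of small id spaces better.

```python
import hashlib

# Mirrors forbidden_in_logs from the spec above.
FORBIDDEN_KEYS = {"password", "token", "credit_card", "ssn", "api_key"}


def redact(value):
    """Return a copy of a log payload with secrets masked and PII reduced."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            lowered = key.lower()
            if lowered in FORBIDDEN_KEYS:
                cleaned[key] = "[REDACTED]"
            elif lowered == "user_id":
                # pii_handling: user_id hashed, never logged in plaintext
                digest = hashlib.sha256(str(item).encode()).hexdigest()
                cleaned["user_id_hash"] = f"sha256:{digest}"
            elif lowered == "email" and isinstance(item, str) and "@" in item:
                # pii_handling: email keeps domain only
                cleaned[key] = "***@" + item.rsplit("@", 1)[1]
            else:
                cleaned[key] = redact(item)  # recurse into nested dicts and lists
        return cleaned
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


safe = redact({
    "message": "login failed",
    "user_id": 1234,
    "email": "alice@example.com",
    "auth": {"password": "hunter2", "api_key": "abc123"},
})
```

Key-based masking only catches secrets that appear under known field names; free-text messages still need a pattern-based filter such as the log pipeline stage shown earlier.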
