Logging & distributed tracing
Guide Agents to design structured logs, propagate trace/span context, and correlate requests, errors, and business identifiers across services to shorten time to root cause.
A SKILL should define log fields (time, level, service name, trace_id, span_id, redacted user or tenant id), sampling policy, and sensitive-data filters—avoid logging secrets or full PII.
For tracing, document gateway or sidecar injection, async and message propagation, and how log platforms and APM join on trace_id.
For incidents, Agents can narrow by time and error fingerprint first, then pull upstream/downstream spans along the trace, then align with code paths and config change timelines.
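The triage order described above (time window and error fingerprint first, then the whole trace) can be sketched in a few lines of Python. This is a minimal illustration over in-memory records, not a query against any real log platform; the helper names (`parse_ts`, `traces_for_fingerprint`, `records_for_trace`) and the sample `LOGS` are hypothetical, and the record shape is assumed to follow the structured log format used in this guide.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; real data would come from the log platform.
LOGS = [
    {"timestamp": "2024-03-15T14:02:33.100Z", "level": "INFO",
     "service": {"name": "gateway"}, "trace_id": "a" * 32},
    {"timestamp": "2024-03-15T14:02:33.421Z", "level": "ERROR",
     "service": {"name": "payment-svc"}, "trace_id": "a" * 32,
     "error": {"stack_fingerprint": "sha256:a3b4"}},
    {"timestamp": "2024-03-15T14:02:34.000Z", "level": "INFO",
     "service": {"name": "order-svc"}, "trace_id": "b" * 32},
    {"timestamp": "2024-03-15T13:00:00.000Z", "level": "ERROR",
     "service": {"name": "payment-svc"}, "trace_id": "c" * 32,
     "error": {"stack_fingerprint": "sha256:a3b4"}},
]


def parse_ts(value):
    # Normalize the trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def traces_for_fingerprint(records, start, end, fingerprint):
    """Step 1: narrow by time window and error fingerprint; return affected trace ids."""
    hits = Counter()
    for rec in records:
        if not (start <= parse_ts(rec["timestamp"]) < end):
            continue
        if rec.get("error", {}).get("stack_fingerprint") == fingerprint:
            hits[rec["trace_id"]] += 1
    return [trace_id for trace_id, _ in hits.most_common()]


def records_for_trace(records, trace_id):
    """Step 2: pull every record on one trace, across services, in time order."""
    same_trace = [r for r in records if r.get("trace_id") == trace_id]
    return sorted(same_trace, key=lambda r: parse_ts(r["timestamp"]))
```

The time-ordered records for one trace are what an Agent would then line up against code paths and config change timelines.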
- Hard requirements: critical paths carry request_id/trace_id; error logs include actionable detail, not only stack summaries.
- Align with metrics: high-cardinality labels and full debug logs are anti-patterns the SKILL should forbid or throttle.
- Link to on-call playbooks: typical queries (e.g. service+status+trace_id).
Distributed context flow
                  [ Ingress: gateway / LB / sidecar ]
                                  │
                                  ▼
[ Extract or generate trace context (W3C traceparent / vendor headers) ]
                                  │
                                  ▼
[ In service: create root or child span; bind async boundaries (task / queue) ]
                                  │
                 ┌────────────────┴────────────────┐
                 ▼                                 ▼
[ Structured log: same record      [ Outbound: inject downstream
  carries trace_id + span_id         headers or message attrs ]
  + business keys ]
                 │                                 │
                 └────────────────┬────────────────┘
                                  ▼
        [ Store: log index + trace backend; join on trace_id ]
Correlation: trace, span, and request id
- Trace ID: identifies the full distributed request across services; in W3C Trace Context it is 32 hex chars. One trace has many spans.
- Span ID: identifies one unit of work in the trace (16 hex chars in W3C Trace Context); parent/child spans form a tree with time bounds.
- Correlation / request id: a business or gateway key that may coexist with the trace, for support tickets, idempotency, or legacy alignment. The SKILL should define the mapping (e.g. the gateway generates the request id and writes traceparent or a log field) so that one name never carries two meanings.
- Queues and batch: carry the W3C carrier or equivalent in message headers or the envelope; consumer spans take the producer span as their parent and keep the same trace id.
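To make the queue case concrete, the sketch below carries a traceparent value in message headers and starts a consumer span whose parent is the producer span. It uses only the standard library and hand-rolled helpers (`inject`, `extract`, `start_consumer_span` are hypothetical names); in a real service an OpenTelemetry propagator would normally do this work, so treat it as an illustration of the data flow only.

```python
import re
import secrets

# Version-00 traceparent: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex chars


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the current span into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers):
    """Consumer side: return (trace_id, parent_span_id, flags) or None if absent/invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    return match.groups() if match else None


def start_consumer_span(headers):
    """Continue the producer's trace when possible; otherwise start a new root."""
    parent = extract(headers)
    if parent is None:
        return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(),
                "parent_span_id": None}
    trace_id, parent_span_id, _flags = parent
    return {"trace_id": trace_id, "span_id": new_span_id(),
            "parent_span_id": parent_span_id}
```

A producer would call `inject(message["headers"], trace_id, span_id)` before publishing, and the consumer would call `start_consumer_span(message["headers"])` before handling the message, so both sides log the same trace id.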
Structured log JSON format standard (required fields and examples):
// Structured log JSON format standard
// Required fields (consistent across all services)
{
  "timestamp": "2024-03-15T14:02:33.421Z", // ISO 8601 UTC
  "level": "ERROR", // ERROR|WARN|INFO|DEBUG
  "service": {
    "name": "payment-svc",
    "version": "2.4.1",
    "instance": "payment-svc-7d9f8b-xk2p9" // pod/instance id
  },
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Database connection pool exhausted",
  // Required for error-level logs
  "error": {
    "type": "PoolExhaustedException",
    "message": "Connection pool exhausted after 30000ms",
    "stack_fingerprint": "sha256:a3b4c5..." // full stack in trace backend; only fingerprint here
  },
  // Business context (redacted)
  "request": {
    "id": "req-8f4a2d1c", // request_id (not user id)
    "method": "POST",
    "path": "/api/v1/payments" // no query params (may contain sensitive data)
  },
  // Optional: frequently accessed fields
  "user_id_hash": "sha256:f8a3...", // hashed, not plaintext
  "tenant_id": "tenant-123" // for multi-tenant routing
}
// Log level guidelines:
// ERROR: production issues requiring immediate attention (service unavailable, data loss risk)
// WARN: potential problems but service still running (degraded, retry succeeded, near threshold)
// INFO: key business events (request completed, state changed, start/stop)
// DEBUG: detailed troubleshooting info (dev/staging only; off by default in production)
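One way to emit records in the shape above is a custom formatter for Python's standard `logging` module. The sketch below is a minimal assumption-laden example: `JsonFormatter` is a hypothetical class name, trace and span ids are assumed to arrive through the `extra` argument of the logging call, and only the required fields plus a basic `error` object are produced.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service_name, service_version):
        super().__init__()
        self.service = {"name": service_name, "version": service_version}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 UTC with millisecond precision and a "Z" suffix
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Python calls the level WARNING; the format above uses WARN
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            # Set via logger.error(..., extra={"trace_id": ..., "span_id": ...})
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            exc_type, exc, _tb = record.exc_info
            # Only type and message here; the full stack stays out of the log line
            entry["error"] = {"type": exc_type.__name__, "message": str(exc)}
        return json.dumps(entry)
```

It would be attached with `handler.setFormatter(JsonFormatter("payment-svc", "2.4.1"))`; the stack fingerprint and business fields are left out to keep the sketch short.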
OpenTelemetry integration code
Node.js instrumentation (using @opentelemetry/sdk-node):
// Node.js OpenTelemetry initialization (run first at app entry)
// instrumentation.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-svc',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
  }),
  traceExporter: new OTLPTraceExporter({
    // `url` is used as-is, so it must include the /v1/traces path.
    // OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is the per-signal variable that holds the full URL;
    // OTEL_EXPORTER_OTLP_ENDPOINT is only a base URL and would be missing the path here.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
    }),
  ],
});
sdk.start();
// Manually create a Span and pass trace context
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

async function processPayment(req, paymentData) {
  const tracer = trace.getTracer('payment-svc');
  const span = tracer.startSpan('payment.process', {
    attributes: {
      'payment.amount': paymentData.amount,
      'payment.currency': paymentData.currency,
      'http.method': req.method,
    },
  });
  // Make the span active so spans created inside chargeCard become its children
  return context.with(trace.setSpan(context.active(), span), async () => {
    try {
      const result = await chargeCard(paymentData);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Python OpenTelemetry instrumentation:
# Python OpenTelemetry integration (FastAPI example)
# requirements: opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
#               opentelemetry-instrumentation-fastapi fastapi structlog
import os

import structlog
from fastapi import FastAPI
from pydantic import BaseModel
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


class PaymentRequest(BaseModel):
    # Illustrative request model; the fields are an assumption for this example
    amount: float
    currency: str


# Initialize Tracer
provider = TracerProvider(
    resource=Resource.create({
        "service.name": "payment-svc",
        "service.version": os.getenv("APP_VERSION", "unknown"),
    })
)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces")
))
trace.set_tracer_provider(provider)

# Auto-instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


# Structured log + trace_id correlation
def get_logger():
    span = trace.get_current_span()
    ctx = span.get_span_context()
    return structlog.get_logger().bind(
        trace_id=format(ctx.trace_id, '032x') if ctx.is_valid else None,
        span_id=format(ctx.span_id, '016x') if ctx.is_valid else None,
    )


@app.post("/payments")
async def create_payment(payment: PaymentRequest):
    tracer = trace.get_tracer("payment-svc")
    with tracer.start_as_current_span("payment.create") as span:
        # Bind the logger inside the span so the log carries this span's id
        log = get_logger()
        span.set_attribute("payment.amount", payment.amount)
        log.info("payment.processing", amount=payment.amount)
        # ... business logic
Trace ID propagation in HTTP requests (traceparent injection):
// Propagating trace context to downstream HTTP calls (Node.js fetch)
import type { IncomingMessage } from 'node:http';
import { propagation, context } from '@opentelemetry/api';

// Outbound: inject traceparent header automatically
async function callDownstreamService(path: string, body: unknown) {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };
  // Inject current trace context into HTTP headers
  // (uses the globally registered propagator; NodeSDK registers W3C Trace Context by default)
  propagation.inject(context.active(), headers);
  // After injection, headers contain:
  // traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  return fetch(`http://order-svc${path}`, {
    method: 'POST',
    headers,
    body: JSON.stringify(body),
  });
}

// Ingress: extract trace context from request headers (usually handled by auto-instrumentation)
// req.headers on a Node.js IncomingMessage is a plain object, which the default getter expects
function extractTraceContext(req: IncomingMessage) {
  const ctx = propagation.extract(context.active(), req.headers);
  // Operations run within ctx (e.g. via context.with) are linked to the parent span
  return ctx;
}
- Naming: use low-cardinality, searchable static fragments (e.g. HTTP GET /api/orders, db.query); do not embed user ids or full URLs in span names to avoid bloating the trace UI and storage.
- Async: explicitly pass context across asyncio, thread pools, and callbacks; avoid fire-and-forget tasks that drop the parent and create orphan spans.
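The thread-pool case is the one that most often drops context in Python. The sketch below uses only the standard library: a `ContextVar` stands in for the active trace (OpenTelemetry's Python context is, as far as I know, built on `contextvars` by default), and `submit_with_context` is a hypothetical helper that copies the caller's context into the worker. On a standard CPython build, work submitted without such a copy would generally not see the caller's value.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for "the active trace"; a real service would rely on the tracing library.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id():
    """Runs in the worker thread and reports which trace it believes it is in."""
    return current_trace_id.get()


def submit_with_context(executor, fn, *args):
    """Snapshot the caller's context and run fn inside that snapshot in the worker."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


def demo():
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The worker sees the caller's trace id because the context was copied.
        return submit_with_context(pool, read_trace_id).result()
```

The same idea applies to callbacks: capture the context where the work is scheduled, and run the callback inside it.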
Log aggregation config and safety boundaries
- Required or strongly enforced: timestamp, level, service.name (or equivalent), trace_id, span_id when available, message or event code.
- Errors: add error.type, error.message (redacted), optional error.stack_fingerprint; never log full secrets or tokens.
- Sampling: full debug only in staging or short-lived incidents; production defaults to level-based filtering, always-on errors, and tail or ratio sampling. Document the policy in the SKILL.
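A sampling policy of this kind can be written down as a small pure function. The sketch below is one possible reading, not a prescribed policy: errors are always kept, keeping WARN as well is my assumption, DEBUG is dropped as it would be in production, and everything else is ratio-sampled deterministically from the trace id so that every service makes the same decision for the same trace. `keep_log` and `ALWAYS_KEEP` are hypothetical names.

```python
ALWAYS_KEEP = {"ERROR", "WARN"}  # keeping WARN is an assumption of this sketch


def keep_log(level, trace_id, ratio):
    """Decide whether a production log line is kept.

    ratio is the fraction of traces (0.0 to 1.0) whose INFO lines are kept.
    """
    if level in ALWAYS_KEEP:
        return True
    if level == "DEBUG":
        return False  # off by default in production
    # Use the low 64 bits of the trace id as a stable pseudo-random value.
    # Integer comparison avoids float rounding at the ratio == 1.0 edge.
    low_bits = int(trace_id[-16:], 16)
    return low_bits < int(ratio * 2**64)
```

Because the decision depends only on the trace id, either all INFO lines of a trace are kept or none are, which keeps sampled traces readable end to end.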
Loki log aggregation indexing strategy configuration:
# Loki indexing strategy (promtail config)
# Key principle: keep label cardinality low (high cardinality degrades Loki performance sharply)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse JSON structured logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
            service: service.name
      # Only promote low-cardinality fields to Loki labels (used for indexing)
      - labels:
          level: # ERROR/WARN/INFO/DEBUG (low cardinality)
          service: # service name (low cardinality)
      # Note: do NOT set trace_id as a label! (high cardinality, use |= for filtering)
      # Filter sensitive fields
      # The replace stage substitutes the content of each capture group,
      # so the group wraps only the secret value
      - replace:
          expression: '"password"\s*:\s*"([^"]*)"'
          replace: '[REDACTED]'
# Query example: correlate logs by trace_id
# {service="payment-svc"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
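For playbooks, the query example above can be generated rather than typed by hand. The helper below is a hypothetical sketch (`logql_for_trace` is not part of any Loki client): it builds the same kind of LogQL string from a service name, a trace id, and an optional level, and validates the inputs first so that nothing unexpected ends up inside the query.

```python
import re

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
LABEL_VALUE_RE = re.compile(r"[A-Za-z0-9_.-]+")
LEVELS = {"ERROR", "WARN", "INFO", "DEBUG"}


def logql_for_trace(service, trace_id, level=None):
    """Build a LogQL query: label matchers on low-cardinality labels, line filter on trace id."""
    if not LABEL_VALUE_RE.fullmatch(service):
        raise ValueError(f"unexpected service name: {service!r}")
    if not TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError(f"not a 32-char lowercase hex trace id: {trace_id!r}")
    matchers = [f'service="{service}"']
    if level is not None:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        matchers.append(f'level="{level}"')
    # trace_id is deliberately a line filter (|=), never a label matcher
    return "{" + ", ".join(matchers) + "}" + f' |= "{trace_id}"'
```

Keeping the trace id in the line filter matches the indexing rule above: labels stay low-cardinality, and the high-cardinality value is searched in the log body.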
Trace id / traceparent validator
The validator accepts either a 32-character hex trace id or a full traceparent header (00-<trace>-<parent-span>-<flags>). Validation follows W3C Trace Context and checks length and charset:
Trace id must be exactly 32 hex chars and not all zeros; traceparent parent id (third segment) must be 16 hex chars and not all zeros; version and flags are 2 hex chars each.
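The rules above translate directly into a small validator. The Python sketch below is a minimal version under two stated assumptions: it only accepts the four-field layout of version 00, and it requires lowercase hex and rejects version ff, which is how I read the W3C Trace Context specification. `is_valid_trace_id` and `parse_traceparent` are hypothetical names.

```python
import re

HEX32 = re.compile(r"[0-9a-f]{32}")
HEX16 = re.compile(r"[0-9a-f]{16}")
HEX2 = re.compile(r"[0-9a-f]{2}")


def is_valid_trace_id(value):
    """Exactly 32 lowercase hex chars and not all zeros."""
    return HEX32.fullmatch(value) is not None and value != "0" * 32


def parse_traceparent(header):
    """Return the four fields as a dict, or None if the header is invalid."""
    parts = header.strip().split("-")
    if len(parts) != 4:  # assumption: version-00 layout only
        return None
    version, trace_id, parent_id, flags = parts
    if HEX2.fullmatch(version) is None or version == "ff":
        return None
    if not is_valid_trace_id(trace_id):
        return None
    if HEX16.fullmatch(parent_id) is None or parent_id == "0" * 16:
        return None
    if HEX2.fullmatch(flags) is None:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}
```

Calling `parse_traceparent` on the example header used earlier in this guide returns its four fields, while an all-zero trace id or parent id yields `None`.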
---
name: logging-tracing
description: Structured logging and distributed tracing context specification
model: claude-sonnet-4-5
---
# Required structured log fields
required_fields:
  - timestamp (ISO 8601 UTC)
  - level (ERROR|WARN|INFO|DEBUG)
  - service.name + service.version
  - trace_id (32-char hex, when in a trace)
  - span_id (16-char hex, when available)
  - message or event code
# Log level conventions
log_levels:
  ERROR: requires immediate attention (service unavailable / data loss risk)
  WARN: potential issue but still running (degraded / retry succeeded / near threshold)
  INFO: key business events (request completed / state changed / start or stop)
  DEBUG: detailed troubleshooting (dev/staging only; off by default in production)
# Security constraints
security:
  forbidden_in_logs: [password, token, credit_card, ssn, api_key]
  pii_handling: hash_or_truncate (user_id hashed; email keeps domain only)
  stack_trace: fingerprint_only (full stack in trace backend)
# Loki indexing strategy
loki_labels: [level, service] # low-cardinality labels
loki_forbidden_labels: [trace_id, user_id, request_id] # high-cardinality
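The security constraints in the SKILL above can be enforced in code before a record is serialized. The sketch below is one hedged way to do it: `redact`, `hash_user_id`, and `email_domain_only` are hypothetical helpers, the forbidden key list is copied from the SKILL, and the salted SHA-256 scheme is an assumption rather than a stated requirement.

```python
import hashlib

FORBIDDEN_KEYS = {"password", "token", "credit_card", "ssn", "api_key"}


def redact(value):
    """Recursively replace values stored under forbidden keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if isinstance(key, str) and key.lower() in FORBIDDEN_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def hash_user_id(user_id, salt):
    """Hash a user id so logs can be correlated without storing the plaintext."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return "sha256:" + digest


def email_domain_only(email):
    """Keep only the domain part of an email address."""
    return "*@" + email.rsplit("@", 1)[-1]
```

Running `redact` on the record just before formatting keeps the rule in one place, so individual call sites do not have to remember which keys are forbidden.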