Event-driven design
Guides agents to separate commands from events and define publisher ownership; align “database committed—with “message visible—using a transactional outbox; document Saga orchestration/choreography, compensation, and human touchpoints. This page includes a collapsible table of contents and an event-card draft lab (evt-page script).
Name domain events in the past tense and carry the smallest useful payload; separate integration events from internal ones so aggregate internals do not leak to the global bus.
For cross-service consistency, state the eventual-consistency window and human touchpoints explicitly; each Saga step must be retryable, compensatable, or traceably abortable with a reason.
In a SKILL, require a consumer list and subscription versioning strategy so implicit coupling does not create deployment-order hell.
One-page summary
- Commands change state; events state facts; the publisher owns integration-event schema and semantic versioning.
- When visibility must align with a local transaction, use an outbox (or audited CDC) to avoid dual-write races.
- Long-running work across boundaries uses a Saga: explicit state machine, compensation order, and message delivery semantics (at-least-once + idempotency) aligned.
// CloudEvents 1.0 canonical event format (JSON)
{
"specversion": "1.0",
"id": "b46d462a-4c6f-4c1e-9e8b-3f5d1c7a9b2c",
"type": "com.acme.ordering.order.placed.v1",
"source": "/ordering-service/orders",
"subject": "order/88421",
"time": "2024-03-15T10:30:00Z",
"datacontenttype": "application/json",
"dataschema": "https://schemas.acme.com/ordering/order-placed/v1.json",
"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
"data": {
"orderId": "88421",
"customerId": "cust-12345",
"totalAmount": 99.50,
"currency": "USD",
"items": [{ "skuId": "sku-001", "qty": 2, "price": 49.75 }]
}
}
// Event naming convention: {domain}.{aggregate}.{event}.{version}
// type: "com.acme.ordering.order.placed.v1"
// ───────── ──────── ───── ─────── ──
// org domain agg event version
//
// Partition key (Kafka / ordering guarantee): use aggregate ID from subject
// kafka_key = "order/88421" → all events for the same order go to the same partition
// Event schema versioning: forward/backward compatibility strategy
// Backward-compatible (new consumer can read old events):
// ✅ Add optional fields (with defaults)
// ✅ Extend enum values
// ❌ Remove fields / change field types / rename fields
// v1 event (old)
{ "orderId": "88421", "status": "placed" }
// v2 event (added optional field, backward-compatible)
{ "orderId": "88421", "status": "placed", "customerId": "cust-123" }
// Forward-compatible (old consumer can read new events, ignores unknown fields):
// Consumer must use @JsonIgnoreProperties(ignoreUnknown=true) or equivalent
// For breaking changes: publish a new type (.v2) and maintain dual-version consumer window
// Python example (pydantic forward compatibility)
from pydantic import BaseModel, ConfigDict
class OrderPlacedV1(BaseModel):
model_config = ConfigDict(extra="ignore") # ignore unknown fields
orderId: str
status: str
customerId: str | None = None # new optional field has a default
End-to-end flow (skill-flow-block)
[ Command ingress: API / scheduler / integration adapter ]
│
▼
┌──────────────────────│
│ Aggregate txn: state │──── In-domain: domain events (in-memory collection)
│ + outbox row (same │ Integration payload trimmed at mapping layer
│ transaction) │
└──────────────────────│
│
▼
┌──────────────────────│
│ Outbox dispatcher: │──── Mark published / retry / DLQ;
│ poll / CDC / log tail│ align with broker at-least-once
└──────────────────────│
│
▼
┌──────────────────────│
│ Message bus: topic / │──── Ordering key, TTL, DLQ, schema registry
│ partition key / group│
└──────────────────────│
│
┌───────────┴───────────│
▼ ▼
[ Consumer: idempotent ] [ Saga: orchestrator or choreography ]
│ │
└───────────┬───────────│
▼
[ Observability: trace / business metrics / audit ]
If you skip the outbox, document in the SKILL the acceptable loss window or alternative (e.g. synchronous API + retry-storm risk); do not assume fire-and-forget delivery.
Transactional outbox
Within one database transaction, write business rows and an outbox table (or equivalent log entry); a separate process publishes in order so “business committed—and “externally visible message—move together monotonically.
- Table shape: event type, payload version, aggregate id, correlation/causation ids, publish status and timestamps.
- Dispatcher: single writer or partition lock to avoid duplicate publish; backoff and poison handling documented with DLQ policy.
- vs CDC: outbox models message boundaries explicitly; CDC must handle change noise and projection lag—document SLIs in the SKILL.
-- Outbox table DDL (PostgreSQL)
CREATE TABLE outbox_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(64) NOT NULL, -- e.g. 'Order'
aggregate_id VARCHAR(128) NOT NULL, -- e.g. '88421'
event_type VARCHAR(128) NOT NULL, -- e.g. 'com.acme.ordering.order.placed.v1'
payload JSONB NOT NULL,
correlation_id UUID, -- trace causality chain
causation_id UUID, -- ID of upstream event that triggered this one
status VARCHAR(16) NOT NULL DEFAULT 'pending', -- pending/published/failed
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
published_at TIMESTAMPTZ,
retry_count INT NOT NULL DEFAULT 0
);
CREATE INDEX idx_outbox_pending ON outbox_events (created_at) WHERE status = 'pending';
-- Write business data + outbox in one transaction (Python + psycopg2)
def place_order(conn, order_data: dict) -> str:
with conn.cursor() as cur:
# 1. Write business data
cur.execute("INSERT INTO orders (...) VALUES (...) RETURNING id", order_data)
order_id = cur.fetchone()[0]
# 2. Write outbox in same transaction
cur.execute("""
INSERT INTO outbox_events (aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', %s, 'com.acme.ordering.order.placed.v1', %s)
""", (str(order_id), json.dumps({"orderId": str(order_id), **order_data})))
conn.commit() # both committed atomically
return order_id
Saga: orchestration, choreography, compensation
Orchestration: a central process issues commands and waits for completion events—simple mental model but coordinator availability is a risk. Choreography: services subscribe to each other’s events—decentralized but needs strong contracts and observability.
- Per step define: input event, local side effect, success event, compensation or retry on failure, timeout and human ticket entry.
- Compensation should be semantic (cancel booking, refund) rather than physical deletes; when not possible, document account freeze or ops workflow.
- With outbox: Saga log and local transaction in the same transaction or clearly separated eventual-consistency segments to avoid “next step sent while prior step not persisted” ghost states.
## Saga pattern comparison
### Choreography — decentralized, each service subscribes to others' events
# OrderService publishes OrderPlaced
# InventoryService listens → deducts stock → publishes InventoryReserved
# PaymentService listens → charges → publishes PaymentProcessed
# Compensation: PaymentFailed → InventoryService listens → releases stock → publishes InventoryReleased
# Pros: no single point of failure, loosely coupled
# Cons: event chain is hard to trace, testing is complex, adding a service requires updating multiple subscriptions
### Orchestration — central coordinator explicitly controls each step
class OrderSaga:
def __init__(self, order_id: str):
self.order_id = order_id
self.state = "STARTED"
def execute(self):
try:
# Step 1: reserve inventory
self.state = "RESERVING_INVENTORY"
inventory_svc.reserve(self.order_id)
# Step 2: charge payment
self.state = "PROCESSING_PAYMENT"
payment_svc.charge(self.order_id)
# Step 3: ship order
self.state = "SHIPPING"
shipping_svc.ship(self.order_id)
self.state = "COMPLETED"
except InventoryError:
self.compensate_inventory()
self.state = "FAILED"
raise
except PaymentError:
self.compensate_payment()
self.compensate_inventory() # compensation order is reverse of forward steps
self.state = "FAILED"
raise
def compensate_inventory(self):
# Semantic compensation: release reservation, not physical record deletion
inventory_svc.release(self.order_id)
def compensate_payment(self):
payment_svc.refund(self.order_id)
# Pros: centralized flow, easy to trace, timeout and retry managed in one place
# Cons: coordinator is a single point of failure; needs high-availability deployment
Event contracts and consumers
- Orchestration vs choreography: central coordinator SPOF vs decentralized trade-offs.
- Ordering and duplicates: align with broker delivery semantics; idempotency keys and dedupe windows belong in the contract.
- Evolution: forward/backward schema compatibility rules; deprecated fields and dual-write/read-new strategy.
SKILL front-matter example
---
name: event-driven-design
description: Map domain event flows, CloudEvents format, and cross-service Saga collaboration
---
# Event contract
Format: CloudEvents 1.0 (specversion/id/type/source/subject/time/data)
type naming: {org}.{domain}.{aggregate}.{event}.{version}
Partition key: aggregate ID in subject, guarantees same-aggregate event ordering
# Schema versioning
Backward-compatible: add optional fields (with defaults) / extend enums
Breaking change: publish new type (.v2), consumers window dual-version consumption
Consumer side: extra="ignore" (ignore unknown fields) ensures forward compatibility
# Outbox pattern
Table: id/aggregate_type/aggregate_id/event_type/payload/status/retry_count
Write business row + outbox in same transaction; dispatcher polls status=pending to publish
# Saga selection
Choreography: simple flow / loose coupling / fewer than 3 steps → each service subscribes to events
Orchestration: complex flow / unified tracing / timeout management → central coordinator
Compensation order: reverse of forward steps; semantic (refund/release) not physical deletes
# Output structure
Event card: name/type/source/payload fields/consumer list
Saga diagram: steps/success events/failure compensation/timeout entry points
Failure: retry (at-least-once + idempotency) / compensation / human ticket
Event card draft lab (evt-page JS)
For reviews or tickets: checkboxes append Outbox / Saga checklist lines automatically.
Lab output is for communication only; real contracts must match schema registry, consumer repos, and Saga state-machine definitions.