Race condition debugging

Make ordering explicit: non-atomic compound operations on shared mutable state can break invariants under interleaved execution. This skill ties together happens-before reasoning, locking, and testing, and uses a hand-scheduled example to make lost updates tangible.

The skill guides the agent to first list participants (threads, coroutines, processes, replicas) and shared resources (memory cells, cache lines, rows, document versions), then ask: without mutual exclusion or synchronization, can two legal schedules yield different final states? If you cannot rule that out with a concrete argument, treat the race as an unverified risk that still needs testing.

Overview and common patterns

A race condition means the outcome of concurrent access to shared mutable state depends on scheduling order. A data race is usually the narrower term, defined by the language memory model (e.g. conflicting non-atomic accesses without synchronization). They often show up together, but fixes differ: sometimes visibility (volatile, safe publication) suffices; sometimes you need mutual exclusion or a single-writer redesign.

Check-then-act: read a condition, then mutate; another thread can run in between (e.g. “if stock > 0 then decrement”).
Non-atomic read-modify-write: counters, bitmaps, JSON documents that are read, patched, and written back.
Double-checked locking: without correct memory ordering, another thread can observe a half-initialized object.
Across await / I/O: the critical section splits at async boundaries and exposes intermediate state.
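The last pattern in the list can be reproduced deterministically in Node.js. The following is a minimal sketch, not taken from the original: `account` is a hypothetical in-memory object and `delay` stands in for any awaited I/O.

```javascript
// Hypothetical helper: stands in for a database call or HTTP request.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runUnsafeDeposits() {
  const account = { balance: 0 }; // shared mutable state

  async function deposit(amount) {
    const seen = account.balance;    // read
    await delay(1);                  // async boundary: other tasks run here
    account.balance = seen + amount; // write based on a possibly stale read
  }

  // Both deposits read balance = 0 before either one writes.
  await Promise.all([deposit(10), deposit(20)]);
  return account.balance;
}

runUnsafeDeposits().then((balance) => {
  console.log(balance); // prints 20, not 30: the first deposit was overwritten
});
```

No threads are involved; the single event loop is enough to lose an update once the read and the write sit on opposite sides of an `await`.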

Heisenbug tip: logging can change timing and hide the bug. Prefer low-overhead observation (sampling, hardware counters, race detectors) over printf, or write a reproducible “stress + assertion” test first and then narrow the logging.

Check-then-act race condition and database fix:

// Assumes `db` is a node-postgres (pg) Pool; placeholders use pg's $1, $2 style.

// ❌ check-then-act: stock deduction race (Node.js)
async function deductStock(productId, qty) {
  const stock = await db.query(
    "SELECT stock FROM products WHERE id = $1", [productId]);
  // ← two concurrent requests both see stock=1, both pass the check
  if (stock.rows[0].stock >= qty) {
    await db.query(
      "UPDATE products SET stock = stock - $1 WHERE id = $2",
      [qty, productId]);  // both deduct; stock becomes -1
  }
}

// ✅ Database-level fix: SELECT ... FOR UPDATE (pessimistic row lock)
async function deductStockSafe(productId, qty) {
  // A transaction must stay on one connection, so check out a dedicated client.
  const client = await db.connect();
  try {
    await client.query("BEGIN");
    const result = await client.query(
      "SELECT stock FROM products WHERE id = $1 FOR UPDATE",
      [productId]  // row-level lock; other transactions must wait
    );
    if (result.rows.length === 0 || result.rows[0].stock < qty) {
      // Thrown inside the try block, so the transaction is rolled back below.
      throw new Error("Product not found or out of stock");
    }
    await client.query(
      "UPDATE products SET stock = stock - $1 WHERE id = $2",
      [qty, productId]);
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();  // always return the connection to the pool
  }
}

// ✅ Atomic conditional update: the stock check and the write are one statement
async function deductStockAtomic(productId, qty) {
  const res = await db.query(
    "UPDATE products SET stock = stock - $1 WHERE id = $2 AND stock >= $3",
    [qty, productId, qty]
  );
  if (res.rowCount === 0) throw new Error("Out of stock or concurrent conflict");
}

Happens-before and visibility

Happens-before (and distributed analogues: precedence / causal order) answers: will thread B necessarily see thread A’s write? Without a happens-before (hb) edge, compilers and CPUs may reorder accesses and caches may delay them; readers may see stale values or torn reads.

Language level: lock release-acquire, thread start/join, volatile (language-specific), atomic memory orders.
Concurrent collections: “produce happens-before consume” on queues often carries visibility.
Distributed: version vectors, leases, fences; client retries with idempotency keys relate to “logically replaying the same operation” and need their own model.

When documenting, draw hb edges that must hold—e.g. between “write config” and “publish flag.” A missing edge is a candidate root cause, not an automatic “add locks everywhere.”
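The “write config, then publish flag” edge can be sketched in Node.js with a worker thread and `Atomics` on a `SharedArrayBuffer`. This is an illustration under assumptions rather than a prescribed implementation: the two-cell layout and the names `publishAndRead` and `readerSource` are hypothetical.

```javascript
const { Worker } = require("worker_threads");

// Reader thread: waits until the flag is set, then reads the config.
const readerSource = `
  const { parentPort, workerData } = require("worker_threads");
  const cells = new Int32Array(workerData);
  Atomics.wait(cells, 1, 0);                       // sleeps only while flag is still 0
  parentPort.postMessage(Atomics.load(cells, 0));  // config read after the flag
`;

function publishAndRead(configValue) {
  const shared = new SharedArrayBuffer(2 * Int32Array.BYTES_PER_ELEMENT);
  const cells = new Int32Array(shared); // cells[0] = config, cells[1] = flag
  return new Promise((resolve, reject) => {
    const reader = new Worker(readerSource, { eval: true, workerData: shared });
    reader.once("message", resolve);
    reader.once("error", reject);
    Atomics.store(cells, 0, configValue); // 1. write config
    Atomics.store(cells, 1, 1);           // 2. publish flag, strictly after step 1
    Atomics.notify(cells, 1);             // wake the reader if it is already waiting
  });
}

publishAndRead(42).then((seen) => console.log("reader saw config", seen));
```

The reader only touches the config after it has observed the flag through an atomic operation, which is the edge the paragraph above asks you to draw. Swapping steps 1 and 2 removes that edge and makes a stale read possible.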

Locks, critical sections, and async gaps

Mutexes wrap check-then-act in a critical section, at the cost of reduced parallelism and deadlock risk. Read-write locks suit read-heavy workloads, but watch for reader-to-writer upgrades, writer starvation, and cache-line bouncing. Lock-free techniques (CAS, RCU) push correctness into memory ordering, which is harder to debug; use detectors and documented invariants to constrain usage.

Keep critical sections as short as possible; avoid I/O, RPC, and slow hashing inside the lock.
Acquire locks in one global order to avoid AB-BA deadlock; define what happens when a timed lock acquisition fails.
async/await: if the same mutable state is touched before and after an await, other tasks can run in the gap. Narrow the interleaving with queues, actors, or by confining mutation to a single event-loop turn.
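One way to close the await gap described in the last item is to serialize the whole read-await-write sequence behind a promise chain. A minimal sketch, assuming a hypothetical `AsyncMutex` class and a `delay` helper that stands in for real I/O:

```javascript
// Promise-chain mutex: tasks run one at a time, in the order they were submitted.
class AsyncMutex {
  constructor() {
    this.tail = Promise.resolve();
  }

  runExclusive(task) {
    const result = this.tail.then(() => task());
    // Swallow failures on the chain itself so one rejected task
    // does not block every later task; the caller still sees the rejection.
    this.tail = result.catch(() => {});
    return result;
  }
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runSafeDeposits() {
  const account = { balance: 0 };
  const mutex = new AsyncMutex();

  const deposit = (amount) =>
    mutex.runExclusive(async () => {
      const seen = account.balance;
      await delay(1);                  // the gap is now inside the critical section
      account.balance = seen + amount;
    });

  await Promise.all([deposit(10), deposit(20)]);
  return account.balance;
}

runSafeDeposits().then((balance) => console.log(balance)); // prints 30
```

The trade-off is the one named above: the I/O now happens while the lock is held, so throughput drops to one deposit at a time. A queue or single-writer design avoids that when the await is slow.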

Optimistic concurrency (version fields, CAS): on retry, check business semantics—infinite spin, ABA, or “wrong retry causes duplicate side effects” are common second-order failures.
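A version-field retry loop with a bounded attempt count might look like the sketch below. `VersionedStore` is a hypothetical in-memory stand-in for a row with a version column, and `updateWithRetry` is an illustrative name, not an API from any library.

```javascript
// Hypothetical store: one record guarded by a version number.
class VersionedStore {
  constructor(value) {
    this.record = { value, version: 0 };
  }
  read() {
    return { ...this.record };
  }
  // Succeeds only if nobody has written since expectedVersion was read.
  compareAndSet(expectedVersion, newValue) {
    if (this.record.version !== expectedVersion) return false;
    this.record = { value: newValue, version: expectedVersion + 1 };
    return true;
  }
}

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// compute may run several times, so it must not have side effects of its own.
async function updateWithRetry(store, compute, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const snapshot = store.read();
    const next = await compute(snapshot.value);
    if (store.compareAndSet(snapshot.version, next)) return attempt;
  }
  // Bounded retries: fail loudly instead of spinning forever.
  throw new Error(`gave up after ${maxAttempts} conflicting attempts`);
}

async function demoOptimistic() {
  const store = new VersionedStore(0);
  const addOne = async (v) => {
    await delay(1); // simulated work between read and write
    return v + 1;
  };
  await Promise.all([
    updateWithRetry(store, addOne),
    updateWithRetry(store, addOne),
  ]);
  return store.read();
}

demoOptimistic().then((record) => console.log(record)); // { value: 2, version: 2 }
```

Both updates land because the loser of the first compare-and-set re-reads and recomputes. If `compute` sent an email or charged a card, the retry would repeat that side effect, which is the second-order failure the paragraph above warns about.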

Redis distributed lock (SET NX PX + Lua atomic release):

// Redis distributed lock (Node.js + ioredis)
const Redis = require("ioredis");
const { randomUUID } = require("crypto");
const redis = new Redis();

// Acquire lock: SET key value NX PX ttl
async function acquireLock(key, ttlMs = 5000) {
  const token = randomUUID();  // unique token prevents releasing someone else's lock
  const ok = await redis.set(key, token, "NX", "PX", ttlMs);
  return ok === "OK" ? token : null;
}

// Release lock: Lua script ensures check-and-delete atomicity
const RELEASE_SCRIPT = `
  if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
  else
    return 0
  end`;
async function releaseLock(key, token) {
  return redis.eval(RELEASE_SCRIPT, 1, key, token);
}

// Usage pattern
async function criticalSection(resourceId) {
  const lockKey = `lock:${resourceId}`;
  const token = await acquireLock(lockKey, 3000);
  if (!token) throw new Error("Failed to acquire lock, please retry");
  try {
    // perform critical section work
    await doWork(resourceId);
  } finally {
    await releaseLock(lockKey, token);  // must release; TTL is last resort
  }
}
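The TTL in the lock above means a holder that stalls longer than `ttlMs` can lose the lock while still believing it holds it. The “fences” mentioned earlier address this: each acquisition gets a monotonically increasing token, and the protected resource rejects writes that carry an older one. A minimal in-memory sketch, where `LeaseService` and `FencedStore` are hypothetical names and not part of ioredis:

```javascript
// Hypothetical lock service: every successful acquisition gets a larger token.
class LeaseService {
  constructor() {
    this.lastToken = 0;
  }
  acquire() {
    this.lastToken += 1;
    return this.lastToken;
  }
}

// Hypothetical resource that remembers the highest token it has accepted.
class FencedStore {
  constructor() {
    this.highestToken = 0;
    this.value = null;
  }
  write(token, value) {
    if (token < this.highestToken) return false; // stale holder: reject
    this.highestToken = token;
    this.value = value;
    return true;
  }
}

function demoFencing() {
  const leases = new LeaseService();
  const store = new FencedStore();

  const tokenA = leases.acquire(); // holder A gets token 1, then stalls past its TTL
  const tokenB = leases.acquire(); // the lock expired, so holder B gets token 2

  const bWrote = store.write(tokenB, "from B");
  const aWrote = store.write(tokenA, "from A (stale)"); // A wakes up too late

  return { bWrote, aWrote, value: store.value };
}

console.log(demoFencing()); // { bWrote: true, aWrote: false, value: 'from B' }
```

The check has to live in the resource being protected; the lock client cannot detect its own expiry reliably.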

Testing and amplification

The goal is to turn rare schedules into reliably failing runs, then align fixes with timing hypotheses.

ThreadSanitizer / race detectors: available for C/C++, Go, and others; Java relies on dedicated tools plus strict analysis modes and review.
Stress and concurrency tests: multi-thread loops, seeded random scheduling, short injected sleeps or yields to amplify interleaving.
Property-based / model checking: declare invariants for small state machines; frameworks shuffle operation sequences.
Distributed: Jepsen-style partitions and clock skew; databases use isolation levels + unique constraints as last lines of defense.

Each fix should ship a regression strategy: keep the stress test, or keep a static/detector gate—avoid “fix then delete the test” relapses.
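Seeded random scheduling can be sketched without real threads: model each task as a list of steps, let a seeded generator pick which task advances next, and check an invariant at the end. Everything here is illustrative: `lcg`, `runSchedule`, and `findViolation` are hypothetical names, and the model is the check-then-act stock deduction with stock starting at 1.

```javascript
// Small deterministic generator (linear congruential, Numerical Recipes constants),
// so a failing schedule can be replayed from its seed.
function lcg(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // value in [0, 1)
  };
}

function runSchedule(seed) {
  const rand = lcg(seed);
  const shared = { stock: 1 };

  // Each task is check-then-act, split into two separately scheduled steps.
  const makeTask = () => {
    let passed = false;
    return [
      () => { passed = shared.stock >= 1; },    // check
      () => { if (passed) shared.stock -= 1; }, // act on the earlier check
    ];
  };

  const tasks = [makeTask(), makeTask()];
  const cursor = [0, 0];
  const trace = [];
  for (;;) {
    const runnable = [0, 1].filter((i) => cursor[i] < tasks[i].length);
    if (runnable.length === 0) break;
    const pick = runnable[Math.floor(rand() * runnable.length)];
    tasks[pick][cursor[pick]]();
    cursor[pick] += 1;
    trace.push(pick);
  }
  return { seed, trace, stock: shared.stock };
}

// Invariant: stock must never go negative.
function findViolation(maxSeeds = 200) {
  for (let seed = 0; seed < maxSeeds; seed++) {
    const run = runSchedule(seed);
    if (run.stock < 0) return run;
  }
  return null;
}

console.log(findViolation()); // a run whose trace starts with both checks
```

Recording the seed of the failing run gives a reproducible regression case: `runSchedule(seed)` replays the same schedule every time.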

Go -race detector and concurrent test framework examples:

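// Go race detector usage
// go test -race ./...        # run all tests with the race detector enabled
// go run -race main.go       # enable at run time; typically 5-10x memory and 2-20x CPU, so test/staging only
// go build -race -o app_race # build a race-enabled binary

// Go concurrent safety test example
package counter

import (
    "sync"
    "testing"
)

// Counter is not shown in the original; this mutex-guarded version is an
// assumed implementation. Remove the lock and -race reports a data race on value.
type Counter struct {
    mu    sync.Mutex
    value int
}

func (c *Counter) Increment() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.value++
}

func (c *Counter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.value
}

func TestCounter_Concurrent(t *testing.T) {
    c := &Counter{}
    var wg sync.WaitGroup
    const goroutines = 100
    wg.Add(goroutines)
    for i := 0; i < goroutines; i++ {
        go func() {
            defer wg.Done()
            c.Increment()
        }()
    }
    wg.Wait()
    // with an unsynchronized Counter, go test -race reports a data race on c.value
    if got := c.Value(); got != goroutines {
        t.Errorf("got %d, want %d", got, goroutines)
    }
}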

// Node.js concurrency test (Jest): multiple Promises fire against the same row at once
// Assumes deductStockSafe rejects when stock is insufficient
test("stock deduction — no oversell", async () => {
  await db.query("UPDATE products SET stock = 1 WHERE id = 1");
  // fire 10 concurrent deduction requests; only 1 should succeed
  const results = await Promise.allSettled(
    Array.from({ length: 10 }, () => deductStockSafe(1, 1))
  );
  const succeeded = results.filter(r => r.status === "fulfilled").length;
  expect(succeeded).toBe(1);  // assert exactly 1 success
  const stock = await db.query("SELECT stock FROM products WHERE id = 1");
  expect(stock.rows[0].stock).toBe(0);  // must not be negative
});

Debug flow

Follow a fixed order. Reshuffling lock layers before the shared state is understood tends to introduce deadlocks or mask the real interleaving.

  [ List shared mutable state + all readers/writers ]
                    │
                    ▼
         [ Mark non-atomic compounds: check-then-act / RMW / across await ]
                    │
                    ▼
    [ Derive happens-before: which writes must be visible to which reads? Missing edge? ]
                    │
           ┌────────┴────────┐
           ▼                 ▼
    [ Choose sync primitive ]   [ Structural fix: immutable / single writer / queue ]
           │                 │
           └────────┬────────┘
                    ▼
         [ TSAN / stress / property tests amplify and gate ]
                    │
                    ▼
              [ Document invariants and legal scheduling assumptions ]

Interleaving lab and checklist

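Below, initial x = 0. Thread A reads x, then writes the value it read + 1. Thread B reads x, then writes the value it read + 2. Running either thread to completion before the other ends with x = 3, while an interleaving in which both threads read before either writes ends with x = 1 or x = 2, because the later write overwrites the earlier one. This is a useful way to explain lost updates to colleagues new to concurrency.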

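The schedules can also be replayed in a few lines of Node.js. This is a sketch of the hand-scheduled example, with `replay` and the preset names chosen for illustration; each thread is modelled as two steps, a read into a private register followed by a write.

```javascript
// Replays one schedule. `order` lists which thread takes its next step:
// a thread's first appearance is its read, its second appearance is its write.
function replay(order) {
  let x = 0;
  const deltas = { A: 1, B: 2 };
  const register = {}; // each thread's private copy of x
  const trace = [];
  for (const thread of order) {
    if (!(thread in register)) {
      register[thread] = x;
      trace.push(`${thread} reads ${x}`);
    } else {
      x = register[thread] + deltas[thread];
      trace.push(`${thread} writes ${x}`);
    }
  }
  return { finalX: x, trace };
}

const presets = {
  "serial: A then B": ["A", "A", "B", "B"],
  "serial: B then A": ["B", "B", "A", "A"],
  "interleaved: B writes last": ["A", "B", "A", "B"],
  "interleaved: A writes last": ["A", "B", "B", "A"],
};

for (const [name, order] of Object.entries(presets)) {
  const { finalX, trace } = replay(order);
  console.log(`${name}: ${trace.join(", ")} -> final x = ${finalX}`);
}
// serial schedules end with x = 3; the interleaved ones end with x = 2 and x = 1
```

In both interleaved presets each thread computes from x = 0, so whichever write lands last erases the other thread's update.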

Field checklist:

Listed all shared mutable state and access sites (including re-entry after async boundaries).
Annotated happens-before for critical read/write pairs, or explicitly “no guarantee yet.”
Reproduced with a detector or stress test; minimal assertion/logs match scheduling hypothesis.
After fix: reviewed critical-section size, lock order, retry semantics; added regression gate.

---
name: race-condition
description: Timing, happens-before, locks, and data-race detection flow
---
# Steps
1. Mark shared mutable state, readers/writers, and gaps across await
2. Derive happens-before; if an edge is missing, add sync or structural elimination
3. Implement invariants with locks/atomics/queues; gate with TSAN, stress, or property tests
