Mutation testing
Deliberately break code to see if tests assert behavior: tool commands and timeouts, scoped runs, triaging survivors, and how this pairs with coverage and CI layering.
Mutation testing answers the question “if the implementation quietly regresses, will the tests fail?” It complements coverage: high coverage with many surviving mutants usually points to weak assertions, happy-path-only tests, or equivalent mutants (changes that do not alter observable behavior).
The SKILL should name the tool (e.g. Stryker for JS/TS, PIT for Java), default operator sets, timeouts and parallelism, report paths, and how to record “equivalent mutants” in ignore rules or comments so the team does not tune out noise.
Goals and boundaries
Mutation testing does not replace unit tests; it measures how sensitive they are to small faults. It also does not demand a 100% kill rate, because equivalent mutants and broad operator sets make that disproportionately expensive.
- Good fit: core business modules, high-churn dirs, frequent regressions with thin behavior assertions.
- Use carefully: generated code, thin DTO shells, pure data classes (unless operators are tailored).
- With coverage: high coverage + many survivors → strengthen observable-behavior assertions before adding more unasserted branches.
Remember: metrics should drive tests and design, not dashboard vanity at unmaintainable scope.
Recommended workflow
[ Choose scope ]
package / directory / diff (incremental) + exclude generated & third-party
│
▼
[ Configure tool ]
operator subset · timeout · parallelism · report format · equivalent-ignore policy
│
▼
┌─────────────────┐ PR: incremental / sample · nightly: full or wider
│ Run mutations   │──── fail build or upload-only (team choice)
└─────────────────┘
│
▼
┌─────────────────┐ classify: test gap / equivalent / too-wide op / real bug
│ Survivor triage │
└─────────────────┘
│
▼
[ Deliver ] new tests/assertions · issue linked to requirement · updated ignores + rationale
Stabilize scope and config before chasing kill rate; otherwise reports are incomparable and teams ignore survivor lists.
Tools and command notes
Put copy-paste command templates (with config paths) in the SKILL, and note that Node/JVM versions and CI images must match the local setup to avoid “works on my laptop.”
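Complete Stryker configuration file (stryker.conf.json). The // comments are annotations for the reader; strip them before use, since strict JSON does not allow comments: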
{
"$schema": "https://stryker-mutator.io/schemas/stryker-core.json",
"testRunner": "jest",
"jest": {
"projectType": "custom",
"configFile": "jest.config.ts"
},
"mutate": [
"src/**/*.ts",
"!src/**/*.test.ts",
"!src/**/*.spec.ts",
"!src/**/*.d.ts",
"!src/generated/**",
"!src/mocks/**"
],
"checkers": ["typescript"],
"mutator": {
"excludedMutations": [
"StringLiteral", // String literal mutations (too noisy)
"ObjectLiteral" // Object literal mutations (high false-positive rate)
]
},
"reporters": ["html", "json", "progress"],
"htmlReporter": {
"fileName": "reports/mutation/mutation.html"
},
"jsonReporter": {
"fileName": "reports/mutation/mutation.json"
},
"thresholds": {
"high": 80, // Mutation score >= 80% → green
"low": 60, // Mutation score 60-80% → yellow warning
"break": 50 // Mutation score < 50% → CI fails
},
"timeoutMS": 10000, // Per-mutant timeout: 10 seconds
"concurrency": 4, // Parallel test processes
"coverageAnalysis": "perTest", // Only run tests covering the mutated line (3-5x speedup)
"tempDirName": ".stryker-tmp",
"cleanTempDir": true
}
Mutation type comparison (with code examples):
Type 1: Boundary condition (Stryker: EqualityOperator)
Original: if (amount >= 100) return discount;
Mutated: if (amount > 100) return discount; // ← >= changed to >
Meaning: Tests must assert behavior when amount === 100
Type 2: Logical operator (Stryker: LogicalOperator)
Original: if (isVip && !isBlocked) return premium;
Mutated: if (isVip || !isBlocked) return premium; // ← && changed to ||
Meaning: Tests must cover scenario: isVip=true but isBlocked=true
Type 3: Return value (PIT: return-value mutators such as PRIMITIVE_RETURNS; Stryker has no dedicated return-value mutator, and its BlockStatement mutator empties the function body instead)
Original: function getDiscount() { return 0.2; }
Mutated: function getDiscount() { return 0; } // ← return value changed to zero
Meaning: Tests must assert the return value, not just "was called"
Type 4: Statement deletion (PIT: VOID_METHOD_CALLS removes a single call; Stryker's BlockStatement empties the whole block)
Original: metrics.increment('order.created'); return orderId;
Mutated: return orderId; // ← metrics.increment deleted
Meaning: Tests must verify the metrics call; otherwise monitoring code can be silently removed
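To make the boundary case concrete, here is a minimal sketch of how a suite "kills" or misses a mutant. The discount rule, the names (original, boundaryMutant, suiteKills), and the test values are hypothetical and exist only for illustration; a real tool generates the mutant and runs the actual test runner.

```typescript
// Hypothetical discount rule used only for illustration.
type DiscountFn = (amount: number) => number;

// Original: orders of 100 or more get a 10% discount.
const original: DiscountFn = (amount) => (amount >= 100 ? 0.1 : 0);

// Mutant: the boundary operator `>=` has been changed to `>`.
const boundaryMutant: DiscountFn = (amount) => (amount > 100 ? 0.1 : 0);

// A test case is an input plus the expected observable result.
interface Case {
  amount: number;
  expected: number;
}

// Returns true when at least one case fails against `impl`,
// i.e. the suite detects (kills) that implementation.
function suiteKills(impl: DiscountFn, cases: Case[]): boolean {
  return cases.some((c) => impl(c.amount) !== c.expected);
}

// Suite without the boundary value: passes on both versions,
// so the mutant survives.
const weakSuite: Case[] = [
  { amount: 50, expected: 0 },
  { amount: 150, expected: 0.1 },
];

// Adding amount === 100 distinguishes the two implementations.
const strongSuite: Case[] = [...weakSuite, { amount: 100, expected: 0.1 }];

console.log(suiteKills(boundaryMutant, weakSuite)); // false: mutant survives
console.log(suiteKills(boundaryMutant, strongSuite)); // true: mutant is killed
console.log(suiteKills(original, strongSuite)); // false: original still passes
```

The weak suite has full line coverage of the rule yet cannot tell the two versions apart, which is the gap a surviving boundary mutant points to.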
Mutation score interpretation:
Mutation score = killed mutants / (total mutants - equivalent mutants) × 100%
- > 80% → Good: test suite is highly sensitive to code changes; fits core business modules
- 60-80% → Fair: most errors caught, room to improve
- < 60% → Poor: many errors could slip past undetected
Practical examples:
- 85% (Good) = Payment module: all boundary conditions and error paths have assertions
- 70% (Fair) = User module: basic logic covered, but error branches and boundaries are weakly asserted
- 45% (Poor) = Report module: tests only assert "function was called", not output correctness
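The formula and bands can be sketched as a small helper. The function names (mutationScore, rate) are hypothetical, and the sketch assumes equivalent mutants have already been identified by hand; tools may compute their own score slightly differently (for example by counting timeouts as detected).

```typescript
type Rating = "Good" | "Fair" | "Poor";

// killed / (total - equivalent) * 100, following the formula in the text.
function mutationScore(killed: number, total: number, equivalent = 0): number {
  const denominator = total - equivalent;
  if (denominator <= 0) {
    throw new RangeError("total must exceed the number of equivalent mutants");
  }
  if (killed < 0 || killed > denominator) {
    throw new RangeError("killed must be between 0 and the effective denominator");
  }
  return (killed / denominator) * 100;
}

// Bands from the text: above 80 is Good, 60 to 80 is Fair, below 60 is Poor.
function rate(score: number): Rating {
  if (score > 80) return "Good";
  if (score >= 60) return "Fair";
  return "Poor";
}

// 60 mutants, 5 judged equivalent, 42 killed: effective denominator is 55.
const score = mutationScore(42, 60, 5);
console.log(score.toFixed(1), rate(score)); // 76.4 Fair
```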
Identifying tests that "pass but assert nothing meaningful":
// Signal: mutant survived (tests pass but mutant lives)
// What it means: tests exist, but even when the code is broken, tests still pass
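// Example: a mutant that changes the return value survived
// Original function:
function calculateTax(amount: number) { return amount * 0.1; }
// Mutated version (illustrative; Stryker's closest mutators are ArithmeticOperator, e.g. amount / 0.1, and BlockStatement, which empties the body):
function calculateTax(amount: number) { return 0; } // ← returns 0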
// Test code (why the mutant survived):
it('should calculate tax', () => {
const spy = jest.spyOn(taxService, 'calculateTax');
checkout.process(100);
expect(spy).toHaveBeenCalledWith(100); // ← only asserts the argument, not the return value!
// Mutated version returns 0 and this test still passes → mutant survives
});
// Fix: add assertion on the return value
it('should calculate 10% tax', () => {
expect(calculateTax(100)).toBe(10); // ← assert actual return value
expect(calculateTax(0)).toBe(0); // ← boundary condition
});
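// Run mutation testing incrementally (PR scenario). StrykerJS has no --since flag (that option belongs to Stryker.NET);
// incremental mode reuses the previous run's results and re-tests only mutants affected by changes:
// npx stryker run --incremental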
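Another way to scope a PR run is to pass the changed files explicitly. The sketch below assumes the list of changed paths comes from something like git diff --name-only origin/main...HEAD and that the result is handed to Stryker's --mutate option as a comma-separated list. The helper names and exclusion patterns are hypothetical and simply mirror the mutate globs in the config.

```typescript
// Patterns mirror the exclusions in the config; adjust to the project layout.
const EXCLUDED: RegExp[] = [
  /\.test\.ts$/,
  /\.spec\.ts$/,
  /\.d\.ts$/,
  /^src\/generated\//,
  /^src\/mocks\//,
];

// Keeps changed TypeScript sources under src/ that are worth mutating.
function selectMutationTargets(changedFiles: string[]): string[] {
  return changedFiles
    .map((f) => f.trim())
    .filter((f) => /^src\/.+\.ts$/.test(f))
    .filter((f) => !EXCLUDED.some((re) => re.test(f)));
}

// Builds CLI arguments, or returns null when nothing qualifies so the CI job
// can skip the run instead of falling back to mutating the whole project.
function buildStrykerArgs(changedFiles: string[]): string[] | null {
  const targets = selectMutationTargets(changedFiles);
  return targets.length === 0 ? null : ["run", "--mutate", targets.join(",")];
}

const args = buildStrykerArgs([
  "src/cart/price.ts",
  "src/cart/price.test.ts",
  "src/generated/api.ts",
  "README.md",
]);
console.log(args); // [ 'run', '--mutate', 'src/cart/price.ts' ]
```

Returning null for an empty selection is a deliberate choice: a docs-only PR should not trigger a full mutation run by accident.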
| Ecosystem | Tool | Agent should record |
|---|---|---|
| JS / TS | Stryker | stryker.conf, test runner, mutate glob, thresholds |
| Java / Kotlin (JVM) | PIT, etc. | Target classes, exclusions, incremental mode, build plugin wiring |
| Other | Language-specific | Incremental support, HTML report path, Bazel/Gradle integration |
- Set explicit timeouts and retry policy for long jobs so they do not clog CI.
- Artifact retention: days kept, main-only archives if applicable.
Scope, performance, and CI
Default to incremental mutation on changed code; full runs on nightly or pre-release. Exclude dist, generated, lockfiles, snapshot dirs.
- PR gates: e.g. “no new high-risk survivors” or “score not below baseline”—not necessarily full kill each time.
- Equivalent mutants: document in config or adjacent comments; audit ignore lists so real risk is not permanently muted.
- Parallelism and sharding: split large repos by module to see whether slowness is tests or the mutation engine.
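A "no new survivors" gate can be sketched as a comparison between a baseline report and the current one. The field names below (files, mutants, status, mutatorName, location.start.line) are an assumption based on the JSON report shape that Stryker's json reporter emits; verify them against the schema version in use. The function names are hypothetical.

```typescript
// Minimal subset of the report shape (assumed; check your tool's schema).
interface Mutant {
  id: string;
  mutatorName: string;
  status: string; // e.g. "Killed", "Survived", "Timeout", "NoCoverage"
  location: { start: { line: number; column: number } };
}

interface Report {
  files: Record<string, { mutants: Mutant[] }>;
}

// Builds a key per surviving mutant that does not depend on run-specific ids.
function survivorKeys(report: Report): string[] {
  const keys: string[] = [];
  for (const file of Object.keys(report.files)) {
    for (const m of report.files[file].mutants) {
      if (m.status === "Survived") {
        keys.push(`${file}:${m.location.start.line}:${m.mutatorName}`);
      }
    }
  }
  return keys;
}

// Survivors present in the current report but absent from the baseline.
function newSurvivors(baseline: Report, current: Report): string[] {
  const known = survivorKeys(baseline);
  return survivorKeys(current)
    .filter((key) => known.indexOf(key) === -1)
    .sort();
}

// The PR gate passes only when the change introduces no new survivors.
function gatePasses(baseline: Report, current: Report): boolean {
  return newSurvivors(baseline, current).length === 0;
}
```

Keys based on line numbers shift when code moves, so a real gate would likely need a more stable identity (for example file plus mutated source text); this sketch only shows the comparison itself.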
Reading survived mutants
Each survivor should bucket into one of the following with a traceable review note.
- Weak tests: add assertions, split cases, cover errors and boundaries.
- Equivalent mutant: semantics unchanged → ignore in config + comment “why equivalent.”
- Operator too broad: meaningless for this style → narrow the mutate scope or disable a subset.
- Product / defect discussion: mutant exposes behavior vs requirement mismatch → open an issue; do not greenwash tests.
Do not weaken assertions into implementation-detail checks or over-mock just to improve the score; align with externally observable behavior and contracts.
Output spec for agents
When analyzing mutation reports, output should support human merge decisions—not a raw HTML dump.
- Summary: scope, tool version, total mutants, killed, mutation score if provided.
- Top survivors: sorted by module/risk; each with file, line, mutant type, action (add test / equivalent / ignore / confirm requirement).
- Requirement links: IDs or stories for review cross-check.
- Follow-up: one local repro command and config path.
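The output spec could be produced from a JSON report roughly as follows. This is a sketch under assumptions: the report fields mirror the JSON shape emitted by Stryker's json reporter, the triage map (mutant id to chosen action) is a hypothetical input maintained by reviewers, and the score uses this document's formula rather than any tool's built-in calculation.

```typescript
interface Mutant {
  id: string;
  mutatorName: string;
  status: string; // e.g. "Killed", "Survived", "Timeout"
  location: { start: { line: number; column: number } };
}

interface Report {
  files: Record<string, { mutants: Mutant[] }>;
}

type Action = "add test" | "equivalent" | "ignore" | "confirm requirement";

interface SurvivorRow {
  file: string;
  line: number;
  mutator: string;
  action: Action | "untriaged";
}

interface Summary {
  total: number;
  killed: number;
  score: number | null; // null when no mutants remain in the denominator
  survivors: SurvivorRow[];
}

function summarize(
  report: Report,
  triage: Record<string, Action | undefined> = {}
): Summary {
  let total = 0;
  let killed = 0;
  let equivalent = 0;
  const survivors: SurvivorRow[] = [];

  for (const file of Object.keys(report.files)) {
    for (const m of report.files[file].mutants) {
      total += 1;
      if (m.status === "Killed") killed += 1;
      if (m.status !== "Survived") continue;
      const action = triage[m.id] ?? "untriaged";
      if (action === "equivalent") equivalent += 1;
      survivors.push({
        file,
        line: m.location.start.line,
        mutator: m.mutatorName,
        action,
      });
    }
  }

  // killed / (total - equivalent) * 100, as defined in this document.
  const denominator = total - equivalent;
  const score = denominator > 0 ? (killed / denominator) * 100 : null;

  // Group by file, then by line, so reviewers can scan module by module.
  survivors.sort((a, b) => a.file.localeCompare(b.file) || a.line - b.line);
  return { total, killed, score, survivors };
}
```

Keeping the summary as plain data leaves rendering (Markdown table, PR comment, JSON artifact) to whichever channel the team reviews in.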
Mutation score estimate
Formula: killed ÷ (total mutants − equivalent) × 100%. With no equivalent mutants this reduces to killed / total. For example, an effective denominator of 55 with 13 survivors (42 killed) gives a mutation score of about 76.4%.
---
name: mutation-testing
description: Run mutation testing and triage survivors with test suggestions
---
# Essentials
# Step 1: Install and initialize Stryker
npm install --save-dev @stryker-mutator/core @stryker-mutator/jest-runner
npx stryker init # generates stryker.conf.json
# Step 2: Configure key fields (stryker.conf.json)
mutate: ["src/**/*.ts", "!src/**/*.test.ts", "!src/generated/**"]
thresholds: { high: 80, low: 60, break: 50 }
coverageAnalysis: "perTest" # Only run tests covering the mutated point (3-5x speedup)
concurrency: 4
timeoutMS: 10000
# Step 3: Run incrementally on changed code (PR scenario; StrykerJS has no --since flag, which is a Stryker.NET option)
npx stryker run --incremental
# Step 4: Identify meaningless assertions (common causes of survived mutants)
Signal: return-value mutant survived → test only asserts "was called", not the return value
Signal: EqualityOperator mutant survived → missing boundary value test (e.g. amount === 100)
Fix: replace expect(spy).toHaveBeenCalled() with expect(result).toBe(10)
# Step 5: Handle equivalent mutants
// Stryker disable next-line ArrowFunction: noop by design, mutation is meaningless
const noop = () => {};
# Step 6: Score interpretation
> 80% → Good (suitable for payment/auth core modules)
60-80% → Fair (add boundary and error path tests)
< 60% → Poor (tests only cover happy path with weak assertions)
# Step 7: CI layering strategy
PR: npx stryker run --incremental (re-tests only mutants affected by changes)
nightly: npx stryker run (full run, archive reports for 30 days)
gate: break: 50 (CI fails if score drops below 50%)