Evaluation & benchmarks
This page provides the complete JSON format for eval test cases (input/expected/eval_fn/tags/metadata), three eval function implementations (exact_match/regex_match/llm_judge), a GitHub Actions CI integration snippet, and a JSON report format for displaying eval results in PRs.
Complete JSON format for an eval test case—each case contains input/expected/eval_fn/tags/metadata:
{
"id": "tc_001",
"input": {
"messages": [
{"role": "user", "content": "What is the weather in London today?"}
],
"tools": ["get_weather"]
},
"expected": {
"tool_calls": [
{"name": "get_weather", "arguments": {"city": "London"}}
],
"output_contains": ["temperature", "weather"],
"output_not_contains": ["I cannot", "I don't know"]
},
"eval_fn": "tool_call_match",
"tags": ["smoke", "tool-use", "weather"],
"metadata": {
"created_by": "alice@example.com",
"created_at": "2026-04-01",
"last_reviewed": "2026-04-10",
"model_version": "gpt-4o-2024-11-20",
"seed": 42,
"pii_level": "none",
"license": "internal"
}
}
[ Case set / version & seed pinned ]
│
▼
[ Runner batch execute ]
│
┌────┴────┐
▼ ▼
[ Rule score ] [ Judge / spot check ]
│ │
└────┬────┘
▼
[ Aggregate metrics & threshold gates ]
│
┌────┴────────┐
▼ ▼
[ Trends & baseline ] [ Failed artifact archive ]
Three eval function implementations
import re, json
from openai import OpenAI
client = OpenAI()
# 1. exact_match: output must equal expected exactly (for structured output)
def exact_match(actual: str, expected: str) -> dict:
passed = actual.strip() == expected.strip()
return {"passed": passed, "score": 1.0 if passed else 0.0,
"reason": "exact match" if passed else f"expected: {expected!r}, got: {actual!r}"}
# 2. regex_match: check whether output contains/excludes specific patterns
def regex_match(actual: str, rules: dict) -> dict:
"""
rules format:
{
"must_contain": ["pattern1", "pattern2"],
"must_not_contain": ["bad_pattern"],
}
"""
failures = []
for pattern in rules.get("must_contain", []):
if not re.search(pattern, actual, re.IGNORECASE):
failures.append(f"missing required pattern: {pattern!r}")
for pattern in rules.get("must_not_contain", []):
if re.search(pattern, actual, re.IGNORECASE):
failures.append(f"contains forbidden pattern: {pattern!r}")
passed = len(failures) == 0
return {"passed": passed, "score": 1.0 if passed else 0.0,
"failures": failures}
# 3. llm_judge: use GPT-4o to judge output quality (for natural-language output)
def llm_judge(
question: str,
actual: str,
criteria: str,
model: str = "gpt-4o",
) -> dict:
"""
Use an LLM to judge whether the output meets the evaluation criteria.
Returns score (0-1) and reason.
"""
prompt = f"""Please evaluate whether the following answer meets the criteria.
Question: {question}
Answer: {actual}
Evaluation criteria: {criteria}
Respond in JSON format:
{{"score": 0-10, "passed": true/false, "reason": "one-sentence justification"}}
Output JSON only, nothing else."""
resp = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0,
response_format={"type": "json_object"},
)
result = json.loads(resp.choices[0].message.content)
result["score"] = result["score"] / 10.0 # normalize to 0-1
return result
# Usage example
case = {
"question": "What is the weather in London today?",
"actual": "London today is 18°C, sunny with good air quality.",
"rules": {"must_contain": ["temperature|°C|degrees", "weather|sunny|cloudy|rain"],
"must_not_contain": ["I don't know", "unable to answer"]}
}
print(regex_match(case["actual"], case["rules"]))
# {"passed": True, "score": 1.0, "failures": []}
Metrics and scoring
Show primary metrics (task success, tool-call correctness) separately from secondary ones (p95 latency, tokens, cost); judge scores should report confidence intervals or two-rater agreement. Use the toggle below for operational definitions of common metrics.
- Task success: meets expected behavior and triggers no forbidden outcome.
- Tool correctness: call name, parameter schema, and business preconditions pass static or replay checks.
- Judge agreement score: two judges or repeated samples under variance threshold count as a valid example.
Pass-rate calculator (passed / total)
CI integration and PR eval report
GitHub Actions workflow snippet—run smoke suite on PRs, full suite nightly:
# .github/workflows/eval.yml
name: Agent Eval
on:
pull_request:
branches: [main]
schedule:
- cron: "0 2 * * *" # Run full suite daily at UTC 02:00
jobs:
eval-smoke:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -r requirements-eval.txt
- name: Run smoke eval
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EVAL_MODEL: "gpt-4o-2024-11-20"
EVAL_SEED: "42"
run: |
python -m eval.runner \
--suite smoke \
--threshold 0.90 \
--output eval-report.json
- name: Post PR comment with eval results
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
const emoji = report.passed ? '✅' : '❌';
const body = `## ${emoji} Eval Report\n\n` +
`| Metric | This run | Baseline | Delta |\n` +
`|--------|----------|----------|-------|\n` +
`| Pass rate | ${report.pass_rate}% | ${report.baseline_pass_rate}% | ${report.delta > 0 ? '+' : ''}${report.delta}% |\n` +
`| Tool accuracy | ${report.tool_accuracy}% | - | - |\n` +
`| Latency p95 | ${report.latency_p95_ms}ms | - | - |\n\n` +
(report.failed_cases.length > 0
? `**Failed cases (top 3):**\n` + report.failed_cases.slice(0,3).map(c =>
`- \`${c.id}\`: ${c.reason}`).join('\n')
: 'All cases passed');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body
});
- name: Fail if below threshold
run: python -c "import json,sys; r=json.load(open('eval-report.json')); sys.exit(0 if r['passed'] else 1)"
- uses: actions/upload-artifact@v4
if: always()
with:
name: eval-report
path: eval-report.json
eval-full:
if: github.event_name == 'schedule'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: {python-version: "3.12"}
- run: pip install -r requirements-eval.txt
- run: python -m eval.runner --suite full --threshold 0.85 --output eval-full-report.json
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- uses: actions/upload-artifact@v4
if: always()
with:
name: eval-full-report-${{ github.run_id }}
path: eval-full-report.json
JSON report format for the PR regression gate:
{
"suite": "smoke",
"model": "gpt-4o-2024-11-20",
"seed": 42,
"run_at": "2026-04-11T10:30:00Z",
"total": 50,
"passed_count": 47,
"pass_rate": 94.0,
"baseline_pass_rate": 92.0,
"delta": 2.0,
"passed": true,
"threshold": 90.0,
"tool_accuracy": 96.0,
"latency_p95_ms": 1230,
"failed_cases": [
{
"id": "tc_023",
"eval_fn": "llm_judge",
"score": 0.4,
"reason": "Answer lacks specific values; only qualitative description provided",
"actual": "The weather is nice",
"expected_pattern": "temperature|°C|degrees"
}
],
"artifacts": {
"traces": "gs://eval-artifacts/run-20260411/traces/",
"prompts": "gs://eval-artifacts/run-20260411/prompts/"
}
}
Golden tests and snapshots
Golden paths use fixed inputs and expected trajectories (tool sequences, key intermediate fields); for natural-language output prefer "structured subset + loose matching" instead of whole-string diff. Snapshot updates need two-person review or an automated PR label—avoid silent drift.
---
name: eval-harness-ci
description: Design agent eval sets and regression gates; input: feature description or existing case set; output: tiered case set + CI workflow; prohibit: using production PII data as test cases
version: "1.1.0"
triggers:
- "how to eval.*agent|design.*eval.*(set|suite)"
- "CI.*eval|regression.*gate|eval.*harness"
steps:
1. Tier by purpose: smoke (<5min) / regression (<60min) / adversarial (ad hoc)
2. Each case contains six fields: id/input/expected/eval_fn/tags/metadata
3. metadata must include model_version/seed/pii_level/license
4. Implement exact_match: string equality, for structured JSON output
5. Implement regex_match: must_contain/must_not_contain rule checks
6. Implement llm_judge: returns score(0-1)/passed/reason, temperature=0
7. Calibrate llm_judge: run each case 3 times; flag as flaky if variance > 0.2
8. Write GitHub Actions workflow: PR triggers smoke, schedule triggers full
9. PR comment shows pass rate, tool accuracy, p95 latency, delta vs baseline
10. Fail CI and block merge if pass rate < threshold (smoke: 90%, full: 85%)
11. Upload failed artifacts (traces/prompts paths must be stable)
12. Quarantine flaky cases; exclude from pass rate but track as tech debt
13. Bump version when dataset changes; PR report shows both old and new curves
constraints:
- Do NOT use production PII data; redact or use synthetic data
- Do NOT hardcode API keys in CI (use GitHub Secrets)
- Nightly full suite and MR smoke suite use separate quotas; never share