Evaluation & benchmarks

This page provides the complete JSON format for eval test cases (input/expected/eval_fn/tags/metadata), three eval function implementations (exact_match/regex_match/llm_judge), a GitHub Actions CI integration snippet, and a JSON report format for displaying eval results in PRs.

Complete JSON format for an eval test case—each case contains input/expected/eval_fn/tags/metadata:

{
  "id": "tc_001",
  "input": {
    "messages": [
      {"role": "user", "content": "What is the weather in London today?"}
    ],
    "tools": ["get_weather"]
  },
  "expected": {
    "tool_calls": [
      {"name": "get_weather", "arguments": {"city": "London"}}
    ],
    "output_contains": ["temperature", "weather"],
    "output_not_contains": ["I cannot", "I don't know"]
  },
  "eval_fn": "tool_call_match",
  "tags": ["smoke", "tool-use", "weather"],
  "metadata": {
    "created_by": "alice@example.com",
    "created_at": "2026-04-01",
    "last_reviewed": "2026-04-10",
    "model_version": "gpt-4o-2024-11-20",
    "seed": 42,
    "pii_level": "none",
    "license": "internal"
  }
}

[ Case set / version & seed pinned ]
              │
              ▼
        [ Runner batch execute ]
              │
         ┌────┴────┐
         ▼         ▼
   [ Rule score ]  [ Judge / spot check ]
         │         │
         └────┬────┘
              ▼
    [ Aggregate metrics & threshold gates ]
              │
         ┌────┴────────┐
         ▼             ▼
  [ Trends & baseline ]  [ Failed artifact archive ]

Three eval function implementations

import re, json
from openai import OpenAI

client = OpenAI()

# 1. exact_match: output must equal expected exactly (for structured output)
def exact_match(actual: str, expected: str) -> dict:
    passed = actual.strip() == expected.strip()
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "reason": "exact match" if passed else f"expected: {expected!r}, got: {actual!r}"}

# 2. regex_match: check whether output contains/excludes specific patterns
def regex_match(actual: str, rules: dict) -> dict:
    """
    rules format:
      {
        "must_contain": ["pattern1", "pattern2"],
        "must_not_contain": ["bad_pattern"],
      }
    """
    failures = []
    for pattern in rules.get("must_contain", []):
        if not re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"missing required pattern: {pattern!r}")
    for pattern in rules.get("must_not_contain", []):
        if re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"contains forbidden pattern: {pattern!r}")
    passed = len(failures) == 0
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "failures": failures}

# 3. llm_judge: use GPT-4o to judge output quality (for natural-language output)
def llm_judge(
    question: str,
    actual: str,
    criteria: str,
    model: str = "gpt-4o",
) -> dict:
    """
    Use an LLM to judge whether the output meets the evaluation criteria.
    Returns score (0-1) and reason.
    """
    prompt = f"""Please evaluate whether the following answer meets the criteria.

Question: {question}

Answer: {actual}

Evaluation criteria: {criteria}

Respond in JSON format:
{{"score": 0-10, "passed": true/false, "reason": "one-sentence justification"}}
Output JSON only, nothing else."""

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    result["score"] = result["score"] / 10.0  # normalize to 0-1
    return result

# Usage example
case = {
    "question": "What is the weather in London today?",
    "actual": "London today is 18°C, sunny with good air quality.",
    "rules": {"must_contain": ["temperature|°C|degrees", "weather|sunny|cloudy|rain"],
              "must_not_contain": ["I don't know", "unable to answer"]}
}
print(regex_match(case["actual"], case["rules"]))
# {"passed": True, "score": 1.0, "failures": []}

Metrics and scoring

Show primary metrics (task success, tool-call correctness) separately from secondary ones (p95 latency, tokens, cost); judge scores should report confidence intervals or two-rater agreement. Use the toggle below for operational definitions of common metrics.

Pass-rate calculator (passed / total)

Passed Total 84.0%

CI integration and PR eval report

GitHub Actions workflow snippet—run smoke suite on PRs, full suite nightly:

# .github/workflows/eval.yml
name: Agent Eval

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"   # Run full suite daily at UTC 02:00

jobs:
  eval-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      - name: Run smoke eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVAL_MODEL: "gpt-4o-2024-11-20"
          EVAL_SEED: "42"
        run: |
          python -m eval.runner \
            --suite smoke \
            --threshold 0.90 \
            --output eval-report.json
      - name: Post PR comment with eval results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
            const emoji = report.passed ? '✅' : '❌';
            const body = `## ${emoji} Eval Report\n\n` +
              `| Metric | This run | Baseline | Delta |\n` +
              `|--------|----------|----------|-------|\n` +
              `| Pass rate | ${report.pass_rate}% | ${report.baseline_pass_rate}% | ${report.delta > 0 ? '+' : ''}${report.delta}% |\n` +
              `| Tool accuracy | ${report.tool_accuracy}% | - | - |\n` +
              `| Latency p95 | ${report.latency_p95_ms}ms | - | - |\n\n` +
              (report.failed_cases.length > 0
                ? `**Failed cases (top 3):**\n` + report.failed_cases.slice(0,3).map(c =>
                    `- \`${c.id}\`: ${c.reason}`).join('\n')
                : 'All cases passed');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
      - name: Fail if below threshold
        run: python -c "import json,sys; r=json.load(open('eval-report.json')); sys.exit(0 if r['passed'] else 1)"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json

  eval-full:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install -r requirements-eval.txt
      - run: python -m eval.runner --suite full --threshold 0.85 --output eval-full-report.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-full-report-${{ github.run_id }}
          path: eval-full-report.json

JSON report format for the PR regression gate:

{
  "suite": "smoke",
  "model": "gpt-4o-2024-11-20",
  "seed": 42,
  "run_at": "2026-04-11T10:30:00Z",
  "total": 50,
  "passed_count": 47,
  "pass_rate": 94.0,
  "baseline_pass_rate": 92.0,
  "delta": 2.0,
  "passed": true,
  "threshold": 90.0,
  "tool_accuracy": 96.0,
  "latency_p95_ms": 1230,
  "failed_cases": [
    {
      "id": "tc_023",
      "eval_fn": "llm_judge",
      "score": 0.4,
      "reason": "Answer lacks specific values; only qualitative description provided",
      "actual": "The weather is nice",
      "expected_pattern": "temperature|°C|degrees"
    }
  ],
  "artifacts": {
    "traces": "gs://eval-artifacts/run-20260411/traces/",
    "prompts": "gs://eval-artifacts/run-20260411/prompts/"
  }
}

Golden tests and snapshots

Golden paths use fixed inputs and expected trajectories (tool sequences, key intermediate fields); for natural-language output prefer "structured subset + loose matching" instead of whole-string diff. Snapshot updates need two-person review or an automated PR label—avoid silent drift.

Note: do not byte-snapshot long natural language when judge or sampling temperature is not locked; assert JSON fields and citation integrity first.

---
name: eval-harness-ci
description: Design agent eval sets and regression gates; input: feature description or existing case set; output: tiered case set + CI workflow; prohibit: using production PII data as test cases
version: "1.1.0"
triggers:
  - "how to eval.*agent|design.*eval.*(set|suite)"
  - "CI.*eval|regression.*gate|eval.*harness"
steps:
  1. Tier by purpose: smoke (<5min) / regression (<60min) / adversarial (ad hoc)
  2. Each case contains six fields: id/input/expected/eval_fn/tags/metadata
  3. metadata must include model_version/seed/pii_level/license
  4. Implement exact_match: string equality, for structured JSON output
  5. Implement regex_match: must_contain/must_not_contain rule checks
  6. Implement llm_judge: returns score(0-1)/passed/reason, temperature=0
  7. Calibrate llm_judge: run each case 3 times; flag as flaky if variance > 0.2
  8. Write GitHub Actions workflow: PR triggers smoke, schedule triggers full
  9. PR comment shows pass rate, tool accuracy, p95 latency, delta vs baseline
  10. Fail CI and block merge if pass rate < threshold (smoke: 90%, full: 85%)
  11. Upload failed artifacts (traces/prompts paths must be stable)
  12. Quarantine flaky cases; exclude from pass rate but track as tech debt
  13. Bump version when dataset changes; PR report shows both old and new curves
constraints:
  - Do NOT use production PII data; redact or use synthetic data
  - Do NOT hardcode API keys in CI (use GitHub Secrets)
  - Nightly full suite and MR smoke suite use separate quotas; never share

Back to skills More skills