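Evaluation and Benchmarks
This page gives the complete JSON format for evaluation cases (id/input/expected/eval_fn/tags/metadata), implementations of three evaluation functions (exact_match/regex_match/llm_judge), a GitHub Actions CI integration snippet, and the JSON report format used to show eval results in a PR.
Complete JSON format of an evaluation case. Each case contains id/input/expected/eval_fn/tags/metadata; the example below sets eval_fn to tool_call_match, a tool-call scorer separate from the three functions implemented later on this page: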
{
  "id": "tc_001",
  "input": {
    "messages": [
      {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    "tools": ["get_weather"]
  },
  "expected": {
    "tool_calls": [
      {"name": "get_weather", "arguments": {"city": "Beijing"}}
    ],
    "output_contains": ["temperature", "weather"],
    "output_not_contains": ["Sorry, I can't", "I don't know"]
  },
  "eval_fn": "tool_call_match",
  "tags": ["smoke", "tool-use", "weather"],
  "metadata": {
    "created_by": "alice@example.com",
    "created_at": "2026-04-01",
    "last_reviewed": "2026-04-10",
    "model_version": "gpt-4o-2024-11-20",
    "seed": 42,
    "pii_level": "none",
    "license": "internal"
  }
}
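A case file is easier to trust when its shape is checked before any model call is made. The following is a minimal sketch of such a check; `validate_case` and the two constant tuples are hypothetical names, and treating model_version/seed/pii_level/license as the required metadata keys is an assumption drawn from the example case rather than a published schema.

```python
# Hypothetical validator for the case format shown above (not a library API).
REQUIRED_FIELDS = ("id", "input", "expected", "eval_fn", "tags", "metadata")
REQUIRED_METADATA = ("model_version", "seed", "pii_level", "license")  # assumed set


def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case is well-formed."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in case]
    metadata = case.get("metadata", {})
    problems += [f"missing metadata key: {key}"
                 for key in REQUIRED_METADATA if key not in metadata]
    if not isinstance(case.get("tags", []), list):
        problems.append("tags must be a list")
    return problems


case = {
    "id": "tc_001",
    "input": {"messages": [{"role": "user", "content": "weather?"}], "tools": ["get_weather"]},
    "expected": {"tool_calls": [{"name": "get_weather", "arguments": {"city": "Beijing"}}]},
    "eval_fn": "tool_call_match",
    "tags": ["smoke"],
    "metadata": {"model_version": "gpt-4o-2024-11-20", "seed": 42,
                 "pii_level": "none", "license": "internal"},
}
print(validate_case(case))  # []
```

Returning a list of problems instead of raising on the first one lets a loader report every defect in a case file in a single pass.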
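        [ Case set / version and seed pinning ]
                          │
                          ▼
               [ Runner batch execution ]
                          │
                  ┌───────┴───────┐
                  ▼               ▼
        [ Rule scoring ]   [ Judge / spot check ]
                  │               │
                  └───────┬───────┘
                          ▼
         [ Aggregate metrics & threshold gate ]
                          │
                  ┌───────┴───────┐
                  ▼               ▼
  [ Trends and baseline ]   [ Failure artifact archive ]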
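The pipeline above can be sketched in a few lines of Python. This is an illustrative outline only: `run_suite`, `contains_all`, and `fake_agent` are hypothetical names, the agent is a stub standing in for a real model call, and the Judge branch and trend storage are left out.

```python
from typing import Callable


def contains_all(actual: str, expected: dict) -> dict:
    """Rule-based scorer: every expected substring must appear in the output."""
    missing = [s for s in expected.get("output_contains", []) if s not in actual]
    return {"passed": not missing, "reason": f"missing: {missing}" if missing else "ok"}


def run_suite(cases: list[dict], agent: Callable[[dict], str],
              scorers: dict[str, Callable[[str, dict], dict]],
              threshold: float) -> dict:
    """Run every case, score it with the function named by eval_fn, then gate."""
    results = []
    for case in cases:
        actual = agent(case["input"])                       # runner step
        verdict = scorers[case["eval_fn"]](actual, case["expected"])  # scoring step
        results.append({"id": case["id"], **verdict})
    passed_count = sum(1 for r in results if r["passed"])
    pass_rate = passed_count / len(results) if results else 0.0
    return {
        "total": len(results),
        "passed_count": passed_count,
        "pass_rate": pass_rate,                              # fraction in [0, 1]
        "passed": pass_rate >= threshold,                    # threshold gate
        "failed_cases": [r for r in results if not r["passed"]],  # to be archived
    }


def fake_agent(case_input: dict) -> str:
    """Stub agent; a real runner would call the model here."""
    return "The temperature in Beijing is 18°C."


cases = [
    {"id": "tc_a", "input": {}, "eval_fn": "contains_all",
     "expected": {"output_contains": ["temperature"]}},
    {"id": "tc_b", "input": {}, "eval_fn": "contains_all",
     "expected": {"output_contains": ["humidity"]}},
]
report = run_suite(cases, fake_agent, {"contains_all": contains_all}, threshold=0.9)
print(report["pass_rate"], report["passed"])  # 0.5 False
```

Looking scorers up by the case's eval_fn string keeps the case files declarative: adding a new scoring rule means registering one more function, not editing the runner loop.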
Three evaluation function implementations
import json
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. exact_match: output must equal the expected string (suits structured output)
def exact_match(actual: str, expected: str) -> dict:
    passed = actual.strip() == expected.strip()
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "reason": "exact match" if passed else f"expected: {expected!r}, actual: {actual!r}"}
# 2. regex_match: check with regexes that the output contains / omits given patterns
def regex_match(actual: str, rules: dict) -> dict:
    """
    rules format:
    {
        "must_contain": ["pattern1", "pattern2"],
        "must_not_contain": ["bad_pattern"],
    }
    Matching is case-insensitive.
    """
    failures = []
    for pattern in rules.get("must_contain", []):
        if not re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"missing required pattern: {pattern!r}")
    for pattern in rules.get("must_not_contain", []):
        if re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"contains forbidden pattern: {pattern!r}")
    passed = len(failures) == 0
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "failures": failures}
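# 3. llm_judge: use GPT-4o to judge output quality (suits natural-language output)
def llm_judge(
    question: str,
    actual: str,
    criteria: str,
    model: str = "gpt-4o",  # pin a dated snapshot (e.g. gpt-4o-2024-11-20) for reproducible runs
) -> dict:
    """
    Use an LLM to judge whether the output meets the criteria.
    Returns score (0-1), passed, and reason.
    """
    prompt = f"""Judge whether the answer below meets the criteria.
Question: {question}
Answer: {actual}
Criteria: {criteria}
Respond in JSON format:
{{"score": 0-10, "passed": true/false, "reason": "one-sentence rationale"}}
Output JSON only, nothing else."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    # The judge scores on 0-10; clamp out-of-range values, then normalize to 0-1
    result["score"] = min(max(float(result["score"]), 0.0), 10.0) / 10.0
    return result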
# Usage example
case = {
    "question": "What's the weather like in Beijing today?",
    "actual": "It is 18°C and sunny in Beijing today, with good air quality.",
    "rules": {"must_contain": ["temperature|°C", "weather|sunny|cloudy|rain"],
              "must_not_contain": ["I don't know", "cannot answer"]}
}
print(regex_match(case["actual"], case["rules"]))
# {'passed': True, 'score': 1.0, 'failures': []}
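The example case earlier on this page sets eval_fn to tool_call_match, which the three functions above do not cover. A minimal sketch of what such a scorer could look like follows; the signature and the strict, order-sensitive comparison are assumptions, not a documented implementation.

```python
def tool_call_match(actual_calls: list[dict], expected_calls: list[dict]) -> dict:
    """Hypothetical scorer: compare tool calls position by position.

    Each call is a dict with "name" and "arguments", as in the case's
    expected.tool_calls. Order and argument values must match exactly.
    """
    failures = []
    if len(actual_calls) != len(expected_calls):
        failures.append(
            f"expected {len(expected_calls)} tool call(s), got {len(actual_calls)}")
    for i, (actual, expected) in enumerate(zip(actual_calls, expected_calls)):
        if actual.get("name") != expected["name"]:
            failures.append(
                f"call {i}: expected tool {expected['name']!r}, got {actual.get('name')!r}")
        elif actual.get("arguments", {}) != expected.get("arguments", {}):
            failures.append(f"call {i}: arguments differ for {expected['name']!r}")
    passed = not failures
    return {"passed": passed, "score": 1.0 if passed else 0.0, "failures": failures}


expected = [{"name": "get_weather", "arguments": {"city": "Beijing"}}]
good = [{"name": "get_weather", "arguments": {"city": "Beijing"}}]
bad = [{"name": "get_weather", "arguments": {"city": "Shanghai"}}]
print(tool_call_match(good, expected)["passed"])  # True
print(tool_call_match(bad, expected)["failures"])
```

Exact argument equality is the strictest choice; a suite that tolerates extra optional arguments would compare only the expected keys instead.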
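Metrics and scoring
Primary metrics (task success rate, tool-call accuracy) are reported separately from secondary metrics (p95 latency, tokens, cost); Judge scores must be reported with a confidence interval or a two-rater agreement measure. Operational definitions of the common metrics:
- Task success: the expected behavior is satisfied and no prohibited item is triggered.
- Tool accuracy: the call name, the argument schema, and the business preconditions all pass static or replay validation.
- Judge agreement score: a sample counts as valid only when the variance across two judges, or across repeated samples of the same prompt, is below the threshold.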
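The Judge agreement rule can be sketched as a small helper. This is an illustration under stated assumptions: `judge_consistency` is a hypothetical name, scores are assumed to be normalized to 0-1, and population variance is used because the definition above does not say which variance is meant.

```python
from statistics import pvariance


def judge_consistency(scores: list[float], max_variance: float) -> dict:
    """Decide whether repeated Judge scores for one case agree closely enough.

    scores: Judge scores in [0, 1] from two judges or from repeated sampling.
    max_variance: the sample is valid only if the variance is below this value.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to measure agreement")
    variance = pvariance(scores)  # population variance (an assumption)
    return {
        "mean": sum(scores) / len(scores),
        "variance": variance,
        "valid": variance < max_variance,
    }


print(judge_consistency([0.8, 0.8, 0.8], max_variance=0.2)["valid"])  # True
print(judge_consistency([0.0, 1.0, 0.0], max_variance=0.2)["valid"])  # False
```

With three scores in [0, 1] the population variance cannot exceed 2/9 (about 0.22), so a variance threshold of 0.2, the figure used in the skill card on this page, trips only on near-total disagreement; a tighter threshold or a max-minus-min spread check may be closer to what a team wants.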
Pass rate = passed cases / total cases; for example, 42 passed out of 50 gives 84.0%.
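A pass rate measured on a small suite is noisy, so it can help to report an interval next to the point value. The sketch below uses the Wilson score interval, which is one common choice and not something this page prescribes; `wilson_interval` is a hypothetical helper, and the counts 42 of 50 are illustrative numbers that happen to give 84.0%.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


passed_count, total = 42, 50  # illustrative counts, not measured data
low, high = wilson_interval(passed_count, total)
print(f"pass rate {100 * passed_count / total:.1f}%, "
      f"interval {100 * low:.1f}% to {100 * high:.1f}%")
```

On 50 cases the interval spans roughly twenty percentage points, which is a reminder that a one- or two-point move against the baseline on a smoke suite is often within noise.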
CI integration and PR eval report
GitHub Actions workflow snippet: the smoke suite runs on every PR, and the full suite runs nightly:
# .github/workflows/eval.yml
name: Agent Eval
on:
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"  # run the full suite daily at 02:00 UTC
jobs:
  eval-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # lets the github-script step post the PR comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      - name: Run smoke eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVAL_MODEL: "gpt-4o-2024-11-20"
          EVAL_SEED: "42"
        run: |
          python -m eval.runner \
            --suite smoke \
            --threshold 0.90 \
            --output eval-report.json
      - name: Post PR comment with eval results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
            const emoji = report.passed ? '✅' : '❌';
            const body = `## ${emoji} Eval Report\n\n` +
              `| Metric | Current | Baseline | Change |\n` +
              `|------|------|------|------|\n` +
              `| Pass rate | ${report.pass_rate}% | ${report.baseline_pass_rate}% | ${report.delta > 0 ? '+' : ''}${report.delta}% |\n` +
              `| Tool-call accuracy | ${report.tool_accuracy}% | - | - |\n` +
              `| Latency p95 | ${report.latency_p95_ms}ms | - | - |\n\n` +
              (report.failed_cases.length > 0
                ? `**Failed cases (first 3):**\n` + report.failed_cases.slice(0, 3).map(c =>
                    `- \`${c.id}\`: ${c.reason}`).join('\n')
                : 'All cases passed');
            // await so the step does not finish before the request completes
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
      - name: Fail if below threshold
        run: python -c "import json,sys; r=json.load(open('eval-report.json')); sys.exit(0 if r['passed'] else 1)"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json
  eval-full:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install -r requirements-eval.txt
      - run: python -m eval.runner --suite full --threshold 0.85 --output eval-full-report.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-full-report-${{ github.run_id }}
          path: eval-full-report.json
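JSON report format for the PR regression gate. Rates are percentages, so threshold appears as 90.0 here while the runner's --threshold flag in the workflow takes the fraction 0.90; failed_cases is abbreviated to one of the three failing entries: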
{
  "suite": "smoke",
  "model": "gpt-4o-2024-11-20",
  "seed": 42,
  "run_at": "2026-04-11T10:30:00Z",
  "total": 50,
  "passed_count": 47,
  "pass_rate": 94.0,
  "baseline_pass_rate": 92.0,
  "delta": 2.0,
  "passed": true,
  "threshold": 90.0,
  "tool_accuracy": 96.0,
  "latency_p95_ms": 1230,
  "failed_cases": [
    {
      "id": "tc_023",
      "eval_fn": "llm_judge",
      "score": 0.4,
      "reason": "The answer lacks concrete values and gives only a qualitative description",
      "actual": "The weather is nice",
      "expected_pattern": "temperature|°C"
    }
  ],
  "artifacts": {
    "traces": "gs://eval-artifacts/run-20260411/traces/",
    "prompts": "gs://eval-artifacts/run-20260411/prompts/"
  }
}
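The gate fields in this report (pass_rate, delta, passed) follow from the per-case results by simple arithmetic. A minimal sketch, assuming each result is a dict with id, passed, and reason; `build_report` is a hypothetical helper and fills only the fields it can derive.

```python
def build_report(results: list[dict], baseline_pass_rate: float,
                 threshold: float) -> dict:
    """Derive the gate fields of the report; all rates are percentages."""
    total = len(results)
    passed_count = sum(1 for r in results if r["passed"])
    pass_rate = round(100.0 * passed_count / total, 1) if total else 0.0
    return {
        "total": total,
        "passed_count": passed_count,
        "pass_rate": pass_rate,
        "baseline_pass_rate": baseline_pass_rate,
        "delta": round(pass_rate - baseline_pass_rate, 1),  # change vs. baseline
        "passed": pass_rate >= threshold,                    # the merge gate
        "threshold": threshold,
        "failed_cases": [r for r in results if not r["passed"]],
    }


# 47 passing and 3 failing results, mirroring the counts in the report above.
results = [{"id": f"tc_{i:03d}", "passed": i > 3, "reason": ""} for i in range(1, 51)]
report = build_report(results, baseline_pass_rate=92.0, threshold=90.0)
print(report["pass_rate"], report["delta"], report["passed"])  # 94.0 2.0 True
```

Keeping threshold and pass_rate in the same unit inside the report avoids the fraction-versus-percent mix-up at the point where the gate is evaluated.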
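Golden tests and snapshots
Golden paths use fixed inputs and expected trajectories (tool sequence, key intermediate fields); for natural-language output, prefer "structured subset + loose matching" over a whole-text diff. Snapshot updates must go through two-person review or an automatically labeled PR, to prevent silent drift.
Note: when the Judge or the sampling temperature is not pinned, do not take byte-level snapshots of long natural-language output; assert on JSON fields and citation integrity first.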
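The "structured subset" idea can be sketched as a recursive comparison that checks only the fields the golden file names and ignores everything else in the actual output. `subset_match` is a hypothetical helper; requiring lists to match in length and order is an assumption that suits tool sequences.

```python
def subset_match(expected, actual, path: str = "$") -> list[str]:
    """Return mismatches where `actual` fails to contain `expected`.

    Dicts: every expected key must exist and match; extra actual keys are ignored.
    Lists: same length, compared element by element (order matters).
    Scalars: compared with ==.
    """
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [f"{path}: expected object, got {type(actual).__name__}"]
        mismatches = []
        for key, value in expected.items():
            if key not in actual:
                mismatches.append(f"{path}.{key}: missing")
            else:
                mismatches += subset_match(value, actual[key], f"{path}.{key}")
        return mismatches
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(actual) != len(expected):
            return [f"{path}: list type or length differs"]
        mismatches = []
        for i, (exp_item, act_item) in enumerate(zip(expected, actual)):
            mismatches += subset_match(exp_item, act_item, f"{path}[{i}]")
        return mismatches
    return [] if expected == actual else [f"{path}: expected {expected!r}, got {actual!r}"]


golden = {"tool_calls": [{"name": "get_weather", "arguments": {"city": "Beijing"}}]}
actual = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Beijing"}, "id": "call_1"}],
    "text": "It is 18°C and sunny.",  # free text is not snapshotted
}
print(subset_match(golden, actual))  # []
```

Because the free-text field is never compared, rewording in the model's answer does not invalidate the snapshot, while a changed tool name or argument still fails.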
---
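name: eval-harness-ci
description: "Design Agent eval sets and regression gates; input: a feature description or an existing case set; output: a layered case set plus a CI workflow; forbidden: using production PII data as test cases"
version: "1.1.0"
triggers:
  - "how to evaluate.*Agent|design.*eval (set|suite)"
  - "CI.*eval|regression.*gate|eval.*harness"
steps:
  1. Layer the suites by purpose: smoke (<5min) / regression (<60min) / adversarial (run ad hoc)
  2. Every case contains the six fields id/input/expected/eval_fn/tags/metadata
  3. metadata must include model_version/seed/pii_level/license
  4. Implement exact_match: string equality, for structured JSON output
  5. Implement regex_match: must_contain/must_not_contain rule checks
  6. Implement llm_judge: returns score (0-1)/passed/reason, temperature=0
  7. Calibrate llm_judge results: run the same case 3 times and mark it flaky if the variance exceeds 0.2
  8. Write the GitHub Actions workflow: PRs trigger smoke, the schedule triggers full
  9. The PR comment shows pass rate, tool accuracy, p95 latency, and the diff against the baseline
  10. If the pass rate falls below the threshold (smoke: 90%, full: 85%), CI fails and blocks the merge
  11. Upload failure artifacts (fixed traces/prompts paths)
  12. Add flaky cases to a quarantine list; they do not count toward the pass rate but are recorded as technical debt
  13. Bump the version number when the dataset changes; the PR report shows the old and new version curves side by side
constraints:
  - Never use production PII data; it must be anonymized or replaced with synthetic data
  - Never hard-code API keys in CI (use GitHub Secrets)
  - The nightly full suite and the PR smoke suite use separate quotas; sharing or preempting quota is forbidden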
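The quarantine rule in step 12 can be sketched as follows. This is an illustration only: `gated_pass_rate` is a hypothetical helper, and the quarantine list is assumed to be a set of case ids kept alongside the suite.

```python
def gated_pass_rate(results: list[dict], quarantine: set[str]) -> dict:
    """Compute the pass rate over non-quarantined cases and list the skipped ids."""
    counted = [r for r in results if r["id"] not in quarantine]
    skipped = [r["id"] for r in results if r["id"] in quarantine]
    passed = sum(1 for r in counted if r["passed"])
    rate = 100.0 * passed / len(counted) if counted else 0.0
    return {
        "pass_rate": rate,        # percent, over counted cases only
        "counted": len(counted),
        "quarantined": skipped,   # surfaced in the report as technical debt
    }


results = [
    {"id": "tc_001", "passed": True},
    {"id": "tc_002", "passed": True},
    {"id": "tc_003", "passed": False},  # known flaky case
    {"id": "tc_004", "passed": True},
]
summary = gated_pass_rate(results, quarantine={"tc_003"})
print(summary)
```

Returning the quarantined ids with the rate keeps the exclusion visible in every report, so the list cannot grow unnoticed.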