Evaluation and Benchmarks

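This page gives the full JSON format of an eval case (id/input/expected/eval_fn/tags/metadata), implementations of three eval functions (exact_match/regex_match/llm_judge), a GitHub Actions CI integration snippet, and the JSON report format used to show eval results in a PR.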

Full JSON format of an eval case. Each case contains id/input/expected/eval_fn/tags/metadata:

{
  "id": "tc_001",
  "input": {
    "messages": [
      {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    "tools": ["get_weather"]
  },
  "expected": {
    "tool_calls": [
      {"name": "get_weather", "arguments": {"city": "Beijing"}}
    ],
    "output_contains": ["temperature", "weather"],
    "output_not_contains": ["Sorry, I can't", "I don't know"]
  },
  "eval_fn": "tool_call_match",
  "tags": ["smoke", "tool-use", "weather"],
  "metadata": {
    "created_by": "alice@example.com",
    "created_at": "2026-04-01",
    "last_reviewed": "2026-04-10",
    "model_version": "gpt-4o-2024-11-20",
    "seed": 42,
    "pii_level": "none",
    "license": "internal"
  }
}
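
The case above names `tool_call_match` as its `eval_fn`, which is not one of the three functions implemented on this page. The sketch below shows one way such a scorer might work. It assumes the agent's tool calls arrive as a list of `{"name", "arguments"}` dicts alongside the final text. The signature, the order-insensitive matching, and the plain substring checks are all assumptions, not the page's actual implementation.

```python
def tool_call_match(actual_calls: list[dict], actual_text: str, expected: dict) -> dict:
    """Hypothetical scorer for an `expected` block shaped like the case above."""
    failures = []
    # Every expected tool call must appear among the actual calls (order ignored).
    for want in expected.get("tool_calls", []):
        found = any(
            got.get("name") == want["name"]
            and got.get("arguments") == want.get("arguments", {})
            for got in actual_calls
        )
        if not found:
            failures.append(f"missing tool call: {want['name']}")
    # Plain substring checks on the final text.
    for needle in expected.get("output_contains", []):
        if needle not in actual_text:
            failures.append(f"output missing required text: {needle!r}")
    for needle in expected.get("output_not_contains", []):
        if needle in actual_text:
            failures.append(f"output contains forbidden text: {needle!r}")
    passed = not failures
    return {"passed": passed, "score": 1.0 if passed else 0.0, "failures": failures}


# Illustrative data only; not taken from a real run.
example_expected = {
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Beijing"}}],
    "output_contains": ["temperature"],
    "output_not_contains": ["I don't know"],
}
example_result = tool_call_match(
    [{"name": "get_weather", "arguments": {"city": "Beijing"}}],
    "The temperature in Beijing is 18°C.",
    example_expected,
)
```

Comparing arguments with `==` is strict; a real scorer might normalize values (for example city-name variants) before comparing.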
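
    [ Case set / version & seed pinning ]
                      │
                      ▼
         [ Runner: batch execution ]
                      │
            ┌─────────┴─────────┐
            ▼                   ▼
    [ Rule scoring ] [ Judge / spot check ]
            │                   │
            └─────────┬─────────┘
                      ▼
   [ Aggregate metrics & threshold gate ]
                      │
          ┌───────────┴─────────────┐
          ▼                         ▼
[ Trends & baseline ] [ Failure artifact archive ]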
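
The flow in the diagram can be sketched as a small runner: execute each case, dispatch to a scorer by its `eval_fn` name, then aggregate and apply the threshold gate. Everything here is hypothetical (`SCORERS`, `contains_all`, `run_suite`, and the `agent` callable); the `eval.runner` module used in the CI workflow is not shown on this page, so this is only an illustration of the shape.

```python
from typing import Callable


def contains_all(actual: str, expected: dict) -> dict:
    """Toy rule-based scorer: every expected substring must be present."""
    missing = [s for s in expected.get("output_contains", []) if s not in actual]
    return {
        "passed": not missing,
        "score": 0.0 if missing else 1.0,
        "reason": f"missing: {missing}" if missing else "ok",
    }


# Registry mapping a case's eval_fn name to a scorer (hypothetical).
SCORERS: dict[str, Callable[[str, dict], dict]] = {"contains_all": contains_all}


def run_suite(cases: list[dict], agent: Callable[[dict], str], threshold: float) -> dict:
    """Run every case, score it, aggregate, and apply the threshold gate."""
    failed = []
    for case in cases:
        actual = agent(case["input"])
        result = SCORERS[case["eval_fn"]](actual, case["expected"])
        if not result["passed"]:
            failed.append({"id": case["id"], "reason": result["reason"], "actual": actual})
    total = len(cases)
    pass_rate = (total - len(failed)) / total if total else 0.0
    return {
        "total": total,
        "pass_rate": pass_rate,            # fraction in [0, 1]
        "passed": pass_rate >= threshold,  # the gate
        "failed_cases": failed,            # kept for artifact archiving
    }


# Stand-in agent and cases, for illustration only.
demo_cases = [
    {"id": "tc_a", "input": {"q": "weather"}, "eval_fn": "contains_all",
     "expected": {"output_contains": ["sunny"]}},
    {"id": "tc_b", "input": {"q": "news"}, "eval_fn": "contains_all",
     "expected": {"output_contains": ["headline"]}},
]
demo_summary = run_suite(demo_cases, lambda _inp: "It is sunny.", threshold=0.9)
```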

Three eval function implementations

import re, json
from openai import OpenAI

client = OpenAI()

# 1. exact_match: output equals the expected string after stripping surrounding whitespace
#    (suited to structured output)
def exact_match(actual: str, expected: str) -> dict:
    passed = actual.strip() == expected.strip()
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "reason": "exact match" if passed else f"expected: {expected!r}, actual: {actual!r}"}

# 2. regex_match: use regular expressions to check that the output contains / omits given patterns
def regex_match(actual: str, rules: dict) -> dict:
    """
    rules format:
      {
        "must_contain": ["pattern1", "pattern2"],
        "must_not_contain": ["bad_pattern"],
      }
    """
    failures = []
    for pattern in rules.get("must_contain", []):
        if not re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"missing required pattern: {pattern!r}")
    for pattern in rules.get("must_not_contain", []):
        if re.search(pattern, actual, re.IGNORECASE):
            failures.append(f"contains forbidden pattern: {pattern!r}")
    passed = len(failures) == 0
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "failures": failures}

# 3. llm_judge: have an LLM (GPT-4o by default) judge output quality (suited to natural-language output)
def llm_judge(
    question: str,
    actual: str,
    criteria: str,
    model: str = "gpt-4o",
) -> dict:
    """
    Ask an LLM whether the output meets the given criteria.
    Returns score (0-1), passed, and reason.
    """
    prompt = f"""Judge whether the answer below meets the criteria.

Question: {question}

Answer: {actual}

Criteria: {criteria}

Reply in JSON:
{{"score": 0-10, "passed": true/false, "reason": "one-sentence justification"}}
Output JSON only, nothing else."""

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    # Normalize the judge's 0-10 score to 0-1 and clamp out-of-range values.
    result["score"] = min(max(float(result["score"]) / 10.0, 0.0), 1.0)
    return result

# Usage example
case = {
    "question": "What's the weather like in Beijing today?",
    "actual": "Beijing is 18°C and sunny today, with good air quality.",
    "rules": {"must_contain": ["temperature|°C", "weather|sunny|cloudy|rain"], "must_not_contain": ["I don't know", "cannot answer"]}
}
print(regex_match(case["actual"], case["rules"]))
# {'passed': True, 'score': 1.0, 'failures': []}
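
`exact_match` compares raw strings, so two JSON outputs that differ only in key order or whitespace count as a mismatch. If that is too strict for structured output, one option is to compare the parsed values instead. The helper below, `json_exact_match`, is a hypothetical variant and not part of the three functions above.

```python
import json


def json_exact_match(actual: str, expected: str) -> dict:
    """Compare two JSON documents by parsed value instead of raw text."""
    try:
        actual_value = json.loads(actual)
        expected_value = json.loads(expected)
    except json.JSONDecodeError as exc:
        # Either side failing to parse counts as a failed case.
        return {"passed": False, "score": 0.0, "reason": f"invalid JSON: {exc}"}
    passed = actual_value == expected_value
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "parsed values are equal" if passed else "parsed values differ",
    }


same = json_exact_match('{"a": 1, "b": [1, 2]}', '{"b":[1,2],"a":1}')
different = json_exact_match('{"a": 1}', '{"a": 2}')
broken = json_exact_match("not json", '{"a": 1}')
```

Note that Python equality treats `1` and `1.0` as equal, and list order still matters; tighten or loosen the comparison if either behavior is unwanted.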

Metrics and scoring

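Report primary metrics (task success rate, tool-call accuracy) separately from secondary metrics (p95 latency, token count, cost). Judge scores should be reported with a confidence interval or with the agreement rate between two human raters.

Pass rate = passed cases / total cases, reported as a percentage. In the sample report later on this page, 47 / 50 = 94.0%.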
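
As a rough illustration of the two reporting options mentioned above, the sketch below computes a Wilson score interval for a pass rate and Cohen's kappa for two raters' pass/fail labels. Choosing these particular statistics is an assumption; the page only asks for "a confidence interval or two-rater agreement" without naming a method.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def cohen_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Chance-corrected agreement between two raters' pass/fail labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


low, high = wilson_interval(47, 50)
kappa_same = cohen_kappa([True, True, False, False], [True, True, False, False])
kappa_chance = cohen_kappa([True, True, False, False], [True, False, True, False])
```

With only 50 cases the interval around 94% is wide, which is one reason to avoid reading small deltas against the baseline as real regressions.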

CI integration and PR eval report

GitHub Actions workflow snippet: run the smoke suite on PRs and the full suite nightly:

# .github/workflows/eval.yml
name: Agent Eval

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"   # run the full suite daily at 02:00 UTC

jobs:
  eval-smoke:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # lets the github-script step post the PR comment
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt
      - name: Run smoke eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVAL_MODEL: "gpt-4o-2024-11-20"
          EVAL_SEED: "42"
        run: |
          python -m eval.runner \
            --suite smoke \
            --threshold 0.90 \
            --output eval-report.json
      - name: Post PR comment with eval results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('eval-report.json', 'utf8'));
            const emoji = report.passed ? '✅' : '❌';
            const body = `## ${emoji} Eval Report\n\n` +
              `| Metric | Current | Baseline | Change |\n` +
              `|------|------|------|------|\n` +
              `| Pass rate | ${report.pass_rate}% | ${report.baseline_pass_rate}% | ${report.delta > 0 ? '+' : ''}${report.delta}% |\n` +
              `| Tool-call accuracy | ${report.tool_accuracy}% | - | - |\n` +
              `| Latency p95 | ${report.latency_p95_ms}ms | - | - |\n\n` +
              (report.failed_cases.length > 0
                ? `**Failed cases (first 3):**\n` + report.failed_cases.slice(0,3).map(c =>
                    `- \`${c.id}\`: ${c.reason}`).join('\n')
                : 'All cases passed');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
      - name: Fail if below threshold
        run: python -c "import json,sys; r=json.load(open('eval-report.json')); sys.exit(0 if r['passed'] else 1)"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json

  eval-full:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install -r requirements-eval.txt
      - run: python -m eval.runner --suite full --threshold 0.85 --output eval-full-report.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVAL_MODEL: "gpt-4o-2024-11-20"   # pin the same model and seed as the smoke job
          EVAL_SEED: "42"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-full-report-${{ github.run_id }}
          path: eval-full-report.json

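JSON report format for the PR regression gate. Rates in the report are percentages (0 to 100), so the `--threshold 0.90` passed in the workflow presumably corresponds to `"threshold": 90.0` here: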

{
  "suite": "smoke",
  "model": "gpt-4o-2024-11-20",
  "seed": 42,
  "run_at": "2026-04-11T10:30:00Z",
  "total": 50,
  "passed_count": 47,
  "pass_rate": 94.0,
  "baseline_pass_rate": 92.0,
  "delta": 2.0,
  "passed": true,
  "threshold": 90.0,
  "tool_accuracy": 96.0,
  "latency_p95_ms": 1230,
  "failed_cases": [
    {
      "id": "tc_023",
      "eval_fn": "llm_judge",
      "score": 0.4,
      "reason": "Answer lacks concrete values and gives only a qualitative description",
      "actual": "The weather is nice",
      "expected_pattern": "temperature|°C"
    }
  ],
  "artifacts": {
    "traces": "gs://eval-artifacts/run-20260411/traces/",
    "prompts": "gs://eval-artifacts/run-20260411/prompts/"
  }
}
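
Because the PR comment and the gate both trust this file, it can help to sanity-check the report's internal arithmetic before posting it. The checker below is a hypothetical sketch: `check_report` is not part of the page's tooling, and it assumes `passed` is determined solely by `pass_rate >= threshold`.

```python
def check_report(report: dict, tol: float = 0.05) -> list[str]:
    """Return a list of inconsistencies found in an eval report (empty if none)."""
    problems = []
    total = report["total"]
    if total <= 0:
        return ["total must be positive"]
    # pass_rate should equal passed_count / total, as a percentage.
    expected_rate = 100.0 * report["passed_count"] / total
    if abs(report["pass_rate"] - expected_rate) > tol:
        problems.append(f"pass_rate {report['pass_rate']} != {expected_rate:.1f}")
    # delta should be the difference from the baseline.
    expected_delta = report["pass_rate"] - report["baseline_pass_rate"]
    if abs(report["delta"] - expected_delta) > tol:
        problems.append(f"delta {report['delta']} != {expected_delta:.1f}")
    # The gate flag should agree with the threshold comparison (assumed rule).
    if report["passed"] != (report["pass_rate"] >= report["threshold"]):
        problems.append("passed flag disagrees with pass_rate >= threshold")
    return problems


# Values copied from the sample report above.
sample_report = {
    "total": 50, "passed_count": 47, "pass_rate": 94.0,
    "baseline_pass_rate": 92.0, "delta": 2.0,
    "passed": True, "threshold": 90.0,
}
sample_problems = check_report(sample_report)
```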

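Golden tests and snapshots

Golden-path tests pair a fixed input with an expected trajectory (the tool-call sequence and key intermediate fields). For natural-language output, assert on a structured subset with loose matching rather than diffing the whole text. Snapshot updates must go through two-person review or an automatically applied PR label, so that drift cannot slip in silently.

Note: when the Judge model or the sampling temperature is not pinned, do not take byte-level snapshots of long natural-language output; assert on JSON fields and citation integrity first.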
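
One way to read "structured subset plus loose matching" is a recursive subset check: the golden trajectory lists only the fields that must match, and any extra fields in the actual run are ignored. The sketch below assumes trajectories are lists of dicts; `subset_match` and the field names (`tool`, `arguments`, `latency_ms`) are hypothetical.

```python
def subset_match(actual, expected) -> bool:
    """True if everything in `expected` also appears in `actual`, recursively.

    Dicts may carry extra keys in `actual`; lists must have the same length
    and match element by element; scalars must be equal.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and subset_match(actual[key], value)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(actual) == len(expected)
            and all(subset_match(a, e) for a, e in zip(actual, expected))
        )
    return actual == expected


# Golden trajectory pins only the tool sequence and key arguments.
golden_steps = [{"tool": "get_weather", "arguments": {"city": "Beijing"}}]
actual_steps = [
    {"tool": "get_weather", "arguments": {"city": "Beijing"}, "latency_ms": 412}
]
golden_ok = subset_match(actual_steps, golden_steps)
```

Volatile fields such as latency or free-form text stay out of the golden file, so the snapshot changes only when the tool sequence or pinned fields change.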
---
name: eval-harness-ci
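description: "Design Agent eval sets and regression gates. Input: a feature description or an existing case set. Output: a tiered case set plus a CI workflow. Forbidden: using production PII data as test cases"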
version: "1.1.0"
triggers:
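  # The patterns contain Chinese keywords and are kept verbatim; English glosses follow each one.
  - "如何评测.*Agent|设计.*评测集"   # "how to evaluate ... Agent" | "design ... eval set"
  - "CI.*eval|回归.*门禁|eval.*harness"   # "CI ... eval" | "regression ... gate" | "eval ... harness"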
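steps:
  1. Tier cases by purpose into smoke (<5 min), regression (<60 min), and adversarial (run ad hoc)
  2. Each case has six fields, id/input/expected/eval_fn/tags/metadata
  3. metadata must include model_version/seed/pii_level/license
  4. Implement exact_match as strict string equality, for structured JSON output
  5. Implement regex_match with must_contain/must_not_contain rule checks
  6. Implement llm_judge returning score (0-1)/passed/reason, with temperature=0
  7. Calibrate llm_judge results by running each case 3 times and marking it flaky if the score variance exceeds 0.2
  8. Write the GitHub Actions workflow so that PRs trigger smoke and the schedule triggers full
  9. The PR comment shows pass rate, tool accuracy, p95 latency, and the diff against the baseline
  10. If the pass rate falls below the threshold (smoke 90%, full 85%), CI fails and blocks the merge
  11. Upload failure artifacts (with fixed traces/prompts paths)
  12. Add flaky cases to a quarantine list, excluded from the pass rate but tracked as technical debt
  13. Bump the version number when the dataset changes, and show old and new version curves side by side in the PR report
constraints:
  - Never use production PII data; anonymize it or use synthetic data
  - Never hardcode API keys in CI (use GitHub Secrets)
  - The nightly full suite and the PR smoke suite use separate quotas and must not compete for a shared one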
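
Steps 7 and 12 can be sketched as follows: flag a case as flaky when its repeated judge scores vary too much, then leave quarantined cases out of the pass rate. The step does not say whether "variance" means population or sample variance; population variance (`statistics.pvariance`) is assumed here, and the function names are hypothetical.

```python
from statistics import pvariance


def is_flaky(scores: list[float], max_variance: float = 0.2) -> bool:
    """Flag a case whose repeated judge scores (0-1) vary more than allowed."""
    return len(scores) > 1 and pvariance(scores) > max_variance


def pass_rate_excluding_quarantine(results: list[dict], quarantine: set[str]) -> float:
    """Pass rate (fraction) over cases that are not in the quarantine list."""
    counted = [r for r in results if r["id"] not in quarantine]
    if not counted:
        return 0.0
    return sum(1 for r in counted if r["passed"]) / len(counted)


# Three runs per case, illustrative scores only.
flips = is_flaky([0.0, 1.0, 1.0])   # judge flipped between fail and pass
stable = is_flaky([0.4, 0.8, 0.6])  # noisy, but below the variance limit
demo_results = [
    {"id": "a", "passed": True},
    {"id": "b", "passed": False},
    {"id": "c", "passed": True},
]
rate = pass_rate_excluding_quarantine(demo_results, quarantine={"b"})
```

With scores in 0-1 and three runs, a population variance above 0.2 is only reached when the judge flips between near 0 and near 1, so this threshold effectively catches pass/fail flips rather than mild noise.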
