agent-skill-infra: quality infrastructure for Agent Skills

· Design philosophy

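🤖 No hard-coded benchmarks

Keyword regexes are only a fallback. The real evaluation is done by an LLM: Anthropic Claude or GitHub Models (free). The tool adapts to the Skill, not the other way around.

📋 A score should be more than a single number

Every quality report includes an 8-dimension analysis plus actionable improvement suggestions. Third parties who submit an Issue receive the results automatically.

🆓 A free tier for everyone

GitHub Models (gpt-4o-mini) runs at zero cost via --gh-models. In CI, GITHUB_TOKEN is injected automatically, so no API key is needed.

📐 Compatible with the agentskills.io spec

Parses the YAML frontmatter (name/description) and the Markdown body of SKILL.md. Follows the progressive disclosure design (checks that the file is ≤500 lines).

🌐 Three channels covered

Local CLI, automated testing via GitHub Issues, and CI pipelines. One tool, run wherever you need it.

🔒 Security + versioning as a double safeguard

Cisco Scanner integration scans a Skill before installation; skill-version tracks changes (diff → behavioral differences → rollback).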
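
To make the SKILL.md parsing from the agentskills.io card concrete, here is a minimal sketch of splitting a file into frontmatter and body. It is not the project's actual parser. `parse_skill_md` and `ParsedSkill` are hypothetical names, the frontmatter handling only understands flat `key: value` pairs (a real tool would likely use a YAML library), and counting every line of the file toward the 500-line limit is an assumption.

```python
from dataclasses import dataclass

MAX_LINES = 500  # progressive-disclosure limit; counting the whole file is an assumption


@dataclass
class ParsedSkill:
    name: str
    description: str
    body: str
    total_lines: int
    within_line_limit: bool


def parse_skill_md(text: str) -> ParsedSkill:
    """Split SKILL.md text into frontmatter fields and Markdown body (hypothetical helper)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed with '---'") from None

    # Simplified: flat `key: value` pairs only, no nested YAML.
    fields = {}
    for line in lines[1:end]:
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")

    for required in ("name", "description"):
        if not fields.get(required):
            raise ValueError(f"missing required frontmatter field: {required}")

    body = "\n".join(lines[end + 1:]).strip()
    return ParsedSkill(
        name=fields["name"],
        description=fields["description"],
        body=body,
        total_lines=len(lines),
        within_line_limit=len(lines) <= MAX_LINES,
    )
```

Calling `parse_skill_md(open("my-skill/SKILL.md", encoding="utf-8").read())` would give the two required fields, the body, and a flag for the line-count check.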

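A three-tier evaluation engine that automatically selects the best option. Built on 8-dimension scoring plus actionable improvement suggestions.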

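Tier	Command	Engine	Cost	Language support
🚀 Quick self-check	`skill-quality <path>`	Keyword regex	Free	Mainly English
🧠 Semantic evaluation	`skill-quality <path> --llm`	Anthropic Claude	Pay-as-you-go	All languages
🆓 Free smart tier	`skill-quality <path> --gh-models`	GitHub Models	Zero cost	All languages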

Optional integrations: --lint (agent-skill-linter, 17 formatting rules), --security (Cisco Scanner security scan)
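
How the tool picks between the three tiers is not spelled out here, so the following is only one plausible sketch. `select_engine` and the returned engine labels are hypothetical. The silent fall-back to the keyword engine when a credential is missing is an assumption, based on the statement that keyword regexes are only a fallback. `ANTHROPIC_API_KEY` is the variable the Anthropic SDK reads by default; whether skill-quality uses it is also an assumption.

```python
import os
from typing import Mapping, Optional


def select_engine(
    llm: bool = False,
    gh_models: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick an evaluation tier from CLI flags and available credentials (hypothetical)."""
    env = os.environ if env is None else env
    # --llm: Anthropic Claude, usable only when an API key is present.
    if llm and env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    # --gh-models: GitHub Models, authenticated with GITHUB_TOKEN (injected in CI).
    if gh_models and env.get("GITHUB_TOKEN"):
        return "github-models"
    # No flag given, or the credential is missing: keyword-regex fallback.
    return "keyword"
```

A stricter variant could raise an error instead of falling back, so that a missing token in CI is noticed rather than silently producing a keyword-only score.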

# Use GitHub Models (free) for semantic evaluation with gpt-4o-mini
$ skill-quality my-skill/SKILL.md --gh-models --output json
{
  "skill_name": "darwin-skill",
  "overall_score": 0.91,
  "total_lines": 244,
  "token_estimate": 2103,
  "dimensions": [
    {
      "name": "trigger_precision",
      "score": 0.95,
      "findings": ["Clear trigger keywords in description"]
    },
    {
      "name": "helloandy_8dim_gh_models",
      "score": 0.91,
      "findings": [
        "Summary: Well-structured skill with clear descriptions",
        "[trigger_precision] → Add specific trigger keywords",
        "[example_quality] → Include more input/output examples"
      ]
    }
  ]
}
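
Because the report is plain JSON, a CI step can gate on it. The sketch below relies only on the field names visible in the report above (`overall_score`, `dimensions`, `name`, `score`). The function name `check_report` and both 0.8 thresholds are hypothetical choices, not defaults of the tool.

```python
import json


def check_report(raw_json: str, min_overall: float = 0.8, min_dimension: float = 0.8):
    """Return (ok, weak_dimensions) for a skill-quality JSON report (hypothetical helper)."""
    report = json.loads(raw_json)
    # Collect every dimension scoring below the per-dimension threshold.
    weak = [d["name"] for d in report.get("dimensions", []) if d["score"] < min_dimension]
    ok = report["overall_score"] >= min_overall and not weak
    return ok, weak


# Trimmed copy of the report shown above.
sample = json.dumps({
    "skill_name": "darwin-skill",
    "overall_score": 0.91,
    "dimensions": [
        {"name": "trigger_precision", "score": 0.95, "findings": []},
        {"name": "helloandy_8dim_gh_models", "score": 0.91, "findings": []},
    ],
})

ok, weak = check_report(sample)
print(ok, weak)
```

In a pipeline, the JSON printed by `skill-quality ... --output json` could be captured and passed to a helper like this, failing the job when `ok` is false.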

Runs the evals.json test suite. Five judgers, from keyword matching to semantic equivalence.

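Judger	Purpose	Example
keyword	Keyword matching (any/all mode)	Does the output contain the expected keywords?
schema	JSON Schema validation	Does the output conform to the JSON Schema?
llm	LLM-as-Judge (semantic equivalence)	Semantic equivalence judgment (Anthropic API)
flow	Tool-call sequence validation	Does the agent call tools in the expected order?
snapshot	Snapshot comparison (regression detection)	Does the output match the baseline snapshot?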
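
As an illustration of the simplest judger in the table, here is a minimal sketch of keyword matching with any/all modes. It is not the project's implementation. `keyword_judge` and `Verdict` are hypothetical names, case-insensitive substring matching is an assumption, and so is the pass rule (all keywords in `all` mode, at least one in `any` mode). The real judger may apply a score threshold instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    passed: bool
    score: float
    reason: str


def keyword_judge(output: str, keywords: List[str], mode: str = "any") -> Verdict:
    """Score an output by the fraction of expected keywords it contains (hypothetical)."""
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    if not keywords:
        raise ValueError("at least one keyword is required")

    text = output.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    score = hits / len(keywords)
    # Assumed pass rule: 'all' needs every keyword, 'any' needs at least one.
    passed = hits == len(keywords) if mode == "all" else hits > 0
    reason = f"{hits}/{len(keywords)} keywords" if hits else "no keywords"
    return Verdict(passed, score, reason)
```

For example, `keyword_judge("Report: 3 issues found, severity high", ["report", "issues", "severity", "cve"])` matches three of four keywords, so the score is 0.75 and the verdict passes in `any` mode but fails in `all` mode.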

$ skill-test run docs/examples/demo-skill/evals.json --adapter mock
Running 5 test cases with 'mock' adapter...

┌──────────────────────────────┬──────┬───────┬──────────┬────────────────┐
│ Case ID                      │ Pass │ Score │ Time(ms) │ Reason         │
├──────────────────────────────┼──────┼───────┼──────────┼────────────────┤
│ ✓ should-contain-report…     │ PASS │ 0.750 │       0  │ 3/4 keywords   │
│ ✓ should-detect-security…    │ PASS │ 0.800 │       0  │ 4/5 keywords   │
│ ✗ should-not-trigger…        │ FAIL │ 0.000 │       0  │ no keywords    │
│ ✗ output-should-be…          │ FAIL │ 0.250 │       0  │ 1/4 keywords   │
│ ✗ should-handle-error…       │ FAIL │ 0.000 │       0  │ no keywords    │
├──────────────────────────────┼──────┼───────┼──────────┼────────────────┤
│ Total: 5                     │ Pass │ Fail  │ Rate     │ Time: 0ms      │
│                              │  2   │  3    │ 40.0%    │                │
└──────────────────────────────┴──────┴───────┴──────────┴────────────────┘

The agentskills.io spec only defines metadata.version; version awareness fills the gaps: diff, rollback, and regression detection.

Subcommand	Function	Example
diff	Structured diff (table/json)	`skill-version diff . --old-ref HEAD~3`
check	diff + security analysis	`skill-version check . --security`
rollback	One-command rollback	`skill-version rollback . --target-ref HEAD~1 --yes`
baseline store	Store a snapshot baseline	`skill-version baseline store . case-1 out.txt`
baseline detect	Detect behavioral regressions	`skill-version baseline detect . case-1 out.txt`
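
The baseline store/detect pair can be pictured as "save a known-good output, then diff later outputs against it". The sketch below illustrates that idea only. `store_baseline`, `detect_regression`, and the `.baselines/<case-id>.txt` layout are hypothetical, and treating any textual difference as a regression is an assumption (the real command may compare more loosely).

```python
import difflib
from pathlib import Path
from typing import List

BASELINE_DIR = ".baselines"  # hypothetical storage location inside the skill directory


def store_baseline(root: Path, case_id: str, output: str) -> Path:
    """Save a known-good output as the baseline for one test case."""
    path = root / BASELINE_DIR / f"{case_id}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output, encoding="utf-8")
    return path


def detect_regression(root: Path, case_id: str, output: str) -> List[str]:
    """Return unified-diff lines against the baseline; an empty list means no change."""
    path = root / BASELINE_DIR / f"{case_id}.txt"
    baseline = path.read_text(encoding="utf-8")
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            output.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
    )
```

Returning the diff lines rather than a bare boolean keeps the result actionable: an empty list means the behavior is unchanged, and a non-empty one shows exactly what drifted.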

$ skill-version diff . --old-ref HEAD~3 --new-ref HEAD
Version Diff: 0a1b2c3d... -> e4f5a6b7...
4 file(s) changed:
  modified  src/skill_infra/test_runner/judgers/llm_judge.py    +85  -0
  modified  src/skill_infra/version_aware/cli.py                +148 -0
  modified  pyproject.toml                                      +15  -2
  added     README.md                                           +56  -0

$ skill-version check . --security
Security: clean  ·  Max severity: none

Test it. Score it. Ship it.