behavioral-evals

Author: google-gemini

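A guide to creating, running, fixing, and promoting behavioral evals. Use it to validate agent decision logic, debug failures, debug prompts…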

npx skills add https://github.com/google-gemini/gemini-cli --skill behavioral-evals

Behavioral Evals

Overview

Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.


🔄 Workflow Decision Tree

  1. Does a prompt/tool change need validation?
    • No -> Normal integration tests.
    • Yes -> Continue below.
  2. Is it UI/Interaction heavy?
  3. Is it a new test?
    • Yes -> Set policy to USUALLY_PASSES.
    • No -> ALWAYS_PASSES (locks in regression).
  4. Are you fixing a failure or promoting a test?
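The branches of steps 1 and 3 can be sketched as a small function. This is only an illustration: the type and function names (`ChangeInfo`, `Decision`, `decide`) are hypothetical and not part of the eval framework, and steps 2 and 4 are left out because their outcomes are not spelled out above. Only the policy labels `USUALLY_PASSES` and `ALWAYS_PASSES` come from the text.

```typescript
// Hypothetical types for illustration; only the two policy labels
// are taken from the decision tree above.
type EvalPolicy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface ChangeInfo {
  touchesPromptOrTool: boolean; // step 1: does a prompt/tool change need validation?
  isNewTest: boolean; // step 3: is this a new test?
}

type Decision =
  | { kind: "integration-test" }
  | { kind: "behavioral-eval"; policy: EvalPolicy };

function decide(change: ChangeInfo): Decision {
  // Step 1: no prompt/tool change means normal integration tests suffice.
  if (!change.touchesPromptOrTool) {
    return { kind: "integration-test" };
  }
  // Step 3: new tests start as USUALLY_PASSES; existing ones lock in
  // regressions with ALWAYS_PASSES.
  return {
    kind: "behavioral-eval",
    policy: change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}
```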

📋 Quick Checklist

1. Setup Workspace

Seed the workspace with the files the scenario needs, using the files object, so the agent works against a realistic project (e.g., a Node.js project with a package.json).
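As a rough sketch of what such a seed might look like, the snippet below assumes the files object is a map from workspace-relative path to file contents. That shape, the project name, and the file contents are assumptions made for illustration; check evals/README.md for the actual format.

```typescript
// Assumed shape: path -> file contents. Not confirmed by the source.
const files: Record<string, string> = {
  // A minimal package.json so the workspace looks like a Node.js project.
  "package.json": JSON.stringify(
    { name: "sample-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  // One source file for the agent to read or modify.
  "src/index.js": "export const add = (a, b) => a + b;\n",
};
```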

2. Write Assertions

Audit agent decisions using rig.setBreakpoint() (AppRig only) or index verification on rig.readToolLogs().
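Index verification can be sketched as an ordering check over the log entries. The `ToolLogEntry` shape, the `assertCalledBefore` helper, and the tool names below are hypothetical; the real return type of rig.readToolLogs() may differ, so treat this as one possible approach and not the framework's API.

```typescript
// Hypothetical log entry shape; adapt to what rig.readToolLogs() returns.
interface ToolLogEntry {
  name: string;
  args: Record<string, unknown>;
}

// Throws unless the first call to `first` occurs before the first call to `second`.
function assertCalledBefore(
  logs: ToolLogEntry[],
  first: string,
  second: string,
): void {
  const i = logs.findIndex((entry) => entry.name === first);
  const j = logs.findIndex((entry) => entry.name === second);
  if (i === -1) throw new Error(`expected a call to ${first}`);
  if (j === -1) throw new Error(`expected a call to ${second}`);
  if (i >= j) {
    throw new Error(`expected ${first} (index ${i}) before ${second} (index ${j})`);
  }
}

// Illustrative data: the agent reads a file before writing it.
const sampleLogs: ToolLogEntry[] = [
  { name: "read_file", args: { path: "package.json" } },
  { name: "write_file", args: { path: "package.json" } },
];
assertCalledBefore(sampleLogs, "read_file", "write_file");
```

Comparing indices instead of only checking that a tool was called lets the test audit the order of the agent's decisions, for example that it inspected a file before editing it.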

3. Verify

Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.


📦 Bundled Resources

Detailed procedural guides:

  • creating.md: Assertion strategies, Rig selection, Mock MCPs.
  • fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
  • promoting.md: Candidate identification criteria and threshold guidelines.
