behavioral-evals by google-gemini

Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt…

npx skills add https://github.com/google-gemini/gemini-cli --skill behavioral-evals

Behavioral Evals

Overview

Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.


🔄 Workflow Decision Tree

  1. Does a prompt/tool change need validation?
    • No -> Normal integration tests.
    • Yes -> Continue below.
  2. Is it UI/Interaction heavy?
  3. Is it a new test?
    • Yes -> Set policy to USUALLY_PASSES.
    • No -> ALWAYS_PASSES (locks in regression).
  4. Are you fixing a failure or promoting a test?
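The branches the tree spells out (steps 1 and 3) can be sketched as a small function. This is only an illustration: the policy names come from the tree above, while `decide`, `Decision`, and the parameter names are hypothetical and not part of the eval framework. Steps 2 and 4 are left out because their branches are not listed here.

```typescript
// Policy names are taken from the decision tree; everything else is hypothetical.
type EvalPolicy = "USUALLY_PASSES" | "ALWAYS_PASSES";

type Decision =
  | { kind: "integration-test" }
  | { kind: "behavioral-eval"; policy: EvalPolicy };

function decide(changeNeedsValidation: boolean, isNewTest: boolean): Decision {
  // Step 1: changes that need no behavioral validation stay in integration tests.
  if (!changeNeedsValidation) {
    return { kind: "integration-test" };
  }
  // Step 3: new tests start as USUALLY_PASSES; existing ones lock in regressions.
  return {
    kind: "behavioral-eval",
    policy: isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}

console.log(decide(true, true));
console.log(decide(false, false));
```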

📋 Quick Checklist

1. Setup Workspace

Seed the workspace with the files the scenario needs, using the files object, so the agent sees a realistic project (e.g., a Node.js project with a package.json).
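A minimal sketch of what such a seed might look like, assuming the files object maps workspace-relative paths to file contents. That shape, the project name, and the file contents are assumptions for illustration; check evals/README.md for the form the eval helpers actually expect.

```typescript
// Hypothetical seed: keys are workspace-relative paths, values are file contents.
const files: Record<string, string> = {
  "package.json": JSON.stringify(
    { name: "sample-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.ts":
    "export const add = (a: number, b: number): number => a + b;\n",
};

console.log(Object.keys(files));
```

Keeping the seed small makes it easier to tell which file influenced the agent's decision when a test fails.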

2. Write Assertions

Audit agent decisions with rig.setBreakpoint() (AppRig only), or by checking the position (index) of tool calls in the output of rig.readToolLogs().
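As a rough sketch of index verification, the helper below checks that one tool call appears before another in a log array. The `ToolLogEntry` shape, the `calledBefore` helper, and the tool names are hypothetical; the actual return type of rig.readToolLogs() is not shown here and may differ.

```typescript
// Hypothetical log entry shape; the real output of rig.readToolLogs() may differ.
interface ToolLogEntry {
  toolName: string;
}

// Returns true when both tools were called and the first call of `first`
// precedes the first call of `second`.
function calledBefore(
  logs: ToolLogEntry[],
  first: string,
  second: string,
): boolean {
  const firstIdx = logs.findIndex((entry) => entry.toolName === first);
  const secondIdx = logs.findIndex((entry) => entry.toolName === second);
  return firstIdx !== -1 && secondIdx !== -1 && firstIdx < secondIdx;
}

// Illustrative data standing in for a real tool log.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file" },
  { toolName: "write_file" },
];

console.log(calledBefore(sampleLogs, "read_file", "write_file"));
```

Asserting on order rather than on exact output keeps the test focused on the decision the agent made, which is what a behavioral eval is meant to check.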

3. Verify

Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.


📦 Bundled Resources

Detailed procedural guides:

  • creating.md: Assertion strategies, Rig selection, Mock MCPs.
  • fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
  • promoting.md: Candidate identification criteria and threshold guidelines.
