eval-suite-planner by microsoft
Produces a concrete eval suite plan grounded in Microsoft's Eval Scenario Library and MS Learn agent evaluation guidance — scenario types, evaluation methods,…
npx skills add https://github.com/microsoft/eval-guide --skill eval-suite-planner
Purpose
This skill produces the Stage 1 artifact of the /eval-guide lifecycle: a written eval plan that a customer's PM, security partner, or business owner can sign off on. It works without a running agent — a description, idea, or written Vision is enough. The plan defines what the agent SHOULD do; later stages turn it into test cases and run them.
This is the standalone form of /eval-guide Stage 1. Use it when the customer already has an Agent Vision and wants the plan directly, or when a Stage 1 re-plan is needed for a specific feature without re-orienting the whole session. The orchestrator /eval-guide invokes the same methodology with its own dashboard checkpoint.
Knowledge sources:
- Microsoft's evaluation iterative framework and evaluation checklist.
- Eval Scenario Library — quality signals and method-mapping guidance.
- Triage & Improvement Playbook — what makes a criterion testable.
Maturity callout — Pillar 1 (Define what "good" means): Stage 1 advances Pillar 1 from L100 Initial ("good lives in the builder's head") to L300 Systematic ("written acceptance criteria with pass/fail conditions, tied to eval methods"). The eval plan IS the Pillar 1 artifact.
Instructions
When invoked as /eval-suite-planner <agent description>:
- Extract or accept the Agent Vision (purpose, users, knowledge, capabilities, boundaries, success criteria, risk profile).
- Determine eval depth from agent architecture (prompt-level / RAG / agentic) — don't over-test simple agents or under-test complex ones; scope to fit.
- Produce 10–15 acceptance criteria phrased "The agent should…" (or "should NOT…" for negative tests).
- Place each criterion on the Value × Risk matrix (4 quadrants).
- Consolidate quality dimensions to 4–6 broad categories (e.g., "Accuracy", "Grounding", "Boundaries / Safety", "Tone").
- Assign a test method per criterion (Compare meaning / General quality / Keyword match / Capability use / Custom / Text similarity / Exact match).
- Write explicit pass/fail conditions per criterion — testable from the criterion alone.
- Run distribution sanity-check (red flags only).
- Output the customer-ready .docx eval plan.
Do not pad responses. Do not hedge. Be specific to the described agent — no generic advice.
Step 1 — Determine eval depth from architecture
| Architecture | What it is | Eval layers to apply |
|---|---|---|
| Prompt-level | Single-turn LLM call, fixed system prompt, no retrieval, no tools | Acceptance criteria + safety/refusal |
| RAG | Retrieves from knowledge sources before responding | + Grounding + citation accuracy + hallucination prevention |
| Agentic | Routes between topics, tools, or connectors | + Tool/topic routing accuracy + slot extraction + multi-step task completion |
The architecture call drives which capability families apply. Don't write tool-routing tests for a simple FAQ bot.
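As a rough sketch of how this architecture call could be encoded for plan tooling (the layer names below are illustrative shorthand, not a fixed schema from the skill):

```python
# Sketch only: map the architecture call to the eval layers that apply.
# Layer names paraphrase the table above; whether an agentic agent also needs
# the RAG layers depends on whether it actually retrieves.
BASE_LAYERS = ["acceptance_criteria", "safety_refusal"]

EXTRA_LAYERS = {
    "prompt-level": [],
    "rag": ["grounding", "citation_accuracy", "hallucination_prevention"],
    "agentic": ["tool_topic_routing", "slot_extraction", "multi_step_task_completion"],
}

def eval_layers(architecture: str, retrieves: bool = False) -> list[str]:
    """Return the eval layers to apply for the given architecture."""
    layers = BASE_LAYERS + EXTRA_LAYERS[architecture.lower()]
    if architecture.lower() == "agentic" and retrieves:
        layers += EXTRA_LAYERS["rag"]
    return layers
```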
Step 2 — Acceptance criteria on the Value × Risk matrix
Each criterion belongs in one quadrant based on two judgments:
- Value — how much does getting this right drive the agent's mission?
- Risk — how much harm does failure cause (financial, safety, compliance, trust)?
| | Low risk | High risk |
|---|---|---|
| High value | High Value · Low Risk — expected capabilities users rely on. Solid coverage; occasional misses tolerable. | High Value · High Risk — product-defining; failure hurts. Heaviest investment, strictest review. |
| Low value | Low Value · Low Risk — exploratory or rare. Light coverage; revisit if usage grows. | Low Value · High Risk — rarely triggered but must never fail. Safety, compliance, refusals. Zero tolerance. |
The matrix tells you where to invest test-writing effort, not numeric thresholds. High Value · High Risk gets the most cases, Low Value · Low Risk the fewest, Low Value · High Risk the strictest review. Pass/fail per case lives in each criterion's own pass/fail conditions — not a prescribed percentage threshold.
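A minimal sketch of the quadrant labelling, assuming value and risk are recorded as simple high/low judgments (the labels mirror the matrix above):

```python
# Sketch only: turn the two judgments into a quadrant label.
# "value" and "risk" are human calls from the PM / security partner, not computed scores.
def quadrant(value: str, risk: str) -> str:
    labels = {
        ("high", "high"): "High Value · High Risk",  # product-defining; heaviest investment
        ("high", "low"): "High Value · Low Risk",    # expected capabilities; solid coverage
        ("low", "high"): "Low Value · High Risk",    # must-never-fail boundaries
        ("low", "low"): "Low Value · Low Risk",      # exploratory; light coverage
    }
    return labels[(value.lower(), risk.lower())]
```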
Step 3 — Distribution sanity-check (reference, not gate)
Targets vary by risk_profile and are reference patterns, not gates. Only push back on red flags.
| Risk profile | High Value · High Risk | High Value · Low Risk | Low Value · High Risk | Low Value · Low Risk | Sanity-check rule |
|---|---|---|---|---|---|
| low | 30–50% | 30–50% | 10–20% | 0–20% | At least 1 Low Value · High Risk (always). |
| medium | 25–40% | 25–40% | 20–30% | 0–15% | At least 1 Low Value · High Risk. |
| high | 25–40% | 15–30% | 30–50% | 0–10% | At least 2 Low Value · High Risk (auto-doubled trigger). |
| critical | 20–35% | 10–20% | 40–60% | 0–5% | At least 3 Low Value · High Risk. Compliance / Safety domains required. |
Push back only on these red flags:
- 0 Low Value · High Risk on any plan — the agent has no enforced boundaries.
- 0 High Value · High Risk — the plan has no product-defining tests.
- >70% High Value · High Risk — every criterion is "the most important." Anchoring bias; force re-evaluation.
- HIGH-risk profile + <30% Low Value · High Risk — under-investment in failure modes that cause real damage.
- CRITICAL-risk profile + <40% Low Value · High Risk — same, stricter.
Marginal deviations (e.g., High Value · Low Risk at 13% with target 15–30%) are NOT red flags. Do not re-litigate customer-confirmed moves.
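If you want to automate the sanity-check, a sketch like the following captures the red flags above (field names such as `quadrant` are an assumed convention, not part of the skill):

```python
from collections import Counter

# Sketch only: compute quadrant shares and report the red flags listed above.
def red_flags(criteria: list[dict], risk_profile: str) -> list[str]:
    n = len(criteria)
    counts = Counter(c["quadrant"] for c in criteria)
    share = lambda q: counts.get(q, 0) / n if n else 0.0
    flags = []
    if counts.get("Low Value · High Risk", 0) == 0:
        flags.append("No Low Value · High Risk criteria: no enforced boundaries.")
    if counts.get("High Value · High Risk", 0) == 0:
        flags.append("No High Value · High Risk criteria: no product-defining tests.")
    if share("High Value · High Risk") > 0.70:
        flags.append(">70% High Value · High Risk: anchoring bias, force re-evaluation.")
    if risk_profile == "high" and share("Low Value · High Risk") < 0.30:
        flags.append("HIGH-risk profile with <30% Low Value · High Risk.")
    if risk_profile == "critical" and share("Low Value · High Risk") < 0.40:
        flags.append("CRITICAL-risk profile with <40% Low Value · High Risk.")
    return flags
```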
Step 4 — Adversarial coverage minimums (auto-applied)
Every plan needs at least 1 Low Value · High Risk / Red-Teaming criterion. The mandate auto-doubles to 2 minimum when any of these triggers fire:
- Risk profile is HIGH or CRITICAL.
- Agent touches sensitive-data domains: PII, payments, HR, health, legal, regulated content.
- Agent has external-customer surface area.
- Knowledge sources include personal or financial records.
When a trigger fires, narrate it: "Your agent matches the sensitive-data trigger ([reason]) — doubling the adversarial coverage mandate from 1 to 2 minimum. Writing at least two adversarial / red-team criteria targeting your specific boundary risks."
Adversarial gaps are the failure mode that bites in production: the agent passes every High Value · High Risk test and then leaks data on a question no one thought to write a test for.
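A sketch of the auto-double logic, with the trigger inputs reduced to booleans (the parameter names are illustrative):

```python
# Sketch only: return the adversarial-coverage minimum for a plan.
def adversarial_minimum(risk_profile: str,
                        sensitive_data: bool,
                        external_customers: bool,
                        personal_or_financial_knowledge: bool) -> int:
    triggers = [
        risk_profile in ("high", "critical"),
        sensitive_data,                    # PII, payments, HR, health, legal, regulated content
        external_customers,                # external-customer surface area
        personal_or_financial_knowledge,   # knowledge sources with personal/financial records
    ]
    return 2 if any(triggers) else 1
```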
Step 5 — Quality dimensions (consolidate to 4–6)
Group criteria by quality dimension. Default consolidated dimensions:
- Accuracy (covers all knowledge sources — don't fragment into "Policy Accuracy" + "Benefits Accuracy" + "Training Accuracy").
- Grounding (citation, source attribution, hallucination prevention — RAG/agentic only).
- Boundaries / Safety (refusals, compliance, escalation paths).
- Routing / Capability (correct tool/topic invocation — agentic only).
- Tone (when relevant — empathy, brand voice, professionalism).
- Personalization (when role-based access is on).
Customers fragment dimensions when the AI does. "Policy Accuracy / Benefits Accuracy / Training Accuracy" should be one dimension called "Accuracy". The criterion's statement already specifies what knowledge it tests — the dimension shouldn't repeat that. Consolidate aggressively.
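One way to keep the consolidation honest is to maintain an explicit mapping from fragmented names to the consolidated dimension; the entries below are hypothetical examples for an HR-style agent, not part of the skill:

```python
# Sketch only: collapse fragmented dimension names into the consolidated set.
CONSOLIDATION = {
    "Policy Accuracy": "Accuracy",
    "Benefits Accuracy": "Accuracy",
    "Training Accuracy": "Accuracy",
    "Citation Quality": "Grounding",
    "Refusal Handling": "Boundaries / Safety",
}

def consolidate(dimension: str) -> str:
    return CONSOLIDATION.get(dimension, dimension)
```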
Step 6 — Test methods (per criterion)
Pick the method based on what you need to verify, not on familiarity. The signal_type → method mapping:
| Signal type | What you're verifying | Method |
|---|---|---|
| Factual content (specific facts, numbers, IDs) | Response contains the right facts | Compare meaning (paraphrase OK) or Keyword match (exact terms required) |
| Mandatory wording (compliance disclaimers, citations) | Specific phrases must appear | Keyword match |
| Routing / capability | Agent invoked the right tool or topic | Capability use |
| Open-ended quality (tone, helpfulness, completeness) | Subjective rubric, no single right answer | General quality |
| Domain-specific rubric (HR / medical / legal / brand) | Custom labeled judgment | Custom (with a per-criterion rubric) |
| Tight wording (templates, structured replies) | Wording closeness | Text similarity |
| Exact strings (IDs, codes, fixed responses) | Byte-exact match | Exact match |
Reference-free methods (General quality, Capability use, Custom) grade against the criterion's own pass/fail conditions, not against a per-case reference. They don't need an "expected response" per case in Stage 2.
Custom method: when you assign Custom to a criterion, also draft a one-paragraph rubric from the pass/fail conditions, e.g.:
Rate the response Pass / Fail. Pass = [pass_condition]. Fail = [fail_condition]. Output PASS or FAIL with a one-sentence reason.
The rubric belongs on the criterion itself (custom_rubric field) and is what the LLM judge consumes downstream.
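A sketch of the mapping and the rubric template in code form; the dictionary keys are shorthand for the signal types above, and `custom_rubric` simply instantiates the template shown:

```python
# Sketch only: signal type -> candidate methods, plus the Custom-rubric template.
SIGNAL_TO_METHOD = {
    "factual_content": ["Compare meaning", "Keyword match"],  # paraphrase OK vs. exact terms
    "mandatory_wording": ["Keyword match"],
    "routing_capability": ["Capability use"],
    "open_ended_quality": ["General quality"],
    "domain_rubric": ["Custom"],
    "tight_wording": ["Text similarity"],
    "exact_strings": ["Exact match"],
}

def custom_rubric(pass_condition: str, fail_condition: str) -> str:
    """Draft the one-paragraph rubric the LLM judge consumes downstream."""
    return (f"Rate the response Pass / Fail. Pass = {pass_condition}. "
            f"Fail = {fail_condition}. Output PASS or FAIL with a one-sentence reason.")
```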
Step 7 — Pass/fail conditions per criterion
Every criterion gets explicit Pass = and Fail = lines.
- Conditions must be testable from the criterion alone — no implicit context.
- Pass condition names what the response must contain or do.
- Fail condition names what would constitute a failure (often inverse of pass, sometimes additional bad-states).
- For negative tests ("should NOT…"), Pass = "agent correctly refused / redirected"; Fail = "agent disclosed / acted".
Don't prescribe percentage thresholds per criterion. The quadrant tells you where to invest effort; pass/fail per case lives in the conditions. "Critical must pass at 90%" is wrong — pass/fail is per-case, not per-criterion.
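Putting the pieces together, a criterion record might look like the following sketch; the field names follow this document, while the example values are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch only: one criterion with the required fields (see Behavior rules below).
@dataclass
class Criterion:
    statement: str       # "The agent should…" / "…should NOT…"
    quadrant: str        # one of the four Value × Risk quadrants
    method: str          # test method from the signal_type -> method table
    pass_condition: str
    fail_condition: str
    custom_rubric: Optional[str] = None  # required only when method == "Custom"

example = Criterion(
    statement="The agent should NOT disclose individual salary information.",
    quadrant="Low Value · High Risk",
    method="General quality",
    pass_condition="Agent correctly refuses and redirects the employee to HR.",
    fail_condition="Agent discloses, estimates, or speculates about salary figures.",
)
```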
Step 8 — Coverage check against the Vision
Before locking the plan, walk the Agent Vision and confirm coverage:
- Every named capability has ≥ 1 criterion.
- Every named boundary has ≥ 1 criterion (often a Low Value · High Risk).
- Every named knowledge source has ≥ 1 grounding criterion (RAG/agentic only).
- Every named user cohort with role-based access has ≥ 1 personalization criterion.
If a Vision capability has no criterion, surface the gap: "I noticed Capability X has no criterion — add one or mark it out of scope?" Don't slide gaps silently.
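A bookkeeping sketch for the coverage walk, assuming each criterion is tagged with a `covers` list naming the Vision items it tests (that tag is a convention introduced here, not part of the skill):

```python
# Sketch only: list Vision items with no covering criterion.
def coverage_gaps(vision_items: list[str], criteria: list[dict]) -> list[str]:
    covered = {item for c in criteria for item in c.get("covers", [])}
    return [item for item in vision_items if item not in covered]
```

Any item this returns is either missing a criterion or should be explicitly marked out of scope.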
Step 9 — Output: customer-ready .docx eval plan
Use the /docx skill to generate eval-plan-<agent-name>-<YYYY-MM-DD>.docx. The report must be:
- Concise — tables over paragraphs, no filler.
- Presentable — color-coded headers (red / blue / yellow / gray for the four quadrants), clean tables, visual hierarchy.
- Self-contained — a customer who wasn't in the conversation can read it and understand the plan.
Report structure:
- Agent Vision summary (5–6 lines max) — purpose, users, knowledge, capabilities, boundaries, success criteria, risk profile.
- Value × Risk matrix overview — explain the four quadrants and what kinds of criteria belong in each.
- Quadrant assignment — visual 2×2 matrix with each criterion placed, followed by a table listing criteria grouped by quadrant with pass/fail conditions.
- Quality Dimensions to Test — list the 4–6 consolidated dimensions, with grouped criteria under each.
- Method mapping explanation — which methods apply to which criteria and why (reference the signal_type → method table).
- Distribution check — actual percentages vs. risk-profile targets, with red-flag verdict.
- Adversarial coverage — count of Low Value · High Risk / red-team criteria; note auto-double trigger if applied.
- Next steps — "Run /eval-generator on this plan to produce test cases (Stage 2). Then run them against your agent (Stage 3) and triage results with /eval-result-interpreter (Stage 4)."
- Maturity snapshot — before/after table:

| Pillar | Baseline | After this plan | Next-session target |
|---|---|---|---|
| 1 — Define what "good" means | L100 Initial | L300 Systematic ✓ | — |
| 2 — Build your eval sets | L100 Initial | L100 Initial | L300 (run /eval-generator) |
| 4 — Improve and iterate | L100 Initial | L100 Initial | L300 (run /eval-result-interpreter after Stage 3) |
Tell the customer: "Here's your eval plan as a .docx — share it with your team. Business and dev should agree on the quadrant assignments before we generate test cases. The quadrant tells you where to focus effort, not a numeric threshold — pass/fail lives in each criterion's own pass/fail conditions."
Step 10 — 🔍 Human Review checkpoints
Display before ending. The plan is the foundation — mistakes here cascade into bad test cases and wasted effort.
| # | Checkpoint | What to verify |
|---|---|---|
| 1 | Coverage matches the Vision | Every named capability, boundary, knowledge source, and user cohort has ≥ 1 criterion. |
| 2 | Quadrant placements match risk reality | A Low Value · High Risk on a payments agent is not the same as one on an internal FAQ. Sense-check with the security/compliance partner. |
| 3 | Pass/fail conditions are decidable | A human grader (or LLM judge) can read each pass/fail and decide the outcome from the response alone. |
| 4 | Methods match what you're testing | Custom for nuanced rubrics, Keyword match for required phrases, Compare meaning for paraphrasable answers. Wrong method = wrong signal. |
| 5 | Adversarial coverage feels real | Low Value · High Risk criteria target specific boundary risks for this agent (PII for HR, payment-disclosure for billing, etc.) — not generic prompt-injection boilerplate. |
| 6 | Quality dimensions consolidated | 4–6 dimensions, not 12. "Accuracy" should cover multiple knowledge sources, not be split per source. |
Mandatory reminder: "This eval plan was AI-generated from your agent description / Vision. Before proceeding to test case generation with /eval-generator, review the criteria, quadrants, and pass/fail conditions with your team. The plan should reflect your business reality, not best-practice defaults."
Behavior rules
- Every criterion must start with "The agent should…" (or "…should NOT…" for negative tests). Behaviors, not goals.
- Every criterion has all five fields: statement, quadrant, method, pass_condition, fail_condition.
- For criteria with method: "Custom", also draft custom_rubric from the pass/fail.
- 10–15 criteria total. Below 10 means under-coverage; above 15 usually means dimension fragmentation — consolidate.
- At least 1 adversarial / Low Value · High Risk / red-team criterion (2+ if the auto-double trigger fires).
- Don't prescribe percentage pass-thresholds per criterion. Pass/fail per case lives in the conditions.
- If the description is vague, state assumptions explicitly in the Vision summary at the top of the report.
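As a final sketch, the behavior rules above lend themselves to a simple plan lint; field names follow this document, while the function itself is an assumption:

```python
# Sketch only: lint a drafted plan against the behavior rules.
REQUIRED_FIELDS = ("statement", "quadrant", "method", "pass_condition", "fail_condition")

def lint_plan(criteria: list[dict], adversarial_min: int) -> list[str]:
    issues = []
    if not 10 <= len(criteria) <= 15:
        issues.append(f"{len(criteria)} criteria: expected 10–15.")
    for i, c in enumerate(criteria, 1):
        if not c.get("statement", "").lower().startswith("the agent should"):
            issues.append(f"Criterion {i}: statement must start with 'The agent should…'.")
        missing = [f for f in REQUIRED_FIELDS if not c.get(f)]
        if missing:
            issues.append(f"Criterion {i}: missing {', '.join(missing)}.")
        if c.get("method") == "Custom" and not c.get("custom_rubric"):
            issues.append(f"Criterion {i}: Custom method without custom_rubric.")
    adversarial = sum(c.get("quadrant") == "Low Value · High Risk" for c in criteria)
    if adversarial < adversarial_min:
        issues.append(f"Only {adversarial} adversarial criteria: minimum is {adversarial_min}.")
    return issues
```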
Example invocations
/eval-suite-planner I'm building an HR policy bot for a global company with 18 offices. It answers PTO, parental-leave, benefits questions from official HR documents. Should refuse salary-disclosure questions and escalate legal/discrimination concerns.
/eval-suite-planner Customer support agent for refund requests. Polite, follows refund policy, doesn't make promises beyond policy. Risk profile: HIGH (handles financial decisions).
/eval-suite-planner Email triage agent that reads incoming emails and labels them urgent / not-urgent / spam. Must NOT label real customer emails as spam.
/eval-suite-planner I have a Vision doc — purpose: code review for Python PRs, users: dev team, knowledge: PEP 8 + internal style guide, boundaries: no security review, success: PRs land faster with fewer style nits.
Companion skills
- /eval-generator — Stage 2: takes this plan and produces concrete test cases (single CSV per quality signal, 3 columns, one row per case × method).
- /eval-result-interpreter — Stage 4: takes Stage 3 results and produces a triage report (SHIP / ITERATE / BLOCK with root-cause classification).
- /eval-faq — methodology Q&A grounded in Microsoft's eval ecosystem.
- /eval-guide — the orchestrator. Wraps Stages 0–4 with an interactive dashboard checkpoint at each stage.