eval-triage-and-improvement

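Author: microsoft

When a user's Copilot Studio agent evaluation results come back, use this skill to interpret the scores, diagnose the root causes of poorly performing tests…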

npx skills add https://github.com/microsoft/eval-guide --skill eval-triage-and-improvement

Eval Triage & Improvement

You help users interpret their agent evaluation results and find actionable next steps to improve. Follow the hybrid workflow: gather eval results first, then generate a structured triage report with Step 7 root buckets, owners, and recommended fixes.

This skill is grounded in skills/eval-guide/playbook.md, the canonical Practical Guidance on Agent Evaluation: 10-step playbook. It is the deep-dive for Step 7 — Iterate to Diagnose Failures and seeds Step 9 — Optimization Loop for production feedback. MS Learn pages and the Eval Guidance Kit remain supporting sources for Copilot Studio mechanics, lifecycle cadence, and checklist artifacts.

When to use this skill vs. eval-result-interpreter

These two skills share the same triage framework but serve different modes of work:

| Use eval-triage-and-improvement when… | Use eval-result-interpreter when… |
|----------|----------|
| You want interactive guidance walking through diagnosis step by step | You have a CSV file or concrete results and want a one-shot structured report |
| You are in an ongoing improvement loop: fixing, re-running, and re-triaging | This is your first look at results and you need a verdict and top actions fast |
| You need detailed remediation help for specific eval-set failure patterns (e.g., "wrong tool fires: now what?") | You want a customer-deliverable artifact (the .docx triage report) |
| You have many failures (15+) and need help prioritizing which to investigate | The eval run is relatively straightforward (<20 failures) |
| You need the playbook worked examples and deeper diagnostic walkthroughs | You need the activity map / result comparison tool recommendations inline |

If in doubt: Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.

Workflow

Step 1: Gather Eval Results

Ask the user to share:

  1. Which eval sets ran and their pass rates (e.g., "Faithfulness: 71%, Prompt injection: 95%")
  2. Methodology manifest metadata from the companion .docx report or stage-N-data.json: set_type, category/capability dimension, method, gate (hard/soft), target, regression_class, human-review flag, and source/ground-truth provenance
  3. Specific failing test cases — the test case ID, sample input, expected value, actual agent response, and eval method assigned in Copilot Studio
  4. How many times they've run — is this the first baseline run (Step 6) or a re-run after fixes?
  5. What they've already tried — any eval, agent, knowledge, or tool changes attempted so far?

If they don't have structured results, help them organize what they have. Prefer manifest metadata over inferring from filenames or question text. If they just have a general complaint ("my agent isn't working well"), guide them to run a baseline first using the scenario library and the 10-step playbook's Steps 1-6.
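The manifest metadata listed in item 2 can be checked for gaps before triage starts. The following is a minimal sketch, assuming each eval set's metadata has been loaded into a plain dict. The key names `human_review` and `source_provenance` and the helper `missing_manifest_fields` are hypothetical; the real layout of stage-N-data.json is not specified here.

```python
# Field names follow the metadata list in Step 1. "human_review" and
# "source_provenance" are assumed key names, not a confirmed manifest schema.
REQUIRED_FIELDS = (
    "set_type", "category", "method", "gate",
    "target", "regression_class", "human_review", "source_provenance",
)


def missing_manifest_fields(manifest: dict) -> list:
    """Return the required fields that are absent or empty in one manifest record."""
    return [
        field for field in REQUIRED_FIELDS
        if manifest.get(field) in (None, "")
    ]


# Hypothetical record with three fields not yet filled in.
record = {
    "set_type": "capability",
    "category": "faithfulness",
    "method": "Compare Meaning",
    "gate": "hard",
    "target": 0.9,
}
print(missing_manifest_fields(record))
```

Any field reported as missing is a prompt to ask the user for it rather than infer it from filenames or question text.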

Step 2: Score Interpretation

Assess readiness from the manifest's Step 4 targets and gates:

READINESS ASSESSMENT

Any failed hard gate            → BLOCK (trust & safety hard gates block regardless of aggregate pass rate)
Capability set below hard floor → ITERATE or BLOCK based on risk tier and capability criticality
Soft target missed only         → SHIP WITH KNOWN GAPS / ITERATE (tracked, not blocking)
All hard gates + targets pass   → SHIP
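The readiness rules above can be sketched as a small function. This is an illustration under stated assumptions, not the playbook's implementation: the `EvalSetResult` type, the optional `hard_floor` field, and the choice to treat "pass rate below target" as a miss are all hypothetical. The targets in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalSetResult:
    name: str
    set_type: str                       # "capability" or "trust_safety"
    pass_rate: float                    # observed pass rate, 0.0 to 1.0
    target: float                       # manifest target, 0.0 to 1.0
    gate: str                           # "hard" or "soft"
    hard_floor: Optional[float] = None  # assumed optional floor for capability sets


def assess_readiness(results: List[EvalSetResult]) -> str:
    # Rule 1: any failed hard gate blocks, regardless of aggregate pass rate.
    if any(r.gate == "hard" and r.pass_rate < r.target for r in results):
        return "BLOCK"
    # Rule 2: a capability set under its hard floor. The final call depends on
    # risk tier and criticality, so both options are returned for a human to decide.
    if any(
        r.set_type == "capability"
        and r.hard_floor is not None
        and r.pass_rate < r.hard_floor
        for r in results
    ):
        return "ITERATE or BLOCK"
    # Rule 3: only soft targets were missed; tracked, not blocking.
    if any(r.pass_rate < r.target for r in results):
        return "SHIP WITH KNOWN GAPS / ITERATE"
    # Rule 4: all hard gates and targets pass.
    return "SHIP"


results = [
    EvalSetResult("Faithfulness", "capability", 0.71, 0.85, "soft"),
    EvalSetResult("Prompt injection", "trust_safety", 0.95, 1.0, "hard"),
]
print(assess_readiness(results))
```

Checking hard gates first mirrors the rule that a failed hard gate blocks no matter how the other sets score.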

Setting thresholds — don't apply fixed numbers. Use the manifest target first; if missing, derive a provisional target from the agent-level risk tier and flag that the manifest needs updating.

| Factor | Higher Threshold When... |
|----------|----------|
| Criticality of error | Financial loss, safety risk, legal exposure |
| Reach | External customers or large internal population |
| Autonomy / blast radius | Agent can take actions or trigger downstream systems |
| Regulatory exposure | Regulated workflow, audit requirement, compliance obligation |
| Data sensitivity | PII, PHI, confidential, or tenant-sensitive data |

Step 3: Pre-Triage Infrastructure Check

Before diagnosing individual failures, verify infrastructure was healthy during the eval run:

  • All knowledge sources accessible and fully indexed?
  • API backends and connectors returned no errors/timeouts?
  • Authentication tokens valid throughout the run?
  • Correct agent version was published and evaluated?

If any dependency was unhealthy, recommend re-running after fixing infrastructure before triaging.

Step 4: Prioritize Failures

If the user has many failures, recommend this triage order:

| Priority | Triage First | Rationale |
|----------|----------|----------|
| 1 | Failed hard gates, especially trust & safety sets | Highest consequence; blocks deploy regardless of aggregate score |
| 2 | High-risk capability failures (accuracy, faithfulness, tool use) | Direct impact on agent value; hallucination is a faithfulness/capability failure |
| 3 | Lowest-scoring eval set failures | Likely systemic; fixing one pattern resolves multiple |
| 4 | Recurring failures across baseline/re-runs | Most diagnosable and regression-prone |
| 5 | Soft-target misses | Important but non-blocking unless pattern worsens |

15+ failures? Don't triage every one. Review 3-5 from the lowest-scoring eval set. If they share a root bucket/subtype pattern, fix that and re-run.
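The triage order and the 15+ sampling rule can be sketched as follows. The `Failure` type, its boolean flags, and the function names are hypothetical; how a failure is flagged as high-risk or recurring is left to the person doing the triage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    test_case_id: str
    eval_set: str
    hard_gate: bool = False
    high_risk_capability: bool = False
    recurring: bool = False


def triage_priority(failure: Failure, lowest_scoring_set: str) -> int:
    """Map one failure to priority 1 (triage first) through 5 (triage last)."""
    if failure.hard_gate:
        return 1
    if failure.high_risk_capability:
        return 2
    if failure.eval_set == lowest_scoring_set:
        return 3
    if failure.recurring:
        return 4
    return 5  # soft-target miss


def triage_queue(failures: List[Failure], lowest_scoring_set: str,
                 sample_size: int = 5) -> List[Failure]:
    """Order failures for review; with 15+ failures, sample the lowest-scoring set."""
    if len(failures) >= 15:
        sample = [f for f in failures if f.eval_set == lowest_scoring_set]
        return sample[:sample_size]
    return sorted(failures, key=lambda f: triage_priority(f, lowest_scoring_set))


failures = [
    Failure("A", "Tone"),
    Failure("B", "Faithfulness", recurring=True),
    Failure("C", "Safety", hard_gate=True),
    Failure("D", "Faithfulness", high_risk_capability=True),
]
print([f.test_case_id for f in triage_queue(failures, "Faithfulness")])
```

If the sampled failures share a root bucket and subtype, fix that pattern and re-run before reviewing the rest.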

Step 5: Classify Root Cause

For each failure, work through the diagnostic questions in order. Every failure must end in exactly one Step 7 root bucket: eval-setup problem (response is acceptable; fix the eval) or agent-quality problem (real issue; fix the agent/platform and log the pattern).

TRIAGE DECISION TREE (for each failing test case)

1. Is the agent's response actually acceptable, even though it failed?
   → YES = Eval-setup problem (grader, expected value, rubric, or method is wrong)

2. Is the expected answer still current against the actual source/ground truth in the manifest?
   → NO = Eval-setup problem (expected answer outdated or source dependency drifted)

3. Does the test case represent a realistic user input for this eval set's `set_type` and category?
   → NO = Eval-setup problem (unrealistic or mis-scoped test case)

4. Could a valid alternative response also be correct, but the grader rejects it?
   → YES = Eval-setup problem (rubric/grader too rigid)

5. Is the eval method appropriate for what you're testing?
   → NO = Eval-setup problem (wrong method; update the manifest and Copilot Studio row assignment)

No eval-setup problem found in questions 1-5 → The eval is valid. Classify as **agent-quality problem** and proceed to operational subtype diagnosis:

6. Does the issue come from prompt/topic/tool/retrieval configuration or stale knowledge?
   → YES = Agent Configuration / Knowledge Issue (agent-quality problem)

7. Does the behavior persist after reasonable config and knowledge fixes plus re-run?
   → YES = Platform Limitation (agent-quality problem; log evidence and workaround)
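The decision tree can be expressed as an ordered series of checks. This is a sketch only: the `TriageAnswers` type and its field names are hypothetical, each answer still comes from a human reading the actual response, and the final fallback label for a failure that matches neither question 6 nor question 7 is an assumption the tree does not define.

```python
from dataclasses import dataclass
from typing import Tuple

EVAL_SETUP = "Eval-setup problem"
AGENT_QUALITY = "Agent-quality problem"


@dataclass
class TriageAnswers:
    response_acceptable: bool = False         # Q1
    expected_answer_current: bool = True      # Q2
    realistic_input: bool = True              # Q3
    valid_alternative_rejected: bool = False  # Q4
    method_appropriate: bool = True           # Q5
    config_or_knowledge_cause: bool = False   # Q6
    persists_after_fixes: bool = False        # Q7


def classify(a: TriageAnswers) -> Tuple[str, str]:
    """Return (root bucket, detail) by asking the questions in order."""
    if a.response_acceptable:
        return EVAL_SETUP, "grader, expected value, rubric, or method is wrong"
    if not a.expected_answer_current:
        return EVAL_SETUP, "expected answer outdated or source dependency drifted"
    if not a.realistic_input:
        return EVAL_SETUP, "unrealistic or mis-scoped test case"
    if a.valid_alternative_rejected:
        return EVAL_SETUP, "rubric/grader too rigid"
    if not a.method_appropriate:
        return EVAL_SETUP, "wrong method"
    # Questions 1-5 found nothing wrong with the eval, so the failure is real.
    if a.config_or_knowledge_cause:
        return AGENT_QUALITY, "Agent Configuration / Knowledge Issue"
    if a.persists_after_fixes:
        return AGENT_QUALITY, "Platform Limitation"
    # Assumed fallback: the tree does not name this case.
    return AGENT_QUALITY, "subtype undetermined: gather more evidence"


# A stale expected value with a correct agent response stops at question 1.
print(classify(TriageAnswers(response_acceptable=True)))
```

Because the checks run in order, each failure ends in exactly one root bucket.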

Step 5b: Conversation (Multi-Turn) Triage

For conversation eval failures, the standard decision tree still applies but you must first identify the critical turn — the earliest turn where the agent went wrong. Everything after a bad turn is a cascade, not independent failures.

Critical turn identification:

  1. Walk the conversation turn by turn
  2. Find the first turn where the agent response diverges from expected behavior
  3. Classify that turn using the decision tree above
  4. Mark downstream turns as "cascade — blocked by Turn N fix"
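The four steps above can be sketched as follows, assuming each turn has already been judged as matching or diverging from expected behavior. The function names and label strings are hypothetical, and leaving later turns that still pass labeled "ok" is an assumption.

```python
from typing import List, Optional


def find_critical_turn(turn_ok: List[bool]) -> Optional[int]:
    """Return the 1-based number of the earliest diverging turn, or None."""
    for number, ok in enumerate(turn_ok, start=1):
        if not ok:
            return number
    return None


def label_turns(turn_ok: List[bool]) -> List[str]:
    """Label each turn as ok, critical, or a cascade of the critical turn."""
    critical = find_critical_turn(turn_ok)
    labels = []
    for number, ok in enumerate(turn_ok, start=1):
        if critical is not None and number == critical:
            labels.append("critical: classify with the decision tree")
        elif critical is not None and number > critical and not ok:
            labels.append(f"cascade: blocked by Turn {critical} fix")
        else:
            labels.append("ok")
    return labels


# Turn 3 is the first divergence; Turn 4 fails downstream of it.
for label in label_turns([True, True, False, False, True]):
    print(label)
```

Only the critical turn goes through the decision tree; cascade turns wait for the re-run.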

Conversation-specific failure patterns and remediations:

| Pattern | How to spot it | Root cause area | Remediation |
|----------|----------|----------|----------|
| Context loss: Turn 1 fine, Turn 3+ forgets | Agent re-asks or contradicts earlier turns | Agent Config | Review topic management; ensure conversation context is preserved across topic switches |
| State loop: Agent repeats the same response | Identical or near-identical agent turns in sequence | Agent Config | Check topic routing for circular references; add explicit exit conditions |
| Clarification failure: Agent can't handle follow-ups | Turn 2 fails when user provides clarification or correction | Agent Config | Add follow-up handling instructions; check that topics accept partial/corrective inputs |
| Last-mile failure: Understands but can't resolve | Early turns diagnose correctly, final resolution turn fails | Agent Config or Platform | Check action/connector configuration; verify the resolution path is wired correctly |
| Eval rigidity: Conversation is acceptable but grader rejects | Reading the full conversation, the outcome is reasonable | Eval Setup | Conversation grading is limited (AI Generated or Approval Rating only); adjust rubric or expected values |

Key difference from single-response triage: Do NOT triage each turn independently. Triage the critical turn, apply the fix, re-run, and then see which downstream turns self-resolve. Expect 40-60% of downstream failures to clear after fixing the critical turn.

Two Root Buckets + Operational Subtypes

| Step 7 root bucket | Operational subtype | Who acts | What it means |
|----------|----------|----------|----------|
| Eval-setup problem | Eval setup issue | Eval author | The response is acceptable or the eval metadata/rubric/expected answer/method is wrong. Fix the eval and manifest. |
| Agent-quality problem | Agent configuration issue | Agent builder | The agent genuinely produced a bad response. Fix prompt, topics, tools, retrieval, grounding, or knowledge. |
| Agent-quality problem | Platform limitation | Platform team + agent owner | The eval caught a real issue caused by platform behavior. Log evidence, workaround if possible, and track the pattern. |

Maintain a failure-pattern log for every agent-quality problem: test case, set_type/category, root bucket, subtype, suspected pattern, owner, fix location, verification eval set, and whether it should become a regression case (Step 8).
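One way to keep the failure-pattern log consistent is a fixed record type that serializes to CSV. This is a sketch: the `FailurePattern` type and `to_csv` helper are hypothetical, the columns mirror the fields listed above, and the example entry (including the test case ID) is invented for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class FailurePattern:
    test_case: str
    set_type: str
    category: str
    root_bucket: str
    subtype: str
    suspected_pattern: str
    owner: str
    fix_location: str
    verification_eval_set: str
    add_to_regression: bool  # should this become a Step 8 regression case?


def to_csv(entries: List[FailurePattern]) -> str:
    """Serialize log entries to CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(FailurePattern)]
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


# Invented example entry for an agent-quality problem.
entry = FailurePattern(
    test_case="EXAMPLE-001",
    set_type="capability",
    category="tool use",
    root_bucket="Agent-quality problem",
    subtype="Agent configuration issue",
    suspected_pattern="overlapping tool descriptions",
    owner="Agent builder",
    fix_location="tool descriptions",
    verification_eval_set="Tool use",
    add_to_regression=True,
)
print(to_csv([entry]))
```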

Step 6: Map to Remediation

For detailed remediation steps by Step 7 root bucket, operational subtype, and eval-set failure pattern, read the playbook files:

  • Full triage decision tree: Read triage-and-improvement-playbook/triage-decision-tree.md
  • Remediation mapping: Read triage-and-improvement-playbook/remediation-mapping.md
  • Pattern analysis: Read triage-and-improvement-playbook/pattern-analysis.md
  • Worked examples: Read triage-and-improvement-playbook/worked-examples.md

Quick Remediation Reference

Eval-setup fixes:

| Sub-Type | Fix |
|----------|----------|
| Outdated expected answer | Update expected value to match current source content |
| Overly rigid grader | Switch to Compare Meaning, or broaden keyword set |
| Unrealistic test case | Rewrite input using actual user language |
| Wrong eval method | Change method to match the eval-set purpose and evidence type |
| Grader error/bias | Review rubric, add examples, consider deterministic method |

Agent-quality fixes — agent configuration / knowledge / tools:

| Failure pattern | Common Fix |
|----------|----------|
| Factual accuracy (wrong source) | Review knowledge source config, verify indexing, check vocabulary match |
| Factual accuracy (wrong extraction) | Add extraction guidance to system prompt |
| Hallucination (faithfulness capability failure) | Improve retrieval/chunking first; add instruction: "Only answer from knowledge sources. If unavailable, say so." |
| Wrong tool fires | Rewrite tool descriptions to differentiate; add negative examples |
| Tool doesn't fire | Review trigger conditions; check if tool is enabled and accessible |
| Wrong topic fires | Review trigger phrase overlap; adjust priority ordering |
| Lacks empathy | Add context-specific tone instructions to system prompt |
| Scope violation | Add explicit out-of-scope instruction |
| PII leakage | Add PII protection instruction; review authentication scope |

Agent-quality fixes — platform limitation response:

  • Document the limitation with evidence
  • Implement workaround where possible
  • Adjust eval thresholds to account for known platform behavior
  • File with platform team with reproduction steps

Step 7: Triage Rationale (teach the WHY)

Before generating the report, add rationale that teaches the customer the reasoning behind triage decisions — not just the conclusions. For each of these, use the actual eval data from this triage:

  1. Why each failure got its root bucket and subtype — Walk through the decision tree for at least one example per Step 7 root bucket. E.g., "Test case KB-014 was classified as an Eval Setup Issue because the agent response is factually correct per the current knowledge source, but the expected value still references the old 14-day policy. The agent is right; the eval is stale."

  2. Why the remediation targets config vs. content vs. eval — Explain the logic: "We recommended updating the knowledge source rather than changing the prompt because the agent retrieval worked correctly — it found the right document — but the document itself contains outdated information. A prompt change would mask the real problem."

  3. Why the priority order is what it is — Connect to blast radius and dependency chains: "Failed hard gates come first because they block deploy and can change downstream behavior. Fix the hard-gate issue, re-run the regression suite, then triage the rest — otherwise you may diagnose failures that disappear once the blocking guardrail is corrected."

  4. What this triage does NOT tell you — Name the limits explicitly: "This triage analyzed [N] failures from a single eval run. It cannot detect issues in scenarios you have not written test cases for, and it cannot distinguish between a flaky failure (non-determinism) and a real failure from a single data point. If a failure is borderline, re-run before investing in a fix."

Include this rationale in the triage report (see Triage Rationale section in the report template below).

Step 8: Generate Triage Report

Output a structured triage report:

# Triage Report: [Agent Name] — [Date]

## Score Summary
| Eval Set | Set Type / Category | Pass Rate | Target | Gate | Status |
|----------|---------------------|-----------|--------|------|--------|
| ... | capability / faithfulness | ... | ... | hard/soft | PASS/BLOCK/ITERATE |

## Readiness Assessment
[SHIP / SHIP WITH KNOWN GAPS / ITERATE / BLOCK]
[Rationale]

## Failure Analysis
### Failure 1: [Test Case ID]
- **Set Type / Category:** capability or trust_safety / ...
- **Eval-set focus:** ...
- **Sample Input:** ...
- **Expected:** ...
- **Actual:** ...
- **Root Bucket:** [Eval-setup problem / Agent-quality problem]
- **Operational Subtype:** [Eval Setup / Agent Config / Knowledge / Platform Limitation]
- **Diagnosis:** [specific diagnosis]
- **Owner:** [who needs to act]
- **Remediation:** [specific action]
- **Verification:** [how to verify the fix worked]

[Repeat for each triaged failure]

## Triage Rationale
### Why these root bucket classifications
[Walk through the decision tree for representative examples — show the reasoning, not just the label]

### Why these remediations
[Explain the logic connecting root bucket/subtype to fix — why this fix and not an alternative]

### Why this priority order
[Connect priority to blast radius and dependency chains]

### What this triage does NOT tell you
[Name the limits: coverage gaps, single-run non-determinism, untested scenarios]

## Failure-Pattern Log
[Summarize recurring Step 7 patterns, owner, fix location, and whether each pattern should be added to the Step 8 regression suite]

## Systemic Patterns
[If 80%+ of failures share a root bucket/subtype/category, call it out]

## Action Items
| # | Action | Owner | Priority | Verification |
|---|--------|-------|----------|-------------|
| 1 | ... | ... | ... | Re-run [eval set] |

## Post-Triage Checklist
- [ ] All failed hard gates addressed before deploy
- [ ] Root buckets verified by reading actual responses
- [ ] Eval-setup fixes applied to expected answers/rubrics/method assignments/manifest
- [ ] Agent-quality patterns logged with owners and fix location
- [ ] Full Step 8 regression suite re-run after fixes
- [ ] Platform limitations filed if applicable

## Human Review Required
[Include human review checkpoints table — see Human Review Checkpoints section below]

Post-Triage Verification

After fixes are applied:

  • Scores flat after fix? → Wrong root bucket/subtype, re-triage
  • One score up, another down? → Instruction conflict — the fix improved one behavior but degraded another
  • 80%+ of failures share a root bucket/subtype? → Systemic issue — fix the category, not individual test cases
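These three checks can be sketched in code. The function names are hypothetical, pass rates are assumed to be in percentage points, the `tolerance` that separates "flat" from a real change is an assumed value, and the "improved" and "regressed" outcomes are added only to make the function total.

```python
from collections import Counter
from typing import Dict, List, Optional


def post_fix_signal(before: Dict[str, float], after: Dict[str, float],
                    tolerance: float = 1.0) -> str:
    """Compare per-eval-set pass rates (percentage points) before and after a fix."""
    deltas = [after[name] - before[name] for name in before]
    went_up = any(d > tolerance for d in deltas)
    went_down = any(d < -tolerance for d in deltas)
    if went_up and went_down:
        return "instruction conflict"
    if not went_up and not went_down:
        return "flat: re-triage"
    return "improved" if went_up else "regressed"


def systemic_pattern(labels: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the dominant root bucket/subtype label if it covers 80%+ of failures."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else None


print(post_fix_signal({"Faithfulness": 71, "Tone": 90},
                      {"Faithfulness": 82, "Tone": 84}))
print(systemic_pattern(["Agent Config"] * 8 + ["Eval Setup"] * 2))
```

A systemic label is only a hypothesis until the actual responses have been read.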

Non-Determinism Handling

LLM-based agents and graders produce variable outputs:

  • Establish baselines: Run 3+ times before treating any score as the Step 6 baseline. Use the average and record agent version + timestamp.
  • Normal variance: +/-5% between runs is expected. Investigate if >10%.
  • Flaky test cases (pass sometimes, fail others): Agent may produce two valid responses but eval is too rigid. Investigate whether to broaden the expected value.
  • Small eval sets (<30 test cases): A single test case flip changes the score by 3%+. Don't over-interpret.
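The arithmetic behind these rules can be sketched as follows. The function names are hypothetical. Pass rates are in percentage points, "+/-5%" is read as percentage points (an assumption), and the "borderline" label for the 5-10 point range is an assumption, since only the two ends of that range are specified above.

```python
from statistics import mean
from typing import List


def baseline(pass_rates: List[float]) -> float:
    """Average pass rate over 3+ runs; refuses to call fewer runs a baseline."""
    if len(pass_rates) < 3:
        raise ValueError("run the eval at least 3 times before setting a baseline")
    return mean(pass_rates)


def variance_status(baseline_rate: float, new_rate: float) -> str:
    """Classify the gap between a new run and the baseline, in percentage points."""
    delta = abs(new_rate - baseline_rate)
    if delta <= 5:
        return "normal"
    if delta <= 10:
        return "borderline"  # assumed label; re-run before deciding
    return "investigate"


def single_flip_impact(test_case_count: int) -> float:
    """Percentage points the score moves when one test case flips."""
    return 100 / test_case_count


base = baseline([70.0, 72.0, 74.0])
print(base, variance_status(base, 85.0), single_flip_impact(25))
```

With 25 test cases one flip moves the score by 4 points, which is within normal run-to-run variance, so a single flip on a small set says little.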

Step 9 Optimization Loop: Production Signals

If the agent is deployed (even in preview), treat production feedback as the Step 9 loop: collect signals → cluster → decide fix location → ship → re-evaluate against the Step 8 regression suite. Prioritize signals in this order: thumbs-down (highest-signal negative feedback), escalations, manual overrides, support tickets, then qualitative comments.

  • High thumbs-down on a topic where eval passes: Coverage gap. Add or revise eval cases and tag them for regression if they represent recurring production risk.
  • Thumbs-down clustering after a config change: Possible regression. Re-run the Step 8 regression suite and add a case if the suite missed it.
  • Escalations/manual overrides: Agent-quality problem until proven otherwise; cluster by capability or trust & safety category and choose the fix location: agent config/retrieval/tools, rubric/expected answer, or new eval coverage.
  • Steady thumbs-up on a topic where eval fails: Possible eval-setup problem. Review the actual responses before weakening a gate.

Production signals are not verdicts by themselves. They seed hypotheses, new eval cases, and failure-pattern log entries; the regression suite verifies the fix.
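The signal priority order can be applied to clustered feedback with a simple sort. This is a sketch: the signal type strings, the `(signal_type, topic, count)` tuple layout, and the tie-break by count are assumptions, and the topics shown are invented.

```python
from typing import List, Tuple

# Order taken from the prioritization above; the identifiers are assumed names.
SIGNAL_PRIORITY = [
    "thumbs_down", "escalation", "manual_override", "support_ticket", "comment",
]

Cluster = Tuple[str, str, int]  # (signal_type, topic, count)


def rank_clusters(clusters: List[Cluster]) -> List[Cluster]:
    """Sort clusters by signal priority, then by count (largest first)."""
    return sorted(
        clusters,
        key=lambda c: (SIGNAL_PRIORITY.index(c[0]), -c[2]),
    )


clusters = [
    ("comment", "billing", 40),
    ("escalation", "refunds", 3),
    ("thumbs_down", "refunds", 12),
    ("thumbs_down", "billing", 30),
]
for cluster in rank_clusters(clusters):
    print(cluster)
```

The ranking only decides what to look at first; each cluster still needs a fix location and a regression re-run.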

Human Review Checkpoints

Before acting on the triage report, review these checkpoints. Triage decisions directly drive agent changes — a wrong diagnosis wastes an entire iteration cycle.

| # | Checkpoint | Why it matters |
|---|----------|----------|
| 1 | Verify Step 7 root buckets yourself: For each failure classified as an eval-setup problem, read the agent's actual response. Is it truly acceptable, or is the triage giving the agent the benefit of the doubt? | Misclassifying agent-quality problems as eval setup means real problems get ignored. The two-bucket distinction requires judgment, not score-only automation. |
| 2 | Confirm systemic pattern diagnoses before applying systemic fixes: If the report says 80%+ of failures share a root bucket/subtype, verify by reading the actual responses. Similar symptoms can have different causes. | A wrong systemic diagnosis means you apply one fix expecting to resolve many failures, but only fix some or none. |
| 3 | Validate remediation feasibility and priority order: Can your team actually make the suggested changes? Is the priority order right for your timeline and constraints? | The triage prioritizes by impact, but your team knows effort and dependencies. A knowledge source fix may take 2 weeks; a prompt tweak may unblock you now. |
| 4 | Check that proposed fixes will not regress passing scenarios: Before making changes, consider which currently-passing test cases could be affected. Prompt changes especially have ripple effects. | Fixing 3 failures while introducing 5 new ones is a net loss. Plan to re-run the full suite after any agent configuration change. |
| 5 | Validate platform limitation classifications before escalating: If a failure is classified as a platform limitation, confirm the behavior persists across multiple prompt and config variations before filing with the platform team. | Escalating a configuration issue as a platform bug wastes platform team time and delays your actual fix. |
| 6 | Review manifest targets and gates against your actual risk tier: Does SHIP/ITERATE/BLOCK honor hard gates, soft targets, and the five risk factors? | Only your team knows your real risk tolerance. A soft-target miss may be acceptable for an internal helper, but a failed hard trust & safety gate blocks deploy. |

Include this table in the triage report output. Add: This triage report accelerates diagnosis but does not replace human judgment. Review checkpoints 1 and 2 before acting on any remediation — the distinction between eval-setup problems and agent-quality problems requires reading the actual responses.

Data Retention Warning

Copilot Studio deletes test run results after 89 days. This means your baseline results from an initial eval may be gone before your next quarterly review. After every triage cycle:

  1. Export the results CSV immediately (Test set → Export results)
  2. Store alongside your triage report in SharePoint, a repo, or wherever your team keeps versioned artifacts
  3. Tag with agent version and date so future comparisons are possible

If your triage identified a fix-and-rerun cycle, export the pre-fix results before applying changes. You need the before/after comparison, and Copilot Studio won't keep the "before" forever.
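Once the results CSV has been exported by hand, tagging and storing it can be scripted. This is a minimal sketch assuming a local or synced archive folder; the function names and the `date_version_name` filename pattern are hypothetical, and it does not call any Copilot Studio API.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_name(original_name: str, agent_version: str, run_date: date) -> str:
    """Build a filename tagged with run date and agent version."""
    return f"{run_date.isoformat()}_{agent_version}_{original_name}"


def archive_results(csv_path: Path, archive_dir: Path,
                    agent_version: str, run_date: date) -> Path:
    """Copy an exported results CSV into the archive under a tagged name."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    destination = archive_dir / archive_name(csv_path.name, agent_version, run_date)
    shutil.copy2(csv_path, destination)  # keeps the original file in place
    return destination


print(archive_name("results.csv", "v1.4", date(2025, 1, 15)))
```

Putting the date first keeps archived runs in chronological order when the folder is sorted by name, which makes before/after pairs easy to find.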

Cross-Reference

This skill uses skills/eval-guide/playbook.md as the methodology spine. It also works alongside the AI Agent Evaluation Scenario Library (github.com/microsoft/ai-agent-eval-scenario-library), which defines supporting scenario patterns and quality dimensions, and the Triage & Improvement Playbook (github.com/microsoft/triage-and-improvement-playbook), which provides supporting diagnostic frameworks for Step 7.

Related eval skills

| After triage, if you need to... | Use this skill |
|----------|----------|
| Build or expand the eval plan with new scenarios identified during triage | /eval-suite-planner |
| Generate new test cases for expanded or revised scenarios | /eval-generator |
| Get a quick structured report from a new CSV (without interactive triage) | /eval-result-interpreter |
| Answer a methodology question that came up during triage | /eval-faq |
| Walk the customer through the full eval pipeline end-to-end | /eval-guide |
