agentic-eval

作者: github

通过自我批评循环迭代评估和优化AI智能体输出的模式。提供三种核心模式:基础反思(自我批评循环)、评估器-优化器(分离生成与评估)以及代码特定的测试驱动优化。支持多种评估策略,包括基于结果的评估、LLM作为评判者的比较,以及带加权维度的基于评分标准的打分。包含实用的Python实现及结构化JSON输出...

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.


Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5

Best Practices

PracticeRationale
Clear criteriaDefine specific, measurable evaluation criteria upfront
Iteration limitsSet max iterations (3-5) to prevent infinite loops
Convergence checkStop if output score isn't improving between iterations
Log historyKeep full trajectory for debugging and analysis
Structured outputUse JSON for reliable parsing of evaluation results

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully

来自 github 的更多技能

console-rendering
github
在Go中使用基于结构体标签的控制台渲染系统的说明
official
acquire-codebase-knowledge
github
当用户明确要求映射、记录或熟悉现有代码库时使用此技能。触发词如“映射此代码库”、“记录…
official
acreadiness-assess
github
Run the AgentRC readiness assessment on the current repository and produce a static HTML dashboard at reports/index.html. Wraps `npx github:microsoft/agentrc…
official
acreadiness-generate-instructions
github
通过AgentRC指令命令生成定制化的AI代理指令文件。生成.github/copilot-instructions.md(默认,推荐用于VS Code中的Copilot…
official
acreadiness-policy
github
帮助用户选择、编写或应用AgentRC策略。策略通过禁用无关检查、覆盖影响/级别、设置…来定制就绪评分。
official
add-educational-comments
github
为代码文件添加教育性注释,将其转化为有效的学习资源。根据三个可配置的知识水平(初级、中级、高级)调整解释深度和语气。若未提供文件,自动请求文件,并附带编号列表以便快速选择。仅通过教育性注释将文件扩展最多125%(硬性限制:新增400行;超过1000行的文件限制为300行)。保留文件编码、缩进风格、语法正确性以及...
official
adobe-illustrator-scripting
github
使用ExtendScript(JavaScript/JSX)编写、调试和优化Adobe Illustrator自动化脚本。在创建或修改操作…的脚本时使用。
official
agent-governance
github
声明式策略、意图分类及审计追踪,用于控制AI代理工具访问与行为。可组合的治理策略定义允许/禁止的工具、内容过滤器、速率限制及审批要求——以配置而非代码形式存储。语义意图分类在执行工具前通过基于模式的信号检测危险提示(数据泄露、权限提升、提示注入)。工具级治理装饰器在函数层面强制执行策略...
official