agentic-eval

oleh github

Pola evaluasi dan penyempurnaan iteratif untuk meningkatkan keluaran agen AI melalui putaran kritik diri. Menyediakan tiga pola inti: refleksi dasar (putaran kritik diri), evaluator-optimizer (generasi dan evaluasi terpisah), serta penyempurnaan berbasis pengujian khusus kode. Mendukung berbagai strategi evaluasi termasuk penilaian berbasis hasil, perbandingan LLM-sebagai-hakim, dan penilaian berbasis rubrik dengan dimensi berbobot. Mencakup implementasi Python praktis dengan keluaran JSON terstruktur...

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.


Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5

Best Practices

PracticeRationale
Clear criteriaDefine specific, measurable evaluation criteria upfront
Iteration limitsSet max iterations (3-5) to prevent infinite loops
Convergence checkStop if output score isn't improving between iterations
Log historyKeep full trajectory for debugging and analysis
Structured outputUse JSON for reliable parsing of evaluation results

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully

Lebih banyak skill dari github

console-rendering
github
Instruksi untuk menggunakan sistem rendering konsol berbasis tag struct di Go
official
acquire-codebase-knowledge
github
Gunakan keterampilan ini ketika pengguna secara eksplisit meminta untuk memetakan, mendokumentasikan, atau mempelajari basis kode yang sudah ada. Aktifkan untuk perintah seperti "petakan basis kode ini", "dokumentasikan…
official
acreadiness-assess
github
Run the AgentRC readiness assessment on the current repository and produce a static HTML dashboard at reports/index.html. Wraps `npx github:microsoft/agentrc…
official
acreadiness-generate-instructions
github
Menghasilkan file instruksi agen AI yang disesuaikan melalui perintah instruksi AgentRC. Menghasilkan .github/copilot-instructions.md (default, direkomendasikan untuk Copilot di VS…
official
acreadiness-policy
github
Bantu pengguna memilih, menulis, atau menerapkan kebijakan AgentRC. Kebijakan menyesuaikan penilaian kesiapan dengan menonaktifkan pemeriksaan yang tidak relevan, mengganti dampak/tingkat, mengatur…
official
add-educational-comments
github
Tambahkan komentar edukatif ke file kode untuk mengubahnya menjadi sumber belajar yang efektif. Menyesuaikan kedalaman penjelasan dan nada dengan tiga tingkat pengetahuan yang dapat dikonfigurasi: pemula, menengah, dan mahir. Secara otomatis meminta file jika tidak ada yang disediakan, dengan pencocokan daftar bernomor untuk pemilihan cepat. Memperluas file hingga 125% hanya menggunakan komentar edukatif (batas keras: 400 baris baru; 300 untuk file di atas 1.000 baris). Mempertahankan encoding file, gaya indentasi, kebenaran sintaks, dan...
official
adobe-illustrator-scripting
github
Menulis, men-debug, dan mengoptimalkan skrip otomatisasi Adobe Illustrator menggunakan ExtendScript (JavaScript/JSX). Gunakan saat membuat atau memodifikasi skrip yang memanipulasi…
official
agent-governance
github
Kebijakan deklaratif, klasifikasi intensi, dan jejak audit untuk mengontrol akses dan perilaku alat agen AI. Kebijakan tata kelola yang dapat dikomposisikan mendefinisikan alat yang diizinkan/diblokir, filter konten, batas kecepatan, dan persyaratan persetujuan — disimpan sebagai konfigurasi, bukan kode. Klasifikasi intensi semantik mendeteksi perintah berbahaya (eksfiltrasi data, eskalasi hak istimewa, injeksi perintah) sebelum eksekusi alat menggunakan sinyal berbasis pola. Dekorator tata kelola tingkat alat memberlakukan kebijakan pada fungsi...
official