phoenix-evals by github

Build and run evaluators for AI/LLM applications using Phoenix.

```shell
npx skills add https://github.com/github/awesome-copilot --skill phoenix-evals
```

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

## Quick Reference

| Task | Files |
|------|-------|
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |

## Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overview → production-guardrails → production-continuous
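The "Building Evaluator" workflow ends with an LLM judge for nuanced criteria such as faithfulness. A hedged sketch of a custom binary judge template in plain Python follows; the prompt wording and the `render` helper are illustrative assumptions, not Phoenix's built-in templates:

```python
# Illustrative custom template for a binary LLM judge; the exact
# wording and placeholder names are assumptions, not Phoenix defaults.
FAITHFULNESS_TEMPLATE = """\
You are evaluating whether an answer is faithful to the provided context.

Context: {context}
Answer: {answer}

Respond with exactly one word: "pass" if every claim in the answer is
supported by the context, "fail" otherwise."""


def render(template: str, **kwargs: str) -> str:
    """Fill the template's {placeholders}; the rendered prompt would be
    sent to the judge model chosen during model selection."""
    return template.format(**kwargs)
```

Note the binary "pass"/"fail" output: it keeps judge responses easy to parse and to validate against human labels, in line with the Binary > Likert principle below.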

## Reference Categories

| Prefix | Description |
|--------|-------------|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
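The observe-* references cover sampling traces for manual review, which is where error analysis starts. A minimal sketch in plain Python (the function is hypothetical, not the Phoenix sampling API):

```python
import random


def sample_traces(traces: list, k: int = 20, seed: int = 0) -> list:
    """Draw a reproducible random sample of traces for manual review.
    Illustrative helper, not a Phoenix API: a fixed seed makes the
    review set stable across runs so labels can be compared over time."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))
```

A fixed-size random sample keeps the manual labeling workload bounded while still reflecting real observed traffic.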

## Key Principles

| Principle | Action |
|-----------|--------|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
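The "Validate judges" principle (>80% TPR/TNR) can be checked in a few lines of plain Python. A sketch assuming binary "pass"/"fail" labels from both the human annotators and the LLM judge:

```python
def tpr_tnr(human: list, judge: list) -> tuple:
    """True-positive and true-negative rates of an LLM judge measured
    against human labels ("pass"/"fail"). Illustrative helper, not a
    Phoenix API; "pass" is treated as the positive class."""
    tp = sum(h == j == "pass" for h, j in zip(human, judge))
    tn = sum(h == j == "fail" for h, j in zip(human, judge))
    pos = sum(h == "pass" for h in human)
    neg = sum(h == "fail" for h in human)
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr
```

Measuring both rates matters: a judge that labels everything "pass" has a perfect TPR but a TNR of zero, so the >80% bar has to hold on each rate separately.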
