phoenix-evals by github

Build and run evaluators for AI/LLM applications using Phoenix.

npx skills add https://github.com/github/awesome-copilot --skill phoenix-evals

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Quick Reference

| Task | Files |
| --- | --- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
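"Build code evaluator" can often start as a plain function, before any framework wiring. A minimal sketch of a deterministic evaluator — the `label`/`score` return shape here is an illustrative assumption, not the Phoenix evaluator contract:

```python
import json

def evaluate_json_output(output: str) -> dict:
    """Deterministic code evaluator: passes iff the model output is valid JSON.

    The return shape (label/score) is illustrative, not the Phoenix API.
    """
    try:
        json.loads(output)
        return {"label": "pass", "score": 1}
    except json.JSONDecodeError:
        return {"label": "fail", "score": 0}

print(evaluate_json_output('{"answer": 42}'))  # {'label': 'pass', 'score': 1}
print(evaluate_json_output("not json"))        # {'label': 'fail', 'score': 0}
```

Checks like this are cheap, reproducible, and need no judge model — reach for an LLM evaluator only when the criterion genuinely requires nuance.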

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)

Production: production-overview → production-guardrails → production-continuous
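The "Batch evaluate DataFrame" task amounts to applying an evaluator row-wise and keeping one score column per evaluator. A plain-pandas sketch (not the phoenix-evals API; the column names and the citation check are illustrative):

```python
import pandas as pd

def contains_citation(answer: str) -> int:
    # Deterministic check: 1 if the answer includes a source marker, else 0.
    return int("[source]" in answer.lower())

df = pd.DataFrame({
    "question": ["Who wrote Hamlet?", "Capital of France?"],
    "answer": ["Shakespeare [SOURCE]", "Paris"],
})
# One column per evaluator keeps scores easy to aggregate and compare.
df["cites_source"] = df["answer"].apply(contains_citation)
print(df["cites_source"].mean())  # fraction of rows passing
```

Aggregates like the pass rate above are what you track across experiment runs; the per-row scores are what you sample for error analysis.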

Reference Categories

| Prefix | Description |
| --- | --- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |

Key Principles

| Principle | Action |
| --- | --- |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
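"Validate judges: >80% TPR/TNR" means comparing the judge's binary labels against human labels and requiring both the true-positive rate and the true-negative rate to clear the threshold. A stdlib sketch (label strings and the 0.8 threshold are taken from the table above; the function name is illustrative):

```python
def tpr_tnr(human: list[str], judge: list[str]) -> tuple[float, float]:
    """True-positive and true-negative rates of a judge vs. human labels.

    Labels are binary ("pass"/"fail"). Reporting both rates catches judges
    that game plain accuracy by always predicting the majority label.
    """
    tp = sum(h == j == "pass" for h, j in zip(human, judge))
    tn = sum(h == j == "fail" for h, j in zip(human, judge))
    pos = sum(h == "pass" for h in human)
    neg = sum(h == "fail" for h in human)
    return tp / pos, tn / neg

human = ["pass", "pass", "fail", "fail", "pass"]
judge = ["pass", "fail", "fail", "fail", "pass"]
tpr, tnr = tpr_tnr(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # TPR=0.67 TNR=1.00
```

Here the judge misses one human-labeled pass (TPR 0.67 < 0.8), so it would fail validation even though its TNR is perfect — exactly the failure mode a single accuracy number hides.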

