omni-ai-eval
Evaluate the accuracy of Omni AI query generation by running test prompts through the Omni CLI, comparing the generated query JSON against expected results, and scoring…
npx skills add https://github.com/exploreomni/omni-agent-skills --skill omni-ai-eval
Omni AI Eval
Omni ships a first-class eval system (the AI Hub → Prompt sets and Eval runs). This skill drives it through the Omni CLI: define a reusable prompt set, start a judged eval run against a model or branch, and read per-prompt verdicts from Omni's built-in accuracy judge.
Prefer this native system over building your own harness. The judge scores each answer semantically against the full agent conversation — it does not require golden query JSON, and it evaluates the whole agentic workflow (topic selection, queries, results, and the final written answer), not just generated query structure.
Tip: Use omni-ai-optimizer to improve scores after finding failures, omni-model-builder to apply context changes on a branch before A/B testing, and omni-model-explorer to discover topics and fields when writing prompts.
Prerequisites
# Verify the Omni CLI is installed — if not, ask the user to install it.
# See: https://github.com/exploreomni/cli#readme
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."
# Verify the CLI has the eval commands. If this errors with "unknown command",
# the binary is stale — ask the user to update it (the ai-eval group is generated
# from the bundled API spec).
omni ai-eval --help >/dev/null 2>&1 || echo "ERROR: 'omni ai-eval' missing — update the Omni CLI."
# Show available profiles and select the right one — running against the wrong
# instance silently evaluates the wrong model.
omni config show
# If multiple profiles exist, ask the user which to use:
omni config use <profile-name>
# Confirm the active profile is authenticated and inspect your permissions:
omni whoami whoami
Auth: a profile authenticates with an API key or OAuth. If whoami (or any call) returns 401, hand off: ask the user to run ! omni config login <profile> (OAuth 2.1 browser flow; it blocks ~2 min on the browser). Don't run config login yourself in a headless/CI session (no browser → timeout); on a local interactive machine you may. See the omni-api-conventions rule for profile setup (omni config init --auth oauth) and discovering request-body shapes with --schema.
You also need the model ID of a shared model to evaluate. Evals require at least Querier access on that model, and at least one topic optimized for AI. See the Evals guide for concepts and prompt-set best practices.
How it works
| Concept | What it is |
|---|---|
| Prompt set | A reusable, named list of up to 25 natural-language prompts, scoped to one model. Each prompt may carry an optional expectation — a reference answer the judge scores against. Lives server-side; create one per topic, per release, or for regression coverage. |
| Eval run | Executes a prompt set against a model branch (or main). Each prompt runs as a full async agentic AI job (the same engine as production Blobby), then the accuracy judge scores the result. |
| Accuracy judge | A fixed judge model that reads the evaluated AI's full conversation and returns a pass/fail verdict per prompt, plus confidence and a rationale. It targets high-impact analysis errors (hallucinations, date/time filtering, row-limit handling, mental math, period-over-period mistakes, wrong topic). It does not grade wording or formatting. |
All commands accept -o json (or --compact) to force structured output for parsing, and --profile <name> / --branch-id style global flags. Run omni ai-eval <command> --help for the full flag list. For commands that take a --body (e.g. prompt-sets-create), run them with --schema to print the body's JSON schema and a filled example instead of guessing the JSON.
Step 1 — Build a prompt set
omni ai-eval prompt-sets-create --compact --body '{
"model_id": "your-model-id",
"name": "Orders regression",
"slug": "orders-regression",
"description": "Core revenue + orders coverage",
"prompts": [
{ "prompt_text": "Show me revenue by month" },
{ "prompt_text": "What are the top 5 products by revenue?",
"expectation": "The top product by revenue should be Aniseed Syrup." },
{ "prompt_text": "How many orders were placed last week?" }
]
}'
The response includes the created prompt_set.id (a UUID) — capture it for the run.
| Field | Required | Notes |
|---|---|---|
| model_id | Yes | Shared model UUID the set is bound to |
| name | Yes | ≤ 255 chars |
| slug | Yes | Unique per model_id; must match ^[a-z][a-z0-9-]*$ |
| description | No | ≤ 1024 chars |
| prompts[] | No | ≤ 25; each needs prompt_text (≤ 8000 chars), optional expectation (≤ 16000 chars) |
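The limits in the table can be checked locally before the body is sent. The following is a minimal sketch, assuming the documented constraints are the only ones that matter; validate_prompt_set is a hypothetical helper and not part of the Omni CLI.

```python
import re

# Slug pattern copied from the field table above.
SLUG_RE = re.compile(r"^[a-z][a-z0-9-]*$")


def validate_prompt_set(body: dict) -> list:
    """Return a list of problems; an empty list means the body fits the documented limits."""
    problems = []
    for field in ("model_id", "name", "slug"):
        if not body.get(field):
            problems.append(f"{field} is required")
    if len(body.get("name") or "") > 255:
        problems.append("name exceeds 255 chars")
    slug = body.get("slug") or ""
    if slug and not SLUG_RE.fullmatch(slug):
        problems.append("slug must match ^[a-z][a-z0-9-]*$")
    if len(body.get("description") or "") > 1024:
        problems.append("description exceeds 1024 chars")
    prompts = body.get("prompts") or []
    if len(prompts) > 25:
        problems.append("prompts exceeds 25 entries")
    for i, prompt in enumerate(prompts):
        text = prompt.get("prompt_text") or ""
        if not text:
            problems.append(f"prompts[{i}] needs prompt_text")
        elif len(text) > 8000:
            problems.append(f"prompts[{i}].prompt_text exceeds 8000 chars")
        if len(prompt.get("expectation") or "") > 16000:
            problems.append(f"prompts[{i}].expectation exceeds 16000 chars")
    return problems
```

Slug uniqueness per model_id can only be confirmed server-side, so this check does not cover it.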
Find or update an existing set instead of recreating:
omni ai-eval prompt-sets-list --model-ids your-model-id --compact # discover sets + ids
omni ai-eval prompt-sets-get <promptSetId> --compact # full set with prompts
prompt-sets-update replaces the entire prompts list: omitted prompts are deleted, entries with no id are created, entries with a matching id are updated in place. To add one prompt, send the full desired list (existing prompts carry their id).
omni ai-eval prompt-sets-update <promptSetId> --compact --body '{
"prompts": [
{ "id": "<existing-prompt-id>", "prompt_text": "Show me revenue by month" },
{ "prompt_text": "Revenue by month, last 12 months only" }
]
}'
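Because the update replaces the whole list, it is safer to build the body from the current set than to write it by hand. The sketch below assumes each prompt returned by prompt-sets-get carries id, prompt_text, and an optional expectation (confirm the real shape with --schema); with_added_prompt is a hypothetical helper.

```python
import json

MAX_PROMPTS = 25  # documented per-set limit


def with_added_prompt(existing_prompts, prompt_text, expectation=None):
    """Build a prompt-sets-update body that keeps every existing prompt and adds one."""
    if len(existing_prompts) >= MAX_PROMPTS:
        raise ValueError(f"prompt set already holds {MAX_PROMPTS} prompts")
    prompts = []
    for prompt in existing_prompts:
        # Existing entries keep their id so they are updated in place, not recreated.
        entry = {"id": prompt["id"], "prompt_text": prompt["prompt_text"]}
        if prompt.get("expectation"):
            entry["expectation"] = prompt["expectation"]
        prompts.append(entry)
    # The new entry has no id, so the server creates it.
    new_entry = {"prompt_text": prompt_text}
    if expectation:
        new_entry["expectation"] = expectation
    prompts.append(new_entry)
    return {"prompts": prompts}


existing = [{"id": "<existing-prompt-id>", "prompt_text": "Show me revenue by month"}]
body = with_added_prompt(existing, "Revenue by month, last 12 months only")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be passed to --body; any prompt left out of the existing list is deleted by the update.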
Writing good prompts and expectations
- Mirror real user questions — pull from the AI usage analytics dashboard rather than inventing phrasing that echoes field names.
- Favor breadth over depth — one prompt across many topics yields more signal than ten on one topic.
- Add a regression prompt whenever an answer turns out wrong, so the same failure is caught next time.
- Set an expectation only when a prompt has a known answer worth pinning: a value or ranking ("top product should be Aniseed Syrup"), a direction ("revenue should be up YoY"), or a required breakdown. The judge treats a material divergence (wrong numbers, wrong direction, missing required result) as a failure but ignores wording/formatting differences. With no expectation, the judge decides whether the answer is correct on its own terms. The expectation is shown only to the judge; the evaluated AI never sees it.
Step 2 — Run an eval
# Against main:
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "Baseline on main"
}'
# Against a branch (measures a model-context change before promotion):
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "After adding ai_context to order_items",
"run_config": { "branch_id": "<branchId>" }
}'
The response returns run.id and job_count (one agentic job per prompt). Omit run_config.branch_id to run against the live shared model.
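When the body is built programmatically, the main-versus-branch distinction comes down to whether run_config is present. This is a small sketch under that assumption; run_body is a hypothetical helper, and only the fields shown in the examples above are used.

```python
import json


def run_body(prompt_set_id, description, branch_id=None):
    """Return the JSON string for runs-create; run_config is omitted for a run on main."""
    body = {"prompt_set_id": prompt_set_id, "description": description}
    if branch_id is not None:
        body["run_config"] = {"branch_id": branch_id}
    return json.dumps(body)


print(run_body("<promptSetId>", "Baseline on main"))
print(run_body("<promptSetId>", "After adding ai_context to order_items", "<branchId>"))
```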
Concurrency cap: at most 2 eval runs in flight at once. A 429 means the per-user active-run cap is reached: wait for an in-flight run to finish or cancel one. A 503 means eval is paused for the org. Check runs-list before launching.
Poll for completion
omni ai-eval runs-get <runId> --compact
Poll with backoff (e.g. 5s, 10s, 20s) until the run's status is terminal — COMPLETE or CANCELLED. Track progress with each result's agentic_job.state (QUEUED → EXECUTING → COMPLETE/FAILED). Don't hammer the endpoint.
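The polling loop described above might look like the following sketch. It assumes the runs-get output has the run.status shape shown in this document; the 60-second cap and the 30-poll limit are arbitrary choices, and fetch_run_via_cli, backoff_delays, and wait_for_run are hypothetical helper names.

```python
import json
import subprocess
import time
from typing import Callable, Iterator

TERMINAL_STATUSES = {"COMPLETE", "CANCELLED"}  # terminal run statuses named above


def backoff_delays(start: float = 5.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield 5s, 10s, 20s, ... never exceeding `cap` seconds (the cap is an assumption)."""
    delay = start
    while True:
        yield min(delay, cap)
        delay *= factor


def fetch_run_via_cli(run_id: str) -> dict:
    """Call `omni ai-eval runs-get <runId> --compact` and parse its JSON output."""
    completed = subprocess.run(
        ["omni", "ai-eval", "runs-get", run_id, "--compact"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


def wait_for_run(fetch: Callable[[], dict],
                 sleep: Callable[[float], None] = time.sleep,
                 max_polls: int = 30) -> dict:
    """Poll until run.status is terminal, sleeping with exponential backoff in between."""
    delays = backoff_delays()
    for _ in range(max_polls):
        payload = fetch()
        if payload["run"]["status"] in TERMINAL_STATUSES:
            return payload
        sleep(next(delays))
    raise TimeoutError(f"run not terminal after {max_polls} polls")
```

A typical call would be wait_for_run(lambda: fetch_run_via_cli("<runId>")). Passing fetch and sleep as arguments keeps the loop testable without the CLI.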
Step 3 — Read results
runs-get returns results[], one row per prompt:
{
"run": {
"status": "COMPLETE",
"branch_id": null,
"results": [
{
"prompt": "What are the top 5 products by revenue?",
"score": 1,
"error_reason": null,
"cost": 0.0021,
"scoring_cost": 0.0004,
"timing_ms": 4321,
"agentic_job": { "state": "COMPLETE", "conversation_id": "conv-uuid", "id": "job-uuid" }
}
]
}
}
| Field | Meaning |
|---|---|
| score | Judge verdict for the prompt: pass = 1, fail = 0 |
| error_reason | Set when the underlying agentic job failed |
| cost / scoring_cost | LLM cost (USD) for the answer vs. for judging it |
| timing_ms | Total AI time in ms: all LLM processing and tool calls (matches the "AI time" column in the UI), not wall-clock duration |
| agentic_job.conversation_id | Open this chat to read the judge's full verdict, confidence, and rationale |
Overall accuracy = the pass rate (mean of score across results). Report it with the per-prompt breakdown, and for any failure, point to the conversation_id so the user can read why the judge failed it — that rationale is where the actionable signal lives.
Eval run "Baseline on main" — 9/12 passed (75.0%)
✗ "Revenue by quarter" — judge: summed a row-limited result as a total
✗ "Top products this year" — judge: date filter used calendar instead of fiscal year
✗ "Churn rate by segment" — agentic job FAILED (error_reason)
(open each conversation_id for the full rationale)
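A report like the one above can be derived from the runs-get payload. The sketch below assumes the JSON shape shown earlier and counts failed jobs in the denominator, as the sample report does. It also assumes score is null or absent when a job fails. summarize_run and format_report are hypothetical helpers.

```python
def summarize_run(payload: dict) -> dict:
    """Split results into judge passes, judge failures, and failed jobs."""
    passed, failed, errored = [], [], []
    for result in payload["run"]["results"]:
        if result.get("error_reason"):
            errored.append(result)   # the job never produced an answer to score
        elif result.get("score") == 1:
            passed.append(result)
        else:
            failed.append(result)    # judge verdict of 0
    total = len(passed) + len(failed) + len(errored)
    return {
        "total": total,
        "passed": len(passed),
        "pass_rate": len(passed) / total if total else 0.0,
        "judge_failures": failed,
        "job_errors": errored,
    }


def format_report(description: str, summary: dict) -> str:
    """Render a short text report with a pointer to each failing conversation."""
    lines = [
        f'Eval run "{description}": {summary["passed"]}/{summary["total"]} '
        f'passed ({summary["pass_rate"]:.1%})'
    ]
    for r in summary["judge_failures"]:
        conversation = r["agentic_job"]["conversation_id"]
        lines.append(f'  FAIL "{r["prompt"]}" (judge verdict, see conversation {conversation})')
    for r in summary["job_errors"]:
        lines.append(f'  ERROR "{r["prompt"]}" (agentic job failed: {r["error_reason"]})')
    return "\n".join(lines)
```

The judge's rationale is not in this payload, so the report can only point at the conversation_id where it can be read.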
A/B comparison: branch vs main
The core workflow for measuring whether a model change helps. Run the same prompt set twice — once on main, once on the branch — then compare.
- Create or identify the branch with the change (use omni-model-builder to apply ai_context, fields, joins, ai_settings, etc. on a branch).
- runs-create with no run_config → baseline run on main.
- runs-create with run_config.branch_id → branch run.
- Poll both to COMPLETE, then diff per-prompt verdicts.
A/B: main vs branch/new-context (prompt set: orders-regression)
                 main     branch   Δ
Accuracy:        75.0%    83.3%    +8.3 pp
Prompt credits:  0.024    0.026    +0.002
Regressions (passed on main, failed on branch):
- rev-by-quarter
Improvements (failed on main, passed on branch):
- top-products-this-year
- churn-rate-by-segment
Always check for regressions, not just net improvement. A higher overall pass rate can still hide a prompt that newly broke: an ai_context change that helps most prompts may conflict with one. Call out any prompt that passed on main but fails on the branch.
Notes for an auditable comparison:
- Both runs must use the same prompt set so they evaluate identical prompts (the AI Hub comparison view enforces this too).
- Record the full branch_id used and confirm the branch run's run.branch_id matches it; don't claim the branch was exercised without that.
- Expectations are snapshotted per run, so editing a prompt's expectation later won't change how earlier runs were scored.
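The per-prompt diff and the branch check can be sketched as follows. The sketch assumes the runs-get shape shown earlier and that prompt text is unique within a set, since results are keyed by it. verdicts and compare_runs are hypothetical helpers.

```python
def verdicts(payload: dict) -> dict:
    """Map prompt text to True when the judge passed it (a failed job counts as not passed)."""
    return {
        r["prompt"]: r.get("score") == 1 and not r.get("error_reason")
        for r in payload["run"]["results"]
    }


def compare_runs(main_payload: dict, branch_payload: dict, expected_branch_id: str) -> dict:
    """Diff two runs of the same prompt set and list regressions and improvements."""
    actual = branch_payload["run"].get("branch_id")
    if actual != expected_branch_id:
        raise ValueError(f"branch run reports branch_id={actual!r}, expected {expected_branch_id!r}")
    main, branch = verdicts(main_payload), verdicts(branch_payload)
    if not main or main.keys() != branch.keys():
        raise ValueError("runs must cover the same non-empty prompt set")
    main_acc = sum(main.values()) / len(main)
    branch_acc = sum(branch.values()) / len(branch)
    return {
        "main_accuracy": main_acc,
        "branch_accuracy": branch_acc,
        "delta_pp": (branch_acc - main_acc) * 100,  # percentage points
        "regressions": sorted(p for p in main if main[p] and not branch[p]),
        "improvements": sorted(p for p in main if not main[p] and branch[p]),
    }
```

A non-empty regressions list should be reported even when delta_pp is positive.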
Managing prompt sets and runs
omni ai-eval runs-list --prompt-set-id <promptSetId> --compact # runs for a set, newest first
omni ai-eval runs-cancel <runId> # cancel an in-flight run (also archives it)
omni ai-eval runs-archive <runId> # archive a finished run
omni ai-eval runs-unarchive <runId> # restore an archived run
omni ai-eval prompt-sets-archive <promptSetId> # archive a set (cancels its in-flight jobs)
omni ai-eval prompt-sets-unarchive <promptSetId> # restore an archived set
Archiving is a soft delete — sets and runs are preserved and can be restored. Cancelling a run marks it CANCELLED and archives it; use runs-unarchive to surface it in the default list again. Use --archived true on the *-list commands to see archived items.
Gotchas
- Right profile, right instance: the eval runs against whatever model lives on the active profile's instance. Confirm the profile before creating sets or runs.
- Judge is binary and fixed: a pass means the answer avoided the high-impact analysis errors, not that it's the single best possible answer. The judge model isn't configurable per run.
- Failed job ≠ failed judgment: a result with error_reason set means the agentic job itself failed (it never produced an answer to score); treat that separately from a judge score of 0.
- Snapshotting: to keep runs reproducible, capture the model state alongside results with omni models yaml-get <modelId> --compact and omni models validate <modelId>. Branch runs already pin the change to a branch.
Related skills
- omni-ai-optimizer: improve AI accuracy based on eval findings (ai_context, sample_queries, field metadata)
- omni-model-builder: apply context changes on a branch before A/B testing
- omni-model-explorer: discover topics and fields when writing prompts