omni-ai-eval
Evaluate the accuracy of Omni AI query generation by running test prompts through the Omni CLI, comparing the generated query JSON against the expected results, and scoring…
npx skills add https://github.com/exploreomni/omni-agent-skills --skill omni-ai-eval
Omni AI Eval
Omni ships a first-class eval system (the AI Hub → Prompt sets and Eval runs). This skill drives it through the Omni CLI: define a reusable prompt set, start a judged eval run against a model or branch, and read per-prompt verdicts from Omni's built-in accuracy judge.
Prefer this native system over building your own harness. The judge scores each answer semantically against the full agent conversation — it does not require golden query JSON, and it evaluates the whole agentic workflow (topic selection, queries, results, and the final written answer), not just generated query structure.
Tip: Use omni-ai-optimizer to improve scores after finding failures, omni-model-builder to apply context changes on a branch before A/B testing, and omni-model-explorer to discover topics and fields when writing prompts.
Prerequisites
# Verify the Omni CLI is installed — if not, ask the user to install it.
# See: https://github.com/exploreomni/cli#readme
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."
# Verify the CLI has the eval commands. If this errors with "unknown command",
# the binary is stale — ask the user to update it (the ai-eval group is generated
# from the bundled API spec).
omni ai-eval --help >/dev/null 2>&1 || echo "ERROR: 'omni ai-eval' missing — update the Omni CLI."
# Show available profiles and select the right one — running against the wrong
# instance silently evaluates the wrong model.
omni config show
# If multiple profiles exist, ask the user which to use:
omni config use <profile-name>
# Confirm the active profile is authenticated and inspect your permissions:
omni whoami whoami
Auth: a profile authenticates with an API key or OAuth. If whoami (or any call) returns 401, hand off: ask the user to run ! omni config login <profile> (OAuth 2.1 browser flow; it blocks ~2 min on the browser). Don't run config login yourself in a headless/CI session (no browser → timeout); on a local interactive machine you may. See the omni-api-conventions rule for profile setup (omni config init --auth oauth) and discovering request-body shapes with --schema.
You also need the model ID of a shared model to evaluate. Evals require at least Querier access on that model, and at least one topic optimized for AI. See the Evals guide for concepts and prompt-set best practices.
How it works
| Concept | What it is |
|---|---|
| Prompt set | A reusable, named list of up to 25 natural-language prompts, scoped to one model. Each prompt may carry an optional expectation — a reference answer the judge scores against. Lives server-side; create one per topic, per release, or for regression coverage. |
| Eval run | Executes a prompt set against a model branch (or main). Each prompt runs as a full async agentic AI job (the same engine as production Blobby), then the accuracy judge scores the result. |
| Accuracy judge | A fixed judge model that reads the evaluated AI's full conversation and returns a pass/fail verdict per prompt, plus confidence and a rationale. It targets high-impact analysis errors (hallucinations, date/time filtering, row-limit handling, mental math, period-over-period mistakes, wrong topic). It does not grade wording or formatting. |
All commands accept -o json (or --compact) to force structured output for parsing, and --profile <name> / --branch-id style global flags. Run omni ai-eval <command> --help for the full flag list. For commands that take a --body (e.g. prompt-sets-create), run them with --schema to print the body's JSON schema and a filled example instead of guessing the JSON.
Step 1 — Build a prompt set
omni ai-eval prompt-sets-create --compact --body '{
"model_id": "your-model-id",
"name": "Orders regression",
"slug": "orders-regression",
"description": "Core revenue + orders coverage",
"prompts": [
{ "prompt_text": "Show me revenue by month" },
{ "prompt_text": "What are the top 5 products by revenue?",
"expectation": "The top product by revenue should be Aniseed Syrup." },
{ "prompt_text": "How many orders were placed last week?" }
]
}'
The response includes the created prompt_set.id (a UUID) — capture it for the run.
| Field | Required | Notes |
|---|---|---|
| model_id | Yes | Shared model UUID the set is bound to |
| name | Yes | ≤ 255 chars |
| slug | Yes | Unique per model_id; must match ^[a-z][a-z0-9-]*$ |
| description | No | ≤ 1024 chars |
| prompts[] | No | ≤ 25; each needs prompt_text (≤ 8000 chars), optional expectation (≤ 16000 chars) |
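The limits in the table can be checked locally before calling prompt-sets-create. The following is a minimal sketch, not part of the Omni CLI: the function name validate_prompt_set_body is hypothetical, and it only encodes the constraints listed above (the server remains the source of truth, for example for slug uniqueness).

```python
import re

# Slug pattern copied from the field table above.
SLUG_RE = re.compile(r"^[a-z][a-z0-9-]*$")


def validate_prompt_set_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body passes the documented limits."""
    problems = []
    for field in ("model_id", "name", "slug"):
        if not body.get(field):
            problems.append(f"{field} is required")
    if len(body.get("name") or "") > 255:
        problems.append("name exceeds 255 chars")
    slug = body.get("slug") or ""
    if slug and not SLUG_RE.fullmatch(slug):
        problems.append("slug must match ^[a-z][a-z0-9-]*$")
    if len(body.get("description") or "") > 1024:
        problems.append("description exceeds 1024 chars")
    prompts = body.get("prompts") or []
    if len(prompts) > 25:
        problems.append("more than 25 prompts")
    for i, prompt in enumerate(prompts):
        text = prompt.get("prompt_text") or ""
        if not text:
            problems.append(f"prompts[{i}] is missing prompt_text")
        elif len(text) > 8000:
            problems.append(f"prompts[{i}].prompt_text exceeds 8000 chars")
        if len(prompt.get("expectation") or "") > 16000:
            problems.append(f"prompts[{i}].expectation exceeds 16000 chars")
    return problems
```

Running this on the body before serializing it for --body surfaces slug and length mistakes without spending an API call.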
Find or update an existing set instead of recreating:
omni ai-eval prompt-sets-list --model-ids your-model-id --compact # discover sets + ids
omni ai-eval prompt-sets-get <promptSetId> --compact # full set with prompts
prompt-sets-update replaces the entire prompts list: omitted prompts are deleted, entries with no id are created, entries with a matching id are updated in place. To add one prompt, send the full desired list (existing prompts carry their id).
omni ai-eval prompt-sets-update <promptSetId> --compact --body '{
"prompts": [
{ "id": "<existing-prompt-id>", "prompt_text": "Show me revenue by month" },
{ "prompt_text": "Revenue by month, last 12 months only" }
]
}'
Writing good prompts and expectations
- Mirror real user questions — pull from the AI usage analytics dashboard rather than inventing phrasing that echoes field names.
- Favor breadth over depth — one prompt across many topics yields more signal than ten on one topic.
- Add a regression prompt whenever an answer turns out wrong, so the same failure is caught next time.
- Set an
expectationonly when a prompt has a known answer worth pinning: a value or ranking ("top product should be Aniseed Syrup"), a direction ("revenue should be up YoY"), or a required breakdown. The judge treats a material divergence (wrong numbers, wrong direction, missing required result) as a failure but ignores wording/formatting differences. With no expectation, the judge decides whether the answer is correct on its own terms. The expectation is shown only to the judge — the evaluated AI never sees it.
Step 2 — Run an eval
# Against main:
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "Baseline on main"
}'
# Against a branch (measures a model-context change before promotion):
omni ai-eval runs-create --compact --body '{
"prompt_set_id": "<promptSetId>",
"description": "After adding ai_context to order_items",
"run_config": { "branch_id": "<branchId>" }
}'
The response returns run.id and job_count (one agentic job per prompt). Omit run_config.branch_id to run against the live shared model.
Concurrency cap: at most 2 eval runs in flight at once. A 429 means the per-user active-run cap is reached; wait for an in-flight run to finish or cancel one. A 503 means eval is paused for the org. Check runs-list before launching.
Poll for completion
omni ai-eval runs-get <runId> --compact
Poll with backoff (e.g. 5s, 10s, 20s) until the run's status is terminal — COMPLETE or CANCELLED. Track progress with each result's agentic_job.state (QUEUED → EXECUTING → COMPLETE/FAILED). Don't hammer the endpoint.
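The polling loop can be sketched as below. This is an illustration under assumptions, not a definitive implementation: it assumes runs-get with --compact prints a JSON object shaped like {"run": {"status": ...}} on stdout, and the 60-second delay cap and the 50-poll limit are arbitrary choices rather than documented values.

```python
import json
import subprocess
import time

# Terminal run statuses named in the text above.
TERMINAL_STATUSES = {"COMPLETE", "CANCELLED"}


def fetch_run(run_id: str) -> dict:
    """Call the CLI once and parse its JSON output (assumes --compact emits JSON on stdout)."""
    completed = subprocess.run(
        ["omni", "ai-eval", "runs-get", run_id, "--compact"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def wait_for_run(run_id, fetch=fetch_run, sleep=time.sleep,
                 first_delay=5.0, max_delay=60.0, max_polls=50):
    """Poll until the run is terminal, doubling the delay each time (5s, 10s, 20s, ...)."""
    delay = first_delay
    for _ in range(max_polls):
        payload = fetch(run_id)
        if payload["run"]["status"] in TERMINAL_STATUSES:
            return payload
        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the backoff (assumed cap)
    raise TimeoutError(f"run {run_id} did not reach a terminal status after {max_polls} polls")
```

The fetch and sleep parameters are injectable so the loop can be exercised without the CLI or real waiting.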
Step 3 — Read results
runs-get returns results[], one row per prompt:
{
"run": {
"status": "COMPLETE",
"branch_id": null,
"results": [
{
"prompt": "What are the top 5 products by revenue?",
"score": 1,
"error_reason": null,
"cost": 0.0021,
"scoring_cost": 0.0004,
"timing_ms": 4321,
"agentic_job": { "state": "COMPLETE", "conversation_id": "conv-uuid", "id": "job-uuid" }
}
]
}
}
| Field | Meaning |
|---|---|
| score | Judge verdict for the prompt: pass = 1, fail = 0 |
| error_reason | Set when the underlying agentic job failed |
| cost / scoring_cost | LLM cost (USD) for the answer vs. for judging it |
| timing_ms | Total AI time in ms: all LLM processing and tool calls (matches the "AI time" column in the UI), not wall-clock duration |
| agentic_job.conversation_id | Open this chat to read the judge's full verdict, confidence, and rationale |
Overall accuracy = the pass rate (mean of score across results). Report it with the per-prompt breakdown, and for any failure, point to the conversation_id so the user can read why the judge failed it — that rationale is where the actionable signal lives.
Eval run "Baseline on main" — 9/12 passed (75.0%)
✗ "Revenue by quarter" — judge: summed a row-limited result as a total
✗ "Top products this year" — judge: date filter used calendar instead of fiscal year
✗ "Churn rate by segment" — agentic job FAILED (error_reason)
(open each conversation_id for the full rationale)
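A report like the one above can be derived from the runs-get payload. The sketch below assumes the response shape shown in the JSON example (run.results[] with prompt, score, error_reason, and agentic_job.conversation_id); the function name summarize_run is hypothetical. It keeps job failures separate from judge failures, and counts both against the pass rate.

```python
def summarize_run(payload: dict) -> dict:
    """Compute the pass rate and split failures into judge failures and failed jobs."""
    results = payload["run"]["results"]
    # A result with error_reason set never produced an answer to score.
    job_failures = [r for r in results if r.get("error_reason")]
    judged = [r for r in results if not r.get("error_reason")]
    passed = [r for r in judged if r.get("score") == 1]
    judge_failures = [r for r in judged if r.get("score") != 1]
    total = len(results)
    return {
        "total": total,
        "passed": len(passed),
        "pass_rate": len(passed) / total if total else 0.0,
        # Pair each judge failure with the conversation that holds the rationale.
        "judge_failures": [
            (r["prompt"], (r.get("agentic_job") or {}).get("conversation_id"))
            for r in judge_failures
        ],
        "job_failures": [(r["prompt"], r["error_reason"]) for r in job_failures],
    }
```

Formatting pass_rate with f"{summary['pass_rate']:.1%}" yields the percentage style used in the report.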
A/B comparison: branch vs main
The core workflow for measuring whether a model change helps. Run the same prompt set twice — once on main, once on the branch — then compare.
- Create or identify the branch with the change (use omni-model-builder to apply ai_context, fields, joins, ai_settings, etc. on a branch).
- runs-create with no run_config → baseline run on main.
- runs-create with run_config.branch_id → branch run.
- Poll both to COMPLETE, then diff per-prompt verdicts.
A/B: main vs branch/new-context (prompt set: orders-regression)
main branch Δ
Accuracy: 75.0% 83.3% +8.3%
Prompt credits: 0.024 0.026 +0.002
Regressions (passed on main, failed on branch):
- rev-by-quarter
Improvements (failed on main, passed on branch):
- top-products-this-year
- churn-rate-by-segment
Always check for regressions, not just net improvement. A higher overall pass rate can still hide a prompt that newly broke: an ai_context change that helps most prompts may conflict with one. Call out any prompt that passed on main but fails on the branch.
Notes for an auditable comparison:
- Both runs must use the same prompt set so they evaluate identical prompts (the AI Hub comparison view enforces this too).
- Record the full branch_id used and confirm the branch run's run.branch_id matches it; don't claim the branch was exercised without that.
- Expectations are snapshotted per run, so editing a prompt's expectation later won't change how earlier runs were scored.
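The per-prompt diff and the checks in these notes can be sketched as follows. This assumes both payloads have the runs-get shape shown earlier and that prompt text is unique within a prompt set (an assumption; the source does not say so). The function names verdicts and compare_runs are hypothetical.

```python
def verdicts(payload: dict) -> dict:
    """Map prompt text to True (judge pass) or False (judge fail or failed job)."""
    return {
        r["prompt"]: (not r.get("error_reason")) and r.get("score") == 1
        for r in payload["run"]["results"]
    }


def compare_runs(main_payload: dict, branch_payload: dict, expected_branch_id: str) -> dict:
    """Diff a main run against a branch run and list regressions and improvements."""
    # Confirm the branch run actually exercised the intended branch.
    actual = branch_payload["run"].get("branch_id")
    if actual != expected_branch_id:
        raise ValueError(f"branch run used {actual!r}, expected {expected_branch_id!r}")
    main, branch = verdicts(main_payload), verdicts(branch_payload)
    # Both runs must cover identical prompts for the comparison to be meaningful.
    if set(main) != set(branch):
        raise ValueError("runs do not cover the same prompts")
    count = len(main)
    return {
        "main_accuracy": sum(main.values()) / count if count else 0.0,
        "branch_accuracy": sum(branch.values()) / count if count else 0.0,
        "regressions": sorted(p for p in main if main[p] and not branch[p]),
        "improvements": sorted(p for p in main if not main[p] and branch[p]),
    }
```

A non-empty regressions list deserves a callout even when branch_accuracy is higher than main_accuracy.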
Managing prompt sets and runs
omni ai-eval runs-list --prompt-set-id <promptSetId> --compact # runs for a set, newest first
omni ai-eval runs-cancel <runId> # cancel an in-flight run (also archives it)
omni ai-eval runs-archive <runId> # archive a finished run
omni ai-eval runs-unarchive <runId> # restore an archived run
omni ai-eval prompt-sets-archive <promptSetId> # archive a set (cancels its in-flight jobs)
omni ai-eval prompt-sets-unarchive <promptSetId> # restore an archived set
Archiving is a soft delete — sets and runs are preserved and can be restored. Cancelling a run marks it CANCELLED and archives it; use runs-unarchive to surface it in the default list again. Use --archived true on the *-list commands to see archived items.
Gotchas
- Right profile, right instance — the eval runs against whatever model lives on the active profile's instance. Confirm the profile before creating sets or runs.
- Judge is binary and fixed — a pass means the answer avoided the high-impact analysis errors, not that it's the single best possible answer. The judge model isn't configurable per run.
- Failed job ≠ failed judgment: a result with error_reason set means the agentic job itself failed (it never produced an answer to score); treat that separately from a judge score of 0.
- Snapshotting: to keep runs reproducible, capture the model state alongside results with omni models yaml-get <modelId> --compact and omni models validate <modelId>. Branch runs already pin the change to a branch.
Related skills
- omni-ai-optimizer: improve AI accuracy based on eval findings (ai_context, sample_queries, field metadata)
- omni-model-builder: apply context changes on a branch before A/B testing
- omni-model-explorer: discover topics and fields when writing prompts