omni-ai-eval
by exploreomni


npx skills add https://github.com/exploreomni/omni-agent-skills --skill omni-ai-eval

Omni Eval

Run evals against Omni's AI query generation APIs — submit test prompts, capture the generated query JSON, compare it against expected results, and score accuracy across dimensions.

Tip: Use omni-ai-optimizer to improve scores after identifying failures, and omni-model-explorer to discover available topics and fields for building eval cases.

Prerequisites

# Verify the Omni CLI is installed — if not, ask the user to install it
# See: https://github.com/exploreomni/cli#readme
command -v omni >/dev/null || echo "ERROR: Omni CLI is not installed."
# Show available profiles and select the appropriate one
omni config show
# If multiple profiles exist, ask the user which to use, then switch:
omni config use <profile-name>

You also need a model ID and an eval set — a file of test cases with prompts and expected query structures. See the Eval Design Guide for best practices on building eval sets.

Discovering Commands

omni ai --help    # AI operations (generate-query, jobs, pick-topic)

Tip: Use -o json to force structured output for programmatic parsing, or -o human for readable tables. The default is auto (human in a TTY, JSON when piped).

Eval Input Format

Each eval case pairs a natural language prompt with the expected query structure. JSONL (one JSON object per line) works well for bulk runs:

{"id": "rev-by-month", "prompt": "Show me revenue by month", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["order_items.created_at[month]", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}]}, "tags": ["time-series"]}
{"id": "top-customers", "prompt": "Top 10 customers by spend", "modelId": "your-model-id", "expected": {"topic": "order_items", "fields": ["users.name", "order_items.total_revenue"], "filters": {}, "sorts": [{"column_name": "order_items.total_revenue", "sort_descending": true}]}, "tags": ["top-n"]}
Field              Required  Description
id                 Yes       Unique identifier for the eval case
prompt             Yes       Natural language question to send to AI
modelId            Yes       Target model UUID
expected           Yes       Object with topic, fields, filters, sorts
branchId           No        Branch to test against
currentTopicName   No        Constrain to a specific topic
tags               No        Array of tags for filtering/grouping results

Note: JSONL is shown here, but any structured format works — CSV, JSON arrays, YAML — as long as you can iterate over cases and extract these fields.
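If you script your eval harness in TypeScript, a small parser can validate cases up front so malformed lines fail fast instead of mid-run. A minimal sketch: the EvalCase shape mirrors the table above, and parseEvalCases is a hypothetical helper name, not part of the Omni CLI.

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  modelId: string;
  expected: {
    topic: string;
    fields: string[];
    filters: Record<string, unknown>;
    sorts: unknown[];
  };
  branchId?: string;
  currentTopicName?: string;
  tags?: string[];
}

// Parse a JSONL string into eval cases, rejecting any line that
// lacks one of the required fields from the table above.
function parseEvalCases(jsonl: string): EvalCase[] {
  return jsonl
    .split("\n")
    .filter(line => line.trim().length > 0) // skip blank lines
    .map((line, i) => {
      const c = JSON.parse(line) as Record<string, unknown>;
      for (const key of ["id", "prompt", "modelId", "expected"]) {
        if (!(key in c)) {
          throw new Error(`line ${i + 1}: missing required field "${key}"`);
        }
      }
      return c as unknown as EvalCase;
    });
}
```

Read the file with `fs.readFileSync(path, "utf8")` and pass the contents in; keeping the parser pure makes it trivial to unit-test.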

Running Evals: Fast Path (Generate Query API)

The synchronous generate-query endpoint is the fastest way to eval query generation. Pass --run-query false to get only the generated query JSON without executing it against the database.

Single Eval Call

omni ai generate-query your-model-id "Show me revenue by month" --run-query false

Response Structure

{
  "query": {
    "fields": ["order_items.created_at[month]", "order_items.total_revenue"],
    "table": "order_items",
    "filters": {},
    "sorts": [{"column_name": "order_items.created_at[month]", "sort_descending": false}],
    "limit": 500
  },
  "topic": "order_items",
  "error": null
}

Request Parameters

Arg/Flag               Required  Description
<model-id>             Yes       UUID of the Omni model (positional arg)
<prompt>               Yes       Natural language question (positional arg)
--run-query            No        Set false to skip query execution (faster, default true)
--branch-id            No        Branch UUID for branch-specific testing
--current-topic-name   No        Constrain topic selection to a specific topic

Batch Loop (bash)

while IFS= read -r line; do
  id=$(echo "$line" | jq -r '.id')
  prompt=$(echo "$line" | jq -r '.prompt')
  model_id=$(echo "$line" | jq -r '.modelId')
  branch_id=$(echo "$line" | jq -r '.branchId // empty')

  branch_flag=""
  if [ -n "$branch_id" ]; then
    branch_flag="--branch-id $branch_id"
  fi

  result=$(omni ai generate-query "$model_id" "$prompt" --run-query false $branch_flag --compact)

  # Embed the result JSON safely with jq rather than string interpolation
  jq -n --arg id "$id" --argjson generated "$result" \
    '{id: $id, generated: $generated}' >> eval_results.jsonl
done < eval_cases.jsonl

Running Evals: Agentic Path (AI Jobs API)

Use the async AI Jobs API when you want to test the full agentic workflow — multi-step analysis, tool use, and topic selection as Blobby would actually behave in production.

Submit a Job

omni ai job-submit your-model-id "Show me revenue by month"

Response:

{
  "jobId": "job-uuid",
  "conversationId": "conv-uuid",
  "omniChatUrl": "https://yourorg.omniapp.co/chat/..."
}

Poll for Completion

omni ai job-status <jobId>

Status progression: QUEUED → EXECUTING → COMPLETE (or FAILED). Poll with backoff (e.g., 2s, 4s, 8s) until the state is terminal.
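One way to sketch that polling loop in TypeScript: the getStatus callback is a placeholder for however you invoke `omni ai job-status` (e.g. via child_process); only the backoff math is shown concretely.

```typescript
type JobState = "QUEUED" | "EXECUTING" | "COMPLETE" | "FAILED";

// Delay before retry n: 2s, 4s, 8s, ... capped at 60s
function backoffMs(attempt: number, baseMs = 2000, capMs = 60000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

async function pollUntilTerminal(
  getStatus: () => Promise<JobState>, // e.g. wraps `omni ai job-status <jobId>`
  maxAttempts = 10
): Promise<JobState> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await getStatus();
    if (state === "COMPLETE" || state === "FAILED") return state;
    await new Promise(resolve => setTimeout(resolve, backoffMs(attempt)));
  }
  throw new Error("job did not reach a terminal state");
}
```

The cap and attempt count are assumptions to tune; the important property is that the delay grows so the status endpoint is not hammered.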

Get Result

omni ai job-result <jobId>

The result contains an actions array. Look for actions with type: "generate_query" to extract the query JSON:

{
  "actions": [
    {
      "type": "generate_query",
      "message": "Querying revenue by month...",
      "result": {
        "queryName": "Revenue by Month",
        "query": { "fields": [...], "table": "...", "filters": {...} },
        "status": "success",
        "totalRowCount": 12
      }
    }
  ],
  "topic": "order_items",
  "resultSummary": "Here are the monthly revenue figures..."
}
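To pull the query JSON out of a job result programmatically, filter the actions array on the generate_query type. A sketch, assuming the result shape shown above (an agentic run may emit more than one query):

```typescript
interface JobAction {
  type: string;
  message?: string;
  result?: { query?: unknown; status?: string; [key: string]: unknown };
}

// Collect the query JSON from every generate_query action in a job result
function extractGeneratedQueries(jobResult: { actions: JobAction[] }): unknown[] {
  return jobResult.actions
    .filter(a => a.type === "generate_query" && a.result?.query !== undefined)
    .map(a => a.result!.query);
}
```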

When to Use Which Path

Criterion    Generate Query (Fast)    AI Jobs (Agentic)
Speed        Synchronous, fast        Async, slower
Volume       High-volume runs         Lower volume
Scope        Query generation only    Full agent workflow
Use case     Field/filter accuracy    End-to-end behavior
Multi-step   Single query             May generate multiple queries

Testing Topic Selection

Eval topic selection independently with the pick-topic endpoint:

omni ai pick-topic your-model-id "How many users signed up last month?"

Response:

{
  "topicId": "users"
}

This lets you score topic selection accuracy as a separate dimension — useful when topic selection is a known weak point.

Scoring: Structural Query Comparison

Compare the generated query JSON against the expected query across four dimensions:

Dimension   Comparison Method                             Scoring
topic       Exact string match                            pass/fail
fields      Set comparison (order-independent)            pass/fail + similarity score
filters     Key-value match (key present + value match)   pass/fail per filter key
sorts       Ordered array comparison                      pass/fail

Example Comparison Logic (TypeScript)

function scoreEval(expected: any, generated: any) {
  // Topic: exact match
  const topicPass = generated.topic === expected.topic;

  // Fields: set comparison (order-independent)
  const expectedFields = new Set(expected.fields);
  const generatedFields = new Set(generated.query.fields);
  const missing = [...expectedFields].filter(f => !generatedFields.has(f));
  const extra = [...generatedFields].filter(f => !expectedFields.has(f));
  const fieldsPass = missing.length === 0 && extra.length === 0;

  // Filters: key-value match
  const expectedFilters = expected.filters || {};
  const generatedFilters = generated.query.filters || {};
  const missingKeys = Object.keys(expectedFilters).filter(k => !(k in generatedFilters));
  const wrongValues = Object.keys(expectedFilters)
    .filter(k => k in generatedFilters && generatedFilters[k] !== expectedFilters[k]);
  const filtersPass = missingKeys.length === 0 && wrongValues.length === 0;

  // Sorts: ordered comparison
  const sortsPass = JSON.stringify(expected.sorts || []) ===
    JSON.stringify(generated.query.sorts || []);

  return {
    topic: topicPass,
    fields: { pass: fieldsPass, missing, extra },
    filters: { pass: filtersPass, missingKeys, wrongValues },
    sorts: sortsPass,
    allPass: topicPass && fieldsPass && filtersPass && sortsPass,
  };
}

Aggregate Scoring

Compute pass rates across all eval cases:

Eval Results: 47/50 passed (94.0%)
  Topic:   49/50 (98.0%)
  Fields:  47/50 (94.0%)
  Filters: 48/50 (96.0%)
  Sorts:   50/50 (100.0%)

Per-dimension rates help pinpoint where accuracy is weakest — if topic accuracy is high but filter accuracy is low, focus ai_context improvements on filter-related guidance.
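Building on scoreEval above, the aggregation is a straightforward fold over the scored results. A sketch of the per-dimension pass-rate computation:

```typescript
interface ScoredResult {
  topic: boolean;
  fields: { pass: boolean };
  filters: { pass: boolean };
  sorts: boolean;
  allPass: boolean;
}

// Per-dimension pass rates (percentages) across all scored eval cases
function aggregate(results: ScoredResult[]) {
  const rate = (pick: (r: ScoredResult) => boolean) =>
    (100 * results.filter(pick).length) / results.length;
  return {
    overall: rate(r => r.allPass),
    topic: rate(r => r.topic),
    fields: rate(r => r.fields.pass),
    filters: rate(r => r.filters.pass),
    sorts: rate(r => r.sorts),
  };
}
```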

A/B Comparison

Run the same eval suite with one variable changed to measure impact. This is the core workflow for understanding whether a change improves or degrades AI accuracy.

Common Variables to Compare

  • Model branches — pass different --branch-id values to test context changes on a branch before merging
  • Topic scope — --current-topic-name "orders" vs omitted (auto-select)
  • Model context changes — ai_context, sample_queries, field descriptions (apply via omni-model-builder on a branch, then eval against that branch)
  • Prompt wording — same expected query, different prompt text
  • AI configuration — model type, thinking level, or other AI parameters

Workflow

  1. Run eval suite with configuration A → save as results_a.jsonl
  2. Run eval suite with configuration B → save as results_b.jsonl
  3. Score both result sets
  4. Compare side-by-side, checking for regressions
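Step 4 can be automated by joining the two scored result sets on case id and bucketing the differences. A sketch, assuming each entry carries the case id and its allPass flag:

```typescript
interface RunResult { id: string; allPass: boolean }

// Diff two eval runs: cases that regressed (A pass, B fail)
// and cases that improved (A fail, B pass)
function compareRuns(a: RunResult[], b: RunResult[]) {
  const passA = new Map(a.map(r => [r.id, r.allPass]));
  const regressions: string[] = [];
  const improvements: string[] = [];
  for (const r of b) {
    const wasPass = passA.get(r.id);
    if (wasPass === undefined) continue; // case not present in run A
    if (wasPass && !r.allPass) regressions.push(r.id);
    if (!wasPass && r.allPass) improvements.push(r.id);
  }
  return { regressions, improvements };
}
```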

Example Comparison Output

A/B Comparison: main vs branch/new-context
                      A (main)    B (new-context)    Delta
Overall pass rate:    88.0%       94.0%              +6.0%
Topic accuracy:       96.0%       98.0%              +2.0%
Field accuracy:       90.0%       94.0%              +4.0%
Filter accuracy:      88.0%       96.0%              +8.0%

Regressions (passed in A, failed in B):
  - rev-by-quarter: fields missing order_items.total_revenue

Improvements (failed in A, passed in B):
  - customer-count: topic now correctly selects users
  - top-products: filters now include status=complete

Important: Always check for regressions, not just overall improvement. A net improvement that breaks previously-correct cases may indicate an ai_context conflict.

Snapshotting Model State

Before running evals, snapshot the model definition so results are reproducible:

# Save model YAML
omni models yaml-get <modelId> --compact > model_snapshot_$(date +%Y%m%d).json

# Validate model integrity
omni models validate <modelId>

Version your eval set alongside model snapshots so you can trace which model state produced which scores.

Known Issues & Gotchas

  • Filter comparison can be complex — Omni supports rich filter expressions ("last 7 days", "between 10 and 100", "not null"). The structural comparison above uses exact string match on filter values. If the AI produces semantically equivalent but syntactically different expressions, you may see false failures. Consider normalizing common patterns or using a Jaccard threshold.
  • AI Jobs are async — poll with exponential backoff. Don't hammer the status endpoint.
  • Rate limiting — for high-volume eval runs, add a small delay between calls or batch requests.
  • limit field may vary — the AI may choose different limits than expected. Consider excluding limit from strict comparison if it's not critical to your eval.
  • table vs topic — the generate-query response returns topic as a top-level field and table inside the query object. These usually match but aren't always identical. Compare against the top-level topic.
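For the looser comparisons suggested above, a Jaccard similarity over field sets yields a graded score instead of a strict pass/fail; the 0.8 threshold here is an assumption to tune per eval set:

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|; 1.0 means identical sets
function jaccard(expected: string[], generated: string[]): number {
  const a = new Set(expected);
  const b = new Set(generated);
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : intersection / union;
}

const FIELD_SIMILARITY_THRESHOLD = 0.8; // assumption: tune per eval set

// Treat a fields comparison as "close enough" above the threshold
function fieldsCloseEnough(expected: string[], generated: string[]): boolean {
  return jaccard(expected, generated) >= FIELD_SIMILARITY_THRESHOLD;
}
```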


Related Skills

  • omni-query — run golden queries to validate expected results
  • omni-model-explorer — discover topics and fields for building eval cases
  • omni-ai-optimizer — improve AI accuracy based on eval findings
  • omni-model-builder — apply context changes on branches before A/B testing
