experiment-analyzer by datadog-labs

Analyze LLM experiment results. Handles single or comparative experiments, exploratory or Q&A modes. Use when user says "analyze experiment", "compare…

npx skills add https://github.com/datadog-labs/agent-skills --skill experiment-analyzer

Unified Experiment Analyzer

Analyzes one or two LLM experiments. Supports four modes based on inputs:

| Inputs | Mode |
| --- | --- |
| 2 IDs, no question | Comparative Exploratory |
| 2 IDs + question | Comparative Q&A |
| 1 ID, no question | Single Exploratory |
| 1 ID + question | Single Q&A |

Usage

/experiment-analyzer <experiment_id_1> [experiment_id_2] [question text] [--output agent|file|notebook]

Arguments: $ARGUMENTS

Available Tools

| Tool | Purpose |
| --- | --- |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_summary` | Get total events, error count, metrics stats, available dimensions |
| `mcp__datadog-llmo-mcp__list_llmobs_experiment_events` | Query events with filters, sorting, pagination |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_event` | Get full event details (input, output, expected_output, metrics) |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_metric_values` | Get metric stats overall and segmented by dimension |
| `mcp__datadog-llmo-mcp__get_llmobs_experiment_dimension_values` | List unique values for a dimension with counts |
| `mcp__datadog-mcp-core__create_datadog_notebook` | Export the report as a Datadog notebook |

Phase 0 — Mode & Output Resolution

Parse $ARGUMENTS:

  1. Extract one or two UUID-format strings as experiment IDs (first = baseline/primary, second = candidate).
  2. Extract --output agent|file|notebook flag if present.
  3. The remaining text (after IDs and flags) is the question, if any.

Mode determination:

  • 2 IDs + question → Comparative Q&A
  • 2 IDs, no question → Comparative Exploratory
  • 1 ID + question → Single Q&A
  • 1 ID, no question → Single Exploratory
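
A minimal sketch of this parsing and mode resolution, assuming `$ARGUMENTS` arrives as one whitespace-separated string (the regexes, function name, and return shape below are illustrative, not part of the skill):

```python
import re

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
OUTPUT_RE = re.compile(r"--output\s+(agent|file|notebook)\b")

def resolve_mode(arguments: str) -> dict:
    """Split $ARGUMENTS into experiment IDs, output flag, and question, then pick a mode."""
    ids = UUID_RE.findall(arguments)[:2]               # first = baseline/primary, second = candidate
    output_match = OUTPUT_RE.search(arguments)
    output = output_match.group(1) if output_match else None

    # Whatever remains after stripping IDs and the flag is treated as the question.
    question = OUTPUT_RE.sub("", UUID_RE.sub("", arguments)).strip() or None

    if len(ids) == 2:
        mode = "Comparative Q&A" if question else "Comparative Exploratory"
    else:
        mode = "Single Q&A" if question else "Single Exploratory"
    return {"ids": ids, "question": question, "output": output, "mode": mode}
```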

Output mode determination:

If --output was provided in arguments, use that mode and skip asking.

Otherwise, ask one combined clarification message before proceeding. Cover only what is genuinely unclear:

  • If mode is ambiguous (e.g., user asked a question but only provided IDs in surrounding context), ask in plain language: "Did you have a specific question in mind, or would you like an exploratory analysis?"
  • Always ask about output destination if not specified: "Would you like me to save this to a file, export it to a Datadog notebook, or is displaying it here in chat fine?"

Never ask multiple rounds of clarifications. One message covers everything unresolved.

Output modes:

  1. Agent (default): Display the full report in the conversation.
  2. File: Before starting, propose a path: `evals/reports/YYYY-MM-DD-<experiment-slug>-analysis.md`. Present it to the user and let them confirm or adjust, then proceed.
  3. Notebook: Use mcp__datadog-mcp-core__create_datadog_notebook at the end. If the tool is unavailable, output these setup instructions instead of failing:
    To enable Datadog notebook export, add the MCP server:
      claude mcp add --transport http datadog-mcp https://mcp.datadoghq.com/api/unstable/mcp-server
    See: https://docs.datadoghq.com/bits_ai/mcp_server/setup/
    
    Then ask: "Would you like to fall back to file or agent output instead?" See Phase 5 for full notebook call details.

After resolving mode and output, proceed fully automatically through Phases 1–5 with no further user interaction.


Phase 1 — Orient

Comparative: Call get_llmobs_experiment_summary for both experiments. Produce a side-by-side comparison:

  • Scale: total events and error rate for each
  • Metrics: which metrics exist in each; which are shared
  • Dimensions: which dimensions exist in each; which are shared
  • Immediate red flags (high error rate, missing metrics, sparse data)
  • Obvious improvements or regressions visible at the summary level

Single: Call get_llmobs_experiment_summary for the experiment. Determine:

  • Total events, error count, error rate
  • Available metrics (classify as exact-match vs. rubric/quality)
  • Available dimensions for segmentation
  • Any immediate red flags

Phase 2 — Signal Discovery + UI Links

Comparative: Using only shared metrics and dimensions, identify:

  • Segments where the candidate outperforms the baseline
  • Segments where the candidate regresses
  • Error types present in one but rare in the other
  • Distribution shifts or coverage gaps
  • Tradeoffs (e.g., higher recall, lower precision)

Generate Datadog comparison UI links:

  • Base URL: https://app.datadoghq.com/llm/experiment-comparison
  • Required params: baselineExperimentId, experimentIds (candidate%2Cbaseline), tableView=all
  • Optional (include if discoverable): project, compareDatasetId, selectedEvaluation
  • selectedEvaluation priority: overall/overall_score/rubric metric → primary metric → first shared metric
  • Generate 2–4 links: primary comparison, regression view, calibration view (if applicable), worst-segment view (only if supported — never fabricate filters)
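
A rough sketch of how the comparison link could be assembled from these parameters (the helper name is made up; the query parameters follow the list above):

```python
from urllib.parse import urlencode

def comparison_url(baseline_id: str, candidate_id: str,
                   selected_evaluation: str | None = None,
                   project: str | None = None) -> str:
    """Build an experiment-comparison UI link from the required and optional params."""
    params = {
        "baselineExperimentId": baseline_id,
        # experimentIds is "candidate,baseline"; urlencode renders the comma as %2C
        "experimentIds": f"{candidate_id},{baseline_id}",
        "tableView": "all",
    }
    if selected_evaluation:
        params["selectedEvaluation"] = selected_evaluation
    if project:
        params["project"] = project
    return "https://app.datadoghq.com/llm/experiment-comparison?" + urlencode(params)
```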

Single: Measure per-metric performance across all dimensions. Identify:

  • Worst-performing segments (by metric × dimension)
  • Any segments with surprising pass rates
  • Overall pass rates and variance
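
A small sketch of the segment ranking, assuming per-segment metric values have already been fetched and reduced to counts (the input shape here is hypothetical, not the MCP tool's actual schema):

```python
def rank_segments(segments: list[dict], min_events: int = 5) -> list[dict]:
    """Rank (dimension, value) segments from worst to best pass rate.

    Each entry is assumed to look like:
      {"dimension": "category", "value": "sql", "passed": 3, "total": 12}
    """
    ranked = [
        {**seg, "pass_rate": seg["passed"] / seg["total"]}
        for seg in segments
        if seg["total"] >= min_events          # skip sparse segments
    ]
    return sorted(ranked, key=lambda s: s["pass_rate"])
```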

Generate Datadog experiment UI link:

  • https://app.datadoghq.com/llm/experiments/{experiment_id}

Phase 3 — Deep Dives

Run all necessary deep dives automatically. Do not ask for approval or pause.

Q&A modes: Focus deep dives on what is needed to answer the question directly. Pull specific events, segment by relevant dimensions, inspect examples.

Exploratory modes: Investigate the most interesting signals broadly:

  • Per-segment and per-class delta analysis (comparative) or pass-rate analysis (single)
  • Error overlap vs. unique failure mode analysis
  • Sampling and qualitative inspection of representative failures (2–5 per issue)
  • Clustered error theme analysis

Rules:

  • Prefer cheap, high-signal analyses first; do not stop early.
  • Mask or redact PII in all outputs.
  • Avoid destructive actions.

For each sampled event, generate a direct span link: https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&sp=[{"p":{"experimentId":"{experiment_id}","spanId":"{span_id}"},"i":"experiment-details"}]&spanId={span_id}

For each Deep Dive segment, generate a direct link to view those events in the (candidate) experiment: https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&filter[{dimension}]={value} If you are not confident the filter URL format works for this dimension, omit the filter params and link to the experiment root instead. Never fabricate filter URLs.
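
A sketch of how both link templates might be filled in, serializing and URL-encoding the `sp` JSON payload (the helper names are illustrative; the query formats are taken from the templates above):

```python
import json
from urllib.parse import quote, urlencode

BASE = "https://app.datadoghq.com/llm/experiments"

def span_link(experiment_id: str, span_id: str) -> str:
    """Direct link to one sampled event (span) inside an experiment."""
    sp = json.dumps(
        [{"p": {"experimentId": experiment_id, "spanId": span_id}, "i": "experiment-details"}],
        separators=(",", ":"),
    )
    query = urlencode({"selectedTab": "overview", "sp": sp, "spanId": span_id})
    return f"{BASE}/{experiment_id}?{query}"

def segment_link(experiment_id: str, dimension: str | None = None, value: str | None = None) -> str:
    """Link to a segment's events; fall back to the experiment root if the filter format is unverified."""
    if dimension is None or value is None:
        return f"{BASE}/{experiment_id}"       # never fabricate filter URLs
    return f"{BASE}/{experiment_id}?selectedTab=overview&filter[{quote(dimension)}]={quote(str(value))}"
```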


Phase 4 — Synthesis

Comparative Exploratory:

  • Clear wins where the candidate improves on the baseline
  • Clear regressions or risks the candidate introduces
  • Neutral or unchanged areas
  • Root-cause hypotheses (1–4), tied to evidence
  • Prioritized recommendations: ship as-is / block / gate by segment / combine behaviors

Comparative Q&A:

  • Direct answer to the question with a clear verdict
  • Supporting evidence (metrics, percentages, event examples)
  • Relevant context (e.g., caveats, data limitations)

Single Exploratory:

  • Overall performance assessment
  • Worst-performing segments and root causes
  • Hypotheses for why failures occur
  • Recommended next experiments

Single Q&A:

  • Direct answer to the question with a clear verdict
  • Supporting evidence from the experiment data

All modes: use quantified deltas/rates wherever possible. Redact PII.


Phase 5 — Output Delivery

Agent: Present the full report in the conversation using the report format below.

File: Write the report to the pre-confirmed path. Confirm with: "Report saved to <path>."

Notebook: Call mcp__datadog-mcp-core__create_datadog_notebook with the following parameters:

  • name (by mode):

    | Mode | Name |
    | --- | --- |
    | Comparative Exploratory | Experiment Analysis: {baseline_short} (Baseline) vs {candidate_short} (Candidate) — YYYY-MM-DD |
    | Comparative Q&A | Experiment Q&A: {baseline_short} vs {candidate_short} — YYYY-MM-DD |
    | Single Exploratory | Experiment Analysis: {experiment_short} — YYYY-MM-DD |
    | Single Q&A | Experiment Q&A: {experiment_short} — YYYY-MM-DD |

    where short = first 8 characters of the UUID.
  • cells: the full report as a single markdown cell — [{ "type": "markdown", "text": "<full report markdown>" }]. Omit the # Experiment Analysis Report top-level heading from the cell content — it is already shown as the notebook title.

  • time: { "live_span": "1h" }
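
For reference, the tool arguments could be assembled along these lines (a sketch only; the dict mirrors the parameters above and the exact tool schema should be confirmed against the MCP server):

```python
from datetime import date

def notebook_payload(name_prefix: str, report_markdown: str) -> dict:
    """Assemble create_datadog_notebook arguments as described above."""
    # Drop the top-level heading; the notebook title already carries it.
    body = report_markdown.replace("# Experiment Analysis Report\n", "", 1)
    return {
        "name": f"{name_prefix} — {date.today().isoformat()}",  # e.g. "Experiment Analysis: {short} — YYYY-MM-DD"
        "cells": [{"type": "markdown", "text": body}],
        "time": {"live_span": "1h"},
    }
```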

After the notebook is created, output the URL in chat: "Report exported to notebook: <url>"

If the tool is unavailable, follow the fallback instructions in Phase 0.


Phase 6 — Conversational Follow-up

After delivering the report, append a follow-up section:

---
## Want to explore further?

Here are a few directions based on the findings:

1. [Specific question derived from actual findings — e.g., "Want me to dig deeper into why the SQL scenarios regressed in the candidate?"]
2. [Another specific follow-up — e.g., "Should I compare error patterns between the two failing clusters?"]
3. [A third option if relevant]

Do you have any other questions about this analysis?

Stay active after the report. Answer follow-up questions using the same MCP tools, referencing findings already gathered. Do not re-run analyses you've already performed unless new questions require it.


Report Format

Link rules:

  • Experiment IDs: Wherever a full experiment UUID appears, render it as a Markdown link to https://app.datadoghq.com/llm/experiments/{full_uuid}.
  • Comparative table column headers: In the Orientation table and in every subsequent table that has Baseline/Candidate columns, wrap the entire column header as a link — not just the short ID. Format: [Baseline `{short_id}`]({baseline_url}) and [Candidate `{short_id}`]({candidate_url}). This makes the full header cell clickable, not just the ID portion.
# Experiment Analysis Report

> **Question:** {original question text}
> _(Q&A modes only — omit for Exploratory modes)_

## Summary & Recommendations

[Comparative: **Baseline**: [`{baseline_short}`]({baseline_url}) | **Candidate**: [`{candidate_short}`]({candidate_url}) | **Compare**: [`{baseline_short}-{candidate_short}`](https://app.datadoghq.com/llm/experiment-comparison?baselineExperimentId={baseline_id}&experimentIds={candidate_id}%2C{baseline_id}&tableView=all&selectedEvaluation=pass) — Single: **Experiment**: [`{experiment_short}`]({experiment_url}). Always the first line of this section.]

[2–3 sentence executive summary directly below the links line. Open with "This is a **{Mode}** analysis..." where {Mode} is one of: Comparative Exploratory, Comparative Q&A, Single Exploratory, Single Q&A. Include experiment(s) purpose, scale, and key finding with specific numbers.]

[If the report uses opaque dimension values (e.g. category labels like b1/b2/b3/bx), add a `### Dataset Categories` subsection here. Include: one sentence explaining where the categories come from (i.e. labels from the evaluation dataset grouping questions by required tools/data sources), then a bullet per value with its name bolded and a brief description. Infer descriptions from input question patterns, capability tags, and expected tool calls. Omit this subsection if all dimension values are self-explanatory.]

[Wins, regressions, neutral areas, prioritized actions. For Q&A: verdict + rationale.]

## Orientation

[Side-by-side table for comparative; summary table for single. Include: events, error rate, metrics, dimensions. Experiment IDs in column headers must be Markdown links.]


## What Changed

[Comparative modes only. Table of differences between baseline and candidate: model, toolset/skill profile,
dataset, evaluator schema, and any other metadata differences detectable from the summary data.
If no differences are detectable, write: "No configuration differences detected between experiments."]

## [Signals | Answer to Question]

[For exploratory: ranked table of signals/segments with metric deltas and impact counts.]
[For Q&A: direct answer with verdict, then supporting evidence.]

## Deep Dive Findings

### [Issue/Finding Title]

**Segment**: `[dimension=value]` | **Impact**: N events | **Severity**: metric pass rate = X% | [View events](https://app.datadoghq.com/llm/experiments/{experiment_id}?selectedTab=overview&filter[{dimension}]={value})

**What's happening**: [1–2 sentences: key observation and metric impact only]

**Representative examples**:
- [Span link]: [input → output → expected, what went wrong]

**Root cause hypothesis**: [Category]: [Explanation tied to evidence]

**Recommendation**: [Specific, actionable next step]

---
[Repeat for each major issue]

## UI Links

[All generated Datadog UI links with labels]

Operating Rules

  • Do not assume anything about the experiment (model, task, metrics, schema, dimensions). Infer everything by inspecting the data.
  • Ground all conclusions in specific evidence: event IDs, counts, percentages.
  • Show math: include counts and rates, not just qualitative claims.
  • Avoid speculative explanations not supported by observed evidence.
  • Mask or redact PII in all user-visible output.
  • Never show internal tool calls, schemas, or implementation details to the user.
  • MCP result parsing safety: Before writing any script (Python, jq, etc.) that iterates over or accesses fields in an MCP tool result, inspect the raw structure first — check type(result), top-level keys, and whether the payload is nested inside a content block (e.g. [{'type': 'text', 'text': '<json>'}]). Extract and json.loads() the inner payload if needed before parsing. Never assume MCP results are bare dicts or lists.
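
A defensive unwrapping helper in that spirit (a sketch only; real MCP result shapes vary by client):

```python
import json

def unwrap_mcp_result(result):
    """Best-effort extraction of the JSON payload from an MCP tool result.

    Handles bare dicts/lists, JSON strings, and content-block wrappers like
    [{"type": "text", "text": "<json>"}]. Always inspect the real structure first.
    """
    if isinstance(result, list) and result and isinstance(result[0], dict) and "text" in result[0]:
        return json.loads(result[0]["text"])   # content-block wrapper
    if isinstance(result, (dict, list)):
        return result                          # already a bare structure
    if isinstance(result, str):
        return json.loads(result)              # raw JSON string
    raise TypeError(f"Unexpected MCP result type: {type(result)!r}")
```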
