exploring-apm-traces

PostHog captures distributed traces from OpenTelemetry. Each trace is a tree of spans that represents the path of a request through services.

npx skills add https://github.com/posthog/ai-plugin --skill exploring-apm-traces

Exploring APM traces (OpenTelemetry spans)

PostHog captures distributed traces from OpenTelemetry. Each trace is a tree of spans representing a request's path through services.

Disambiguation: This skill is for APM / OpenTelemetry traces. Do not confuse with AI observability traces (agent/model $ai_* events) or logs (posthog:query-logs, posthog:logs-*).

Available tools

Tool | Purpose
--- | ---
posthog:query-apm-spans | Search and filter spans (compact list view)
posthog:apm-trace-get | Get the full span list for one hex trace_id
posthog:apm-spans-aggregate | Per-operation aggregates (count, p50/p95, errors)
posthog:apm-spans-tree | Call-tree aggregates per (parent, child) edge
posthog:apm-spans-count | Scalar span count (cheap filter pre-flight)
posthog:apm-spans-sparkline | Span counts over time (zero-filled time series)
posthog:apm-spans-duration-histogram | Trace counts per log-scale duration bucket
posthog:apm-attribute-breakdown | Span counts grouped by one attribute's value
posthog:apm-services-list | List distinct service names
posthog:apm-attributes-list | List span or resource attribute keys
posthog:apm-attribute-values-list | List values for a specific attribute key

See references/spans-and-fields.md for the response schema and the kind/status_code enums.

Workflow: debug a trace from a URL

Step 1 — Fetch the trace

posthog:apm-trace-get
{
  "trace_id": "<hex_trace_id>"
}

The response is { results: [span, span, …] } — a flat list of every span in the trace. The list can be very large for fan-out request flows; when it exceeds the inline limit, Claude Code auto-persists it to a file.

From the result you get:

  • Every span with name, service_name, kind, status_code, parent_span_id, duration_nano, is_root_span
  • The _posthogUrl: always include this in your response so the user can click through to the UI

Step 2 — Parse large results with scripts

When the result is persisted to a file (traces with hundreds of spans across services), use the parsing scripts to explore it.

Start with the summary to get the full picture, then drill into specifics:

# 1. Overview: services, span count, slowest spans, errors
python3 scripts/print_summary.py /path/to/persisted-file.json

# 2. Indented chronological tree (DFS by parent_span_id)
python3 scripts/print_timeline.py /path/to/persisted-file.json

# 3. Drill into a specific span by name
SPAN="HTTP GET /api/users" python3 scripts/extract_span.py /path/to/persisted-file.json

# 4. Search for a keyword across span names, services, IDs
SEARCH="keyword" python3 scripts/search_spans.py /path/to/persisted-file.json

# 5. When the JSON shape looks unfamiliar
python3 scripts/show_structure.py /path/to/persisted-file.json

All scripts support MAX_LEN=N env var to control truncation (0 = unlimited).

Tree reconstruction (parent_span_id → span_id)

The flat span list is a tree. Each span carries:

  • trace_id — same on every span in the trace
  • span_id — this span's unique hex ID
  • parent_span_id — points to the parent's span_id (zero-padded hex 000…000 for the root)
  • is_root_span — convenience flag for the trace entry

To rebuild the tree:

  1. Spans where is_root_span is true (or parent_span_id == "00000000…") are root spans.
  2. Every other span is a child of the span whose span_id matches its parent_span_id.
  3. Group by parent_span_id, walk from each root downward.

scripts/print_timeline.py does this for you and prints a DFS-indented tree.
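
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the source of scripts/print_timeline.py: the function names and the sample spans are hypothetical, and it assumes timestamps share one ISO 8601 format so they sort correctly as strings.

```python
from collections import defaultdict

ROOT_PARENT = "0" * 16  # zero-padded parent_span_id carried by a root span


def find_roots(spans):
    """Step 1: roots have is_root_span set, or an all-zero parent_span_id."""
    return [
        s for s in spans
        if s.get("is_root_span") or s["parent_span_id"] == ROOT_PARENT
    ]


def group_by_parent(spans):
    """Steps 2 and 3: map each parent_span_id to the spans that point at it."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)
    return children


def render_tree(spans):
    """Walk depth-first from each root and return one indented line per span."""
    children = group_by_parent(spans)
    lines = []

    def walk(span, depth):
        lines.append(f"{'  ' * depth}{span['name']} [{span['service_name']}]")
        for child in sorted(children[span["span_id"]], key=lambda s: s["timestamp"]):
            walk(child, depth + 1)

    for root in sorted(find_roots(spans), key=lambda s: s["timestamp"]):
        walk(root, 0)
    return lines


# Hypothetical three-span trace, deliberately out of order.
sample = [
    {"span_id": "cccccccccccccccc", "parent_span_id": "bbbbbbbbbbbbbbbb",
     "is_root_span": False, "name": "INSERT payments", "service_name": "payments",
     "timestamp": "2024-01-01T00:00:00.020Z"},
    {"span_id": "aaaaaaaaaaaaaaaa", "parent_span_id": ROOT_PARENT,
     "is_root_span": True, "name": "GET /checkout", "service_name": "api-gateway",
     "timestamp": "2024-01-01T00:00:00.000Z"},
    {"span_id": "bbbbbbbbbbbbbbbb", "parent_span_id": "aaaaaaaaaaaaaaaa",
     "is_root_span": False, "name": "charge", "service_name": "payments",
     "timestamp": "2024-01-01T00:00:00.010Z"},
]

print("\n".join(render_tree(sample)))
```

The recursive walk is fine for typical traces; for very deep trees an explicit stack would avoid Python's recursion limit.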

Investigation patterns

"Where is time going?"

  1. Every span from apm-trace-get carries self_time_nano, the part of its duration not covered by children. Sort by it: the top span is where wall-clock time actually went. A parent with large self_time_nano is an uninstrumented gap (the work happened inside it, not in any recorded child).
  2. Run print_summary.py — it surfaces the top-5 slowest spans by duration_nano.
  3. For a noisy trace, run print_timeline.py and scan the indented durations — you can see whether time is dominated by one child span or fan-out across many.
  4. To dig into one slow span, SPAN="<name>" python3 scripts/extract_span.py FILE.
  5. For aggregate "which child dominates" questions use apm-spans-tree and read calls_per_parent_invocation — it separates a child that's slow per call from one that merely runs 20× per parent.
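
Sorting by self time (step 1) might look like the following sketch. It assumes the spans have already been loaded into a list of dicts; top_self_time and the sample values are hypothetical, and spans without self_time_nano are treated as zero.

```python
def top_self_time(spans, limit=5):
    """Return (name, service, self time in ms) for the spans with the most self time."""
    ranked = sorted(spans, key=lambda s: s.get("self_time_nano", 0), reverse=True)
    return [
        (s["name"], s["service_name"], s.get("self_time_nano", 0) / 1_000_000)
        for s in ranked[:limit]
    ]


# Hypothetical spans: the parent is long, but most of its time is inside the child.
sample = [
    {"name": "GET /checkout", "service_name": "api-gateway",
     "duration_nano": 150_000_000, "self_time_nano": 30_000_000},
    {"name": "charge", "service_name": "payments",
     "duration_nano": 120_000_000, "self_time_nano": 120_000_000},
]

for name, service, self_ms in top_self_time(sample):
    print(f"{self_ms:8.1f} ms  {service}  {name}")
```

Ranking by duration_nano instead would put the parent first, which is why self time is the better first cut.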

"Where did the error happen?"

  1. print_summary.py lists every span with status_code == 2 (Error). Each entry shows service, span name, and parent context.
  2. Walk up the tree from an error span via parent_span_id to see what request path led there.
  3. Error detail lives in each span's attributes map (e.g. exception.message, exception.type), which is returned in the trace payload — read it directly off the error span. apm-attribute-values-list is for discovering values across spans, not a prerequisite for reading one span's attributes.
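
A minimal sketch of steps 2 and 3: find the error spans, walk up via parent_span_id, and read the exception detail off the span's attributes map. The helper names and sample spans are hypothetical; only the field names come from the span schema described in this document.

```python
STATUS_ERROR = 2  # status_code value for Error


def ancestor_path(spans, span_id):
    """Return the spans from the root down to span_id, following parent_span_id."""
    by_id = {s["span_id"]: s for s in spans}
    path, seen = [], set()
    current = by_id.get(span_id)
    # Stop at the root (its parent ID matches no span) or on a malformed cycle.
    while current is not None and current["span_id"] not in seen:
        seen.add(current["span_id"])
        path.append(current)
        current = by_id.get(current["parent_span_id"])
    return list(reversed(path))


def error_report(spans):
    """For each error span, return (request path, exception.message or None)."""
    report = []
    for span in spans:
        if span["status_code"] != STATUS_ERROR:
            continue
        path = " > ".join(p["name"] for p in ancestor_path(spans, span["span_id"]))
        message = span.get("attributes", {}).get("exception.message")
        report.append((path, message))
    return report


# Hypothetical trace with one failing database call.
sample = [
    {"span_id": "aaaaaaaaaaaaaaaa", "parent_span_id": "0000000000000000",
     "name": "GET /checkout", "status_code": 0, "attributes": {}},
    {"span_id": "bbbbbbbbbbbbbbbb", "parent_span_id": "aaaaaaaaaaaaaaaa",
     "name": "charge", "status_code": 1, "attributes": {}},
    {"span_id": "cccccccccccccccc", "parent_span_id": "bbbbbbbbbbbbbbbb",
     "name": "INSERT payments", "status_code": 2,
     "attributes": {"exception.type": "TimeoutError",
                    "exception.message": "connection timed out"}},
]

for path, message in error_report(sample):
    print(f"{path}: {message}")
```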

"Did the request hit service X?"

  1. Run print_summary.py — it prints the set of services involved in the trace.
  2. If service X is missing, the request never reached it (or instrumentation is missing — check apm-services-list to confirm X has emitted spans recently at all).

"What's different about the bad spans?" (over-represented values)

  1. Scope to the bad population: filterGroup with status_code = Error, or a duration threshold.
  2. Discover candidate keys with apm-attributes-list — typical suspects: server.address, http.response.status_code, db.system, resource keys like k8s.pod.name / service.version.
  3. Run apm-attribute-breakdown per candidate key on the bad set. A value owning most of the count is the signature.
  4. Confirm over-representation: re-run without the bad-set filter (or compare error_count / count per row). A value at 95% of errors but 10% of traffic is the culprit; one at 95% of both is just volume.
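
The comparison in step 4 comes down to two shares per value. The sketch below assumes you have already reduced two apm-attribute-breakdown results (one on the bad set, one on all traffic) to plain value-to-count dicts; that reduction, the function name, and the pod names are assumptions, not the tool's response format.

```python
def over_representation(bad_counts, all_counts):
    """Rank attribute values by (share of bad spans) / (share of all traffic)."""
    bad_total = sum(bad_counts.values())
    all_total = sum(all_counts.values())
    if bad_total == 0:
        return []
    rows = []
    for value, bad in bad_counts.items():
        bad_share = bad / bad_total
        traffic_share = all_counts.get(value, 0) / all_total if all_total else 0.0
        # A lift near 1 means the value is just volume; well above 1 is a signature.
        lift = bad_share / traffic_share if traffic_share else float("inf")
        rows.append((value, bad_share, traffic_share, lift))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical counts: pod-a owns 95% of errors but only 10% of traffic.
bad = {"pod-a": 95, "pod-b": 5}
everything = {"pod-a": 100, "pod-b": 900}

for value, bad_share, traffic_share, lift in over_representation(bad, everything):
    print(f"{value}: {bad_share:.0%} of errors, {traffic_share:.0%} of traffic, lift {lift:.1f}")
```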

"When did it spike?" (trends over time)

  1. apm-spans-sparkline with your filters → total counts per time bucket (zero-filled, ~50 adaptive buckets per window).
  2. The same call with statusCodes: [2] → error counts per bucket.
  3. Error rate per bucket = errors / total; the bucket where the ratio jumps is when the spike started.
  4. Zoom in: re-run with a narrower dateRange around that bucket, then pull raw spans via query-apm-spans.
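
Step 3 is a per-bucket division followed by a search for the biggest jump. This sketch assumes both sparkline results have been reduced to lists of (bucket_start, count) pairs over the same buckets; that shape, the function names, and the sample numbers are hypothetical.

```python
def error_rates(total, errors):
    """Pair up two zero-filled series and return (bucket_start, error rate)."""
    rates = []
    for (bucket, total_count), (_, error_count) in zip(total, errors):
        # Zero-filled buckets with no traffic get a rate of 0 rather than dividing by zero.
        rates.append((bucket, error_count / total_count if total_count else 0.0))
    return rates


def spike_start(rates):
    """Return the bucket with the largest rise in error rate over its predecessor."""
    best_bucket, best_jump = None, 0.0
    for (_, previous), (bucket, current) in zip(rates, rates[1:]):
        if current - previous > best_jump:
            best_bucket, best_jump = bucket, current - previous
    return best_bucket


# Hypothetical 5-minute buckets.
total = [("00:00", 100), ("00:05", 100), ("00:10", 100), ("00:15", 0)]
errors = [("00:00", 1), ("00:05", 2), ("00:10", 40), ("00:15", 0)]

rates = error_rates(total, errors)
print(spike_start(rates))
```

The returned bucket is the natural centre for the narrower dateRange in step 4.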

"What does the latency distribution look like?"

  1. apm-spans-duration-histogram → trace counts per log-scale (1-2-5 series) duration bucket of the ROOT span.
  2. A second hump or a fat tail = a distinct slow population; note its bucket_ns range.
  3. Fetch the actual slow traces with query-apm-spans using a duration filter (nanoseconds) and orderBy: "duration".

"Did the fan-out look right?"

  1. print_timeline.py shows the indentation — wide trees mean parallel calls, deep trees mean sequential dependencies.
  2. Look for spans of kind Client (3) followed by matching Server (2) spans on the called service — that's a synchronous downstream call.
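
When the timeline is too long to eyeball, width and depth can be measured directly from the flat span list. The following is a rough sketch with hypothetical helper names and sample data; it only relies on span_id, parent_span_id, and name.

```python
from collections import Counter


def fan_out_stats(spans):
    """Return the deepest nesting level and the span with the most direct children."""
    by_id = {s["span_id"]: s for s in spans}
    child_counts = Counter(
        s["parent_span_id"] for s in spans if s["parent_span_id"] in by_id
    )

    def depth(span):
        levels, seen = 0, set()
        # Climb until the parent ID matches no span (the root) or a cycle appears.
        while span["parent_span_id"] in by_id and span["span_id"] not in seen:
            seen.add(span["span_id"])
            span = by_id[span["parent_span_id"]]
            levels += 1
        return levels

    widest = child_counts.most_common(1)
    return {
        "max_depth": max((depth(s) for s in spans), default=0),
        "max_children": widest[0][1] if widest else 0,
        "widest_parent": by_id[widest[0][0]]["name"] if widest else None,
    }


# Hypothetical trace: one root fanning out to three calls, one of which nests once more.
sample = [
    {"span_id": "a", "parent_span_id": "0000000000000000", "name": "GET /dashboard"},
    {"span_id": "b", "parent_span_id": "a", "name": "load users"},
    {"span_id": "c", "parent_span_id": "a", "name": "load orders"},
    {"span_id": "d", "parent_span_id": "a", "name": "load billing"},
    {"span_id": "e", "parent_span_id": "d", "name": "SELECT invoices"},
]

print(fan_out_stats(sample))
```

A large max_children points at fan-out (possibly parallel calls); a large max_depth points at a chain of sequential dependencies.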

Searching by attribute (e.g. http.method=POST)

Each span carries an attributes map (span-level OTel attributes like http.method, db.statement) in the payload — so for a span you already have, just read it. Resource attributes (k8s labels, service.version) are not in the payload. To filter the whole dataset by an attribute:

  1. Use apm-attributes-list / apm-attribute-values-list to discover keys and values (resource attributes especially).
  2. Re-issue query-apm-spans with a filterGroup entry of type span_attribute or span_resource_attribute.

Constructing UI links

apm-trace-get and query-apm-spans return _posthogUrl. Always surface this to the user so they can verify in the PostHog UI.

When presenting findings, include the relevant PostHog URL.

Finding traces

Use posthog:query-apm-spans to search and filter spans. Note this returns spans, not a tree — pass query.traceId or grab a trace_id from the results and feed it to apm-trace-get for the tree.

Discover before filtering

Before constructing filters, discover what's actually in the project:

  1. Confirm services exist — call apm-services-list to see which services have emitted spans.
  2. Find filterable attributes — call apm-attributes-list with attribute_type: "span" or "resource".
  3. Get actual values — call apm-attribute-values-list with a key to see the real values in use.

Only then construct query-apm-spans filters. Custom attributes vary per project and cannot be guessed.

By filters

posthog:query-apm-spans
{
  "query": {
    "serviceNames": ["api-gateway"],
    "dateRange": {"date_from": "-1h"},
    "filterGroup": [
      {"key": "http.status_code", "operator": "gt", "type": "span_attribute", "value": "499"}
    ]
  }
}

By trace ID (when known)

posthog:apm-trace-get
{
  "trace_id": "0123456789abcdef0123456789abcdef"
}

Common gotchas

  • Durations are nanoseconds. 1 second = 1_000_000_000. Filter values in query-apm-spans for duration are also nanoseconds.
  • status_code == 2 is Error. 0 is Unset, 1 is OK. In the UI filter, the OK option matches both 0 and 1.
  • kind is an integer 0–5: 0 Unspecified, 1 Internal, 2 Server, 3 Client, 4 Producer, 5 Consumer.
  • parent_span_id of a root span is "0000000000000000" (16 zero hex chars, matching the 8-byte span ID width — not the 16-byte trace ID width), not null.
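
These conventions are easy to encode once and reuse. The sketch below is illustrative: the constant and function names are hypothetical, while the enum values and the nanosecond unit are the ones listed above.

```python
NANOS_PER_SECOND = 1_000_000_000

KIND = {0: "Unspecified", 1: "Internal", 2: "Server", 3: "Client", 4: "Producer", 5: "Consumer"}
STATUS = {0: "Unset", 1: "OK", 2: "Error"}


def seconds_to_nanos(seconds):
    """Convert a human-friendly threshold into the nanoseconds the filters expect."""
    return int(round(seconds * NANOS_PER_SECOND))


def describe(span):
    """Render the integer enums and nanosecond duration of one span as text."""
    kind = KIND.get(span["kind"], "Unknown")
    status = STATUS.get(span["status_code"], "Unknown")
    millis = span["duration_nano"] / 1_000_000
    root = " (root)" if span.get("is_root_span") else ""
    return f"{span['name']}: {kind}, {status}, {millis:.1f} ms{root}"


# Hypothetical root span that failed after 250 ms.
span = {"name": "GET /api/users", "kind": 2, "status_code": 2,
        "duration_nano": 250_000_000, "is_root_span": True}

print(seconds_to_nanos(0.5))
print(describe(span))
```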

Parsing large trace results

Trace tool results are JSON. When too large to read inline, Claude Code persists them to a file.

Persisted file format

[{ "type": "text", "text": "{\"results\": [...], \"_posthogUrl\": \"...\"}" }]

Every script in scripts/ unwraps this envelope before parsing.
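
For ad-hoc parsing outside the bundled scripts, the unwrapping step might look like this. It is a sketch rather than the scripts' actual code: the function names are hypothetical, and it assumes the payload sits in the first text item of the envelope, as in the format shown above.

```python
import json


def unwrap_envelope(data):
    """Return the trace payload, unwrapping the persisted-file envelope if present."""
    if (
        isinstance(data, list)
        and data
        and isinstance(data[0], dict)
        and data[0].get("type") == "text"
    ):
        # The inner "text" field is itself a JSON-encoded string.
        return json.loads(data[0]["text"])
    return data  # already a bare {"results": [...]} payload


def load_trace(path):
    """Read a persisted tool result from disk and return the unwrapped payload."""
    with open(path, encoding="utf-8") as handle:
        return unwrap_envelope(json.load(handle))


# Hypothetical payload wrapped the way a persisted file would hold it.
payload = {"results": [{"name": "GET /api/users"}], "_posthogUrl": "https://example.invalid/trace"}
envelope = [{"type": "text", "text": json.dumps(payload)}]

trace = unwrap_envelope(envelope)
print(len(trace["results"]), trace["_posthogUrl"])
```

Passing an already-unwrapped payload through unwrap_envelope returns it unchanged, so the same code path handles inline and persisted results.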

Trace JSON structure

results (array of span dicts)
  └── each span:
        ├── uuid, trace_id, span_id, parent_span_id (hex strings)
        ├── name, kind (int 0–5), service_name
        ├── status_code (int 0–2), is_root_span (bool)
        ├── timestamp, end_time (ISO 8601)
        ├── duration_nano (int, nanoseconds)
        ├── attributes (map of span-level OTel attributes, e.g. db.statement, http.url)
        └── matched_filter (0/1 — 1 if this span matched the query-apm-spans filter, 0 if it
            only shares a trace with a match; always present, only meaningful from query-apm-spans)

Available scripts

Script | Purpose | Usage
--- | --- | ---
print_summary.py | Trace metadata, services, slowest spans, errors | python3 scripts/print_summary.py FILE
print_timeline.py | DFS-indented tree from parent_span_id walk | python3 scripts/print_timeline.py FILE
extract_span.py | Full row + parent/children for spans matching a name | SPAN="name" python3 scripts/extract_span.py FILE
search_spans.py | Find a keyword across name, service_name, IDs | SEARCH="kw" python3 scripts/search_spans.py FILE
show_structure.py | Show JSON keys and types without values | python3 scripts/show_structure.py FILE

Tips

  • Always set dateRange on query-apm-spans — queries without a time range are slow. Default is -1h; widen only when needed.
  • Always include the _posthogUrl in your response so the user can click through.
  • Span-level attributes are in the apm-trace-get / query-apm-spans payload (each span's attributes map). Resource attributes are not — use apm-attributes-list (type resource) and apm-attribute-values-list for those.
  • is_root_span is the cheap way to find the trace entry — don't string-match 00000000….
  • For aggregates (p95 by operation, slowest children of a span), use apm-spans-aggregate for a flat view or apm-spans-tree for parent→child edges — don't reach for SQL.