exploring-apm-traces
npx skills add https://github.com/posthog/ai-plugin --skill exploring-apm-traces
Exploring APM traces (OpenTelemetry spans)
PostHog captures distributed traces from OpenTelemetry. Each trace is a tree of spans representing a request's path through services.
Disambiguation: This skill is for APM / OpenTelemetry traces. Do not confuse with AI observability traces (agent/model $ai_* events) or logs (posthog:query-logs, posthog:logs-*).
Available tools
| Tool | Purpose |
|---|---|
| posthog:query-apm-spans | Search and filter spans (compact list view) |
| posthog:apm-trace-get | Get the full span list for one hex trace_id |
| posthog:apm-spans-aggregate | Per-operation aggregates (count, p50/p95, errors) |
| posthog:apm-spans-tree | Call-tree aggregates per (parent, child) edge |
| posthog:apm-spans-count | Scalar span count, a cheap filter pre-flight |
| posthog:apm-spans-sparkline | Span counts over time (zero-filled time series) |
| posthog:apm-spans-duration-histogram | Trace counts per log-scale duration bucket |
| posthog:apm-attribute-breakdown | Span counts grouped by one attribute's value |
| posthog:apm-services-list | List distinct service names |
| posthog:apm-attributes-list | List span or resource attribute keys |
| posthog:apm-attribute-values-list | List values for a specific attribute key |
See references/spans-and-fields.md for the response schema and the kind/status_code enums.
Workflow: debug a trace from a URL
Step 1 — Fetch the trace
posthog:apm-trace-get
{
"trace_id": "<hex_trace_id>"
}
The response is { results: [span, span, …] } — a flat list of every span in the trace.
The list can be very large for fan-out request flows; when it exceeds the inline limit, Claude Code auto-persists it to a file.
From the result you get:
- Every span with name, service_name, kind, status_code, parent_span_id, duration_nano, is_root_span
- The _posthogUrl: always include this in your response so the user can click through to the UI
Step 2 — Parse large results with scripts
When the result is persisted to a file (traces with hundreds of spans across services), use the parsing scripts to explore it.
Start with the summary to get the full picture, then drill into specifics:
# 1. Overview: services, span count, slowest spans, errors
python3 scripts/print_summary.py /path/to/persisted-file.json
# 2. Indented chronological tree (DFS by parent_span_id)
python3 scripts/print_timeline.py /path/to/persisted-file.json
# 3. Drill into a specific span by name
SPAN="HTTP GET /api/users" python3 scripts/extract_span.py /path/to/persisted-file.json
# 4. Search for a keyword across span names, services, IDs
SEARCH="keyword" python3 scripts/search_spans.py /path/to/persisted-file.json
# 5. When the JSON shape looks unfamiliar
python3 scripts/show_structure.py /path/to/persisted-file.json
All scripts support MAX_LEN=N env var to control truncation (0 = unlimited).
Tree reconstruction (parent_span_id → span_id)
The flat span list is a tree. Each span carries:
- trace_id: same on every span in the trace
- span_id: this span's unique hex ID
- parent_span_id: points to the parent's span_id (zero-padded hex 000…000 for the root)
- is_root_span: convenience flag for the trace entry
To rebuild the tree:
- Spans where is_root_span is true (or parent_span_id == "00000000…") are root spans.
- Every other span is a child of the span whose span_id matches its parent_span_id.
- Group by parent_span_id, walk from each root downward.
scripts/print_timeline.py does this for you and prints a DFS-indented tree.
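The rebuild steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in scripts/print_timeline.py: the helper names build_tree and render are hypothetical, the sample spans are made up, and children are kept in input order rather than sorted by timestamp. Spans whose parent is missing from the list are simply not reached by the walk.

```python
from collections import defaultdict

ROOT_PARENT = "0" * 16  # all-zero parent_span_id marks a root span


def build_tree(spans):
    """Return (roots, children), where children maps span_id -> child spans."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span.get("is_root_span") or span.get("parent_span_id") == ROOT_PARENT:
            roots.append(span)
        else:
            children[span["parent_span_id"]].append(span)
    return roots, children


def render(spans):
    """Depth-first walk producing one indented line per reachable span."""
    roots, children = build_tree(spans)
    lines = []
    stack = [(root, 0) for root in reversed(roots)]
    while stack:
        span, depth = stack.pop()
        lines.append("  " * depth + f"{span['name']} [{span['service_name']}]")
        # Push children reversed so they pop in their original order.
        stack.extend((c, depth + 1) for c in reversed(children[span["span_id"]]))
    return lines


# Made-up spans for illustration only.
tree_example = [
    {"span_id": "a", "parent_span_id": ROOT_PARENT, "is_root_span": True,
     "name": "GET /checkout", "service_name": "gateway"},
    {"span_id": "b", "parent_span_id": "a", "is_root_span": False,
     "name": "charge", "service_name": "payments"},
    {"span_id": "c", "parent_span_id": "b", "is_root_span": False,
     "name": "INSERT", "service_name": "db"},
    {"span_id": "d", "parent_span_id": "a", "is_root_span": False,
     "name": "send_email", "service_name": "mailer"},
]
print("\n".join(render(tree_example)))
```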
Investigation patterns
"Where is time going?"
- Every span from apm-trace-get carries self_time_nano, the duration not covered by children. Sort by it: the top span is where wall-clock time actually went. A parent with large self_time_nano is an uninstrumented gap (the work happened inside it, not in any recorded child).
- Run print_summary.py: it surfaces the top-5 slowest spans by duration_nano.
- For a noisy trace, run print_timeline.py and scan the indented durations. You can see whether time is dominated by one child span or by fan-out across many.
- To dig into one slow span, run SPAN="<name>" python3 scripts/extract_span.py FILE.
- For aggregate "which child dominates" questions use apm-spans-tree and read calls_per_parent_invocation: it separates a child that's slow per call from one that merely runs 20× per parent.
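Sorting by self time, as described in the first step above, might look like the following sketch. The helper names top_self_time and describe are hypothetical and the sample spans are invented. A missing self_time_nano is treated as 0, which is an assumption rather than documented behaviour.

```python
def top_self_time(spans, n=5):
    """Return the n spans with the largest self_time_nano, largest first."""
    ranked = sorted(spans, key=lambda s: s.get("self_time_nano", 0), reverse=True)
    return ranked[:n]


def describe(span):
    """Format self time against total duration, both converted from ns to ms."""
    self_ms = span.get("self_time_nano", 0) / 1_000_000
    total_ms = span["duration_nano"] / 1_000_000
    return f"{span['name']}: {self_ms:.0f} ms self of {total_ms:.0f} ms total"


# Invented spans: the root is long overall but spends little time itself.
self_time_example = [
    {"name": "GET /checkout", "duration_nano": 900_000_000, "self_time_nano": 50_000_000},
    {"name": "charge", "duration_nano": 700_000_000, "self_time_nano": 600_000_000},
    {"name": "INSERT", "duration_nano": 100_000_000, "self_time_nano": 100_000_000},
]
for span in top_self_time(self_time_example, n=2):
    print(describe(span))
```

Here the root span has the longest duration_nano, yet "charge" ranks first because most of its time is not covered by any child.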
"Where did the error happen?"
- print_summary.py lists every span with status_code == 2 (Error). Each entry shows service, span name, and parent context.
- Walk up the tree from an error span via parent_span_id to see what request path led there.
- Error detail lives in each span's attributes map (e.g. exception.message, exception.type), which is returned in the trace payload, so read it directly off the error span. apm-attribute-values-list is for discovering values across spans, not a prerequisite for reading one span's attributes.
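A small sketch of the walk-up described above, assuming the span fields listed in this document. The helper names path_to_root and error_report are hypothetical, and the sample trace is invented. The seen set guards against a malformed parent loop.

```python
STATUS_ERROR = 2  # status_code value for Error


def path_to_root(span, by_id):
    """Follow parent_span_id links upward; return span names root-first."""
    path, seen = [], set()
    while span is not None and span["span_id"] not in seen:
        seen.add(span["span_id"])
        path.append(span["name"])
        # The root's all-zero parent is not in by_id, so the loop ends there.
        span = by_id.get(span["parent_span_id"])
    return list(reversed(path))


def error_report(spans):
    """One entry per error span: request path plus exception attributes."""
    by_id = {s["span_id"]: s for s in spans}
    report = []
    for s in spans:
        if s["status_code"] == STATUS_ERROR:
            attrs = s.get("attributes", {})
            report.append({
                "path": " > ".join(path_to_root(s, by_id)),
                "type": attrs.get("exception.type"),
                "message": attrs.get("exception.message"),
            })
    return report


# Invented two-span trace with one failing child.
error_example = [
    {"span_id": "a", "parent_span_id": "0000000000000000", "name": "GET /checkout",
     "status_code": 0, "attributes": {}},
    {"span_id": "b", "parent_span_id": "a", "name": "charge", "status_code": 2,
     "attributes": {"exception.type": "TimeoutError",
                    "exception.message": "upstream timed out"}},
]
print(error_report(error_example))
```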
"Did the request hit service X?"
- Run print_summary.py: it prints the set of services involved in the trace.
- If service X is missing, the request never reached it (or instrumentation is missing; check apm-services-list to confirm X has emitted spans recently at all).
"What's different about the bad spans?" (over-represented values)
- Scope to the bad population: filterGroup with status_code = Error, or a duration threshold.
- Discover candidate keys with apm-attributes-list. Typical suspects: server.address, http.response.status_code, db.system, resource keys like k8s.pod.name / service.version.
- Run apm-attribute-breakdown per candidate key on the bad set. A value owning most of the count is the signature.
- Confirm over-representation: re-run without the bad-set filter (or compare error_count / count per row). A value at 95% of errors but 10% of traffic is the culprit; one at 95% of both is just volume.
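The over-representation check in the last step is a ratio of two shares. The sketch below assumes you have already reduced two apm-attribute-breakdown results to plain value → count dicts; that shape, the helper name over_representation, and the term "lift" are illustrative choices, not part of the tool's response.

```python
def over_representation(bad_counts, all_counts):
    """Rows of (value, bad_share, traffic_share, lift), highest lift first.

    lift = share of bad spans / share of all spans for the same value.
    """
    bad_total = sum(bad_counts.values())
    all_total = sum(all_counts.values())
    rows = []
    for value, bad in bad_counts.items():
        total = all_counts.get(value, 0)
        if not bad_total or not all_total or not total:
            continue  # cannot compute a share without a denominator
        bad_share = bad / bad_total
        traffic_share = total / all_total
        rows.append((value, bad_share, traffic_share, bad_share / traffic_share))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Invented counts: pod-a has 95% of errors but only 10% of traffic.
bad_by_pod = {"pod-a": 95, "pod-b": 5}
all_by_pod = {"pod-a": 100, "pod-b": 900}
for value, bad_share, traffic_share, lift in over_representation(bad_by_pod, all_by_pod):
    print(f"{value}: {bad_share:.0%} of errors, {traffic_share:.0%} of traffic, lift {lift:.1f}")
```

A lift well above 1 marks a value that is over-represented in the bad set; a lift near 1 means the value is simply common.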
"When did it spike?" (trends over time)
- apm-spans-sparkline with your filters → total counts per time bucket (zero-filled, ~50 adaptive buckets per window).
- The same call with statusCodes: [2] → error counts per bucket.
- Error rate per bucket = errors / total; the bucket where the ratio jumps is when the spike started.
- Zoom in: re-run with a narrower dateRange around that bucket, then pull raw spans via query-apm-spans.
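The per-bucket arithmetic above can be sketched as follows, assuming the two sparkline results have been reduced to equal-length lists of counts over the same buckets (zero-filling makes them line up). The list shape and the helper names error_rates and biggest_jump are assumptions for illustration.

```python
def error_rates(totals, errors):
    """Error rate per bucket; buckets with no traffic are reported as 0.0."""
    return [e / t if t else 0.0 for t, e in zip(totals, errors)]


def biggest_jump(rates):
    """Index of the bucket with the largest increase over its predecessor."""
    if len(rates) < 2:
        return None
    deltas = [rates[i] - rates[i - 1] for i in range(1, len(rates))]
    return deltas.index(max(deltas)) + 1


# Invented counts for four consecutive buckets.
bucket_totals = [100, 100, 100, 100]
bucket_errors = [1, 2, 30, 28]
rates = error_rates(bucket_totals, bucket_errors)
print(rates, "spike starts at bucket", biggest_jump(rates))
```

The returned index identifies the bucket to centre a narrower dateRange on.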
"What does the latency distribution look like?"
- apm-spans-duration-histogram → trace counts per log-scale (1-2-5 series) duration bucket of the ROOT span.
- A second hump or a fat tail = a distinct slow population; note its bucket_ns range.
- Fetch the actual slow traces with query-apm-spans using a duration filter (nanoseconds) and orderBy: "duration".
"Did the fan-out look right?"
- print_timeline.py shows the indentation: wide trees mean parallel calls, deep trees mean sequential dependencies.
- Look for spans of kind Client (3) followed by matching Server (2) spans on the called service. That's a synchronous downstream call.
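One way to spot those Client → Server pairs programmatically is to look for Server spans whose parent is a Client span, which is how the pair appears when trace context is propagated between services. This is a sketch under that assumption; the helper name downstream_calls is hypothetical and the sample spans are invented.

```python
KIND_SERVER = 2
KIND_CLIENT = 3


def downstream_calls(spans):
    """Return (caller_service, callee_service, server_span_name) per matched pair."""
    by_id = {s["span_id"]: s for s in spans}
    calls = []
    for span in spans:
        parent = by_id.get(span["parent_span_id"])
        if span["kind"] == KIND_SERVER and parent and parent["kind"] == KIND_CLIENT:
            calls.append((parent["service_name"], span["service_name"], span["name"]))
    return calls


# Invented trace: gateway receives a request, then calls payments.
fanout_example = [
    {"span_id": "a", "parent_span_id": "0000000000000000", "kind": 2,
     "service_name": "gateway", "name": "GET /checkout"},
    {"span_id": "b", "parent_span_id": "a", "kind": 3,
     "service_name": "gateway", "name": "POST payments/charge"},
    {"span_id": "c", "parent_span_id": "b", "kind": 2,
     "service_name": "payments", "name": "POST /charge"},
]
print(downstream_calls(fanout_example))
```

A Client span with no matching Server child may indicate that the called service is not instrumented or that context was not propagated.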
Searching by attribute (e.g. http.method=POST)
Each span carries an attributes map (span-level OTel attributes like http.method, db.statement) in the payload — so for a span you already have, just read it. Resource attributes (k8s labels, service.version) are not in the payload. To filter the whole dataset by an attribute:
- Use apm-attributes-list / apm-attribute-values-list to discover keys and values (resource attributes especially).
- Re-issue query-apm-spans with a filterGroup entry of type span_attribute or span_resource_attribute.
Constructing UI links
apm-trace-get and query-apm-spans return _posthogUrl — always surface this to the user so they can verify in the PostHog UI.
When presenting findings, include the relevant PostHog URL.
Finding traces
Use posthog:query-apm-spans to search and filter spans. Note this returns spans, not a tree — pass query.traceId or grab a trace_id from the results and feed it to apm-trace-get for the tree.
Discover before filtering
Before constructing filters, discover what's actually in the project:
- Confirm services exist: call apm-services-list to see which services have emitted spans.
- Find filterable attributes: call apm-attributes-list with attribute_type: "span" or "resource".
- Get actual values: call apm-attribute-values-list with a key to see the real values in use.
Only then construct query-apm-spans filters. Custom attributes vary per project and cannot be guessed.
By filters
posthog:query-apm-spans
{
  "query": {
    "serviceNames": ["api-gateway"],
    "dateRange": {"date_from": "-1h"},
    "filterGroup": [
      {"key": "http.status_code", "operator": "gt", "type": "span_attribute", "value": "499"}
    ]
  }
}
By trace ID (when known)
posthog:apm-trace-get
{
"trace_id": "0123456789abcdef0123456789abcdef"
}
Common gotchas
- Durations are nanoseconds. 1 second = 1_000_000_000. Filter values in query-apm-spans for duration are also nanoseconds.
- status_code == 2 is Error. 0 is Unset, 1 is OK. Use OK to match {0, 1} in the UI filter.
- kind is an integer 0–5: 0 Unspecified, 1 Internal, 2 Server, 3 Client, 4 Producer, 5 Consumer.
- parent_span_id of a root span is "0000000000000000" (16 zero hex chars, matching the 8-byte span ID width, not the 16-byte trace ID width), not null.
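When post-processing span JSON yourself, it can help to name these integers instead of comparing against bare numbers. The following is an illustrative sketch; the enum classes and helper names are hypothetical and are not part of the bundled scripts, but the values mirror the list above.

```python
from enum import IntEnum

NANOS_PER_SECOND = 1_000_000_000


class StatusCode(IntEnum):
    UNSET = 0
    OK = 1
    ERROR = 2


class SpanKind(IntEnum):
    UNSPECIFIED = 0
    INTERNAL = 1
    SERVER = 2
    CLIENT = 3
    PRODUCER = 4
    CONSUMER = 5


def seconds_to_nanos(seconds):
    """Convert seconds to the integer nanoseconds used by duration fields."""
    return round(seconds * NANOS_PER_SECOND)


def label(span):
    """Readable 'KIND / STATUS' tag for one span dict."""
    return f"{SpanKind(span['kind']).name} / {StatusCode(span['status_code']).name}"


print(seconds_to_nanos(0.25))
print(label({"kind": 3, "status_code": 2}))
```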
Parsing large trace results
Trace tool results are JSON. When too large to read inline, Claude Code persists them to a file.
Persisted file format
[{ "type": "text", "text": "{\"results\": [...], \"_posthogUrl\": \"...\"}" }]
Every script in scripts/ unwraps this envelope before parsing.
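For ad-hoc parsing outside the bundled scripts, the unwrapping step can be sketched like this, assuming the envelope shape shown above. The helper name load_spans is hypothetical, and the fallback for an unwrapped {"results": [...]} object is an added convenience rather than documented behaviour. The inline payload stands in for a persisted file's contents.

```python
import json


def load_spans(raw):
    """Return (spans, posthog_url) from the text of a persisted tool result."""
    data = json.loads(raw)
    # Persisted files wrap the payload as [{"type": "text", "text": "<json string>"}].
    if isinstance(data, list) and data and isinstance(data[0], dict) and "text" in data[0]:
        data = json.loads(data[0]["text"])
    return data.get("results", []), data.get("_posthogUrl")


# Invented payload for illustration only.
envelope_payload = {
    "results": [{"span_id": "a1", "name": "GET /"}],
    "_posthogUrl": "https://example.invalid/trace",
}
envelope_text = json.dumps([{"type": "text", "text": json.dumps(envelope_payload)}])
spans, url = load_spans(envelope_text)
print(len(spans), url)
```

For a real file, read its contents with open(path).read() and pass the string to the same function.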
Trace JSON structure
results (array of span dicts)
└── each span:
├── uuid, trace_id, span_id, parent_span_id (hex strings)
├── name, kind (int 0–5), service_name
├── status_code (int 0–2), is_root_span (bool)
├── timestamp, end_time (ISO 8601)
├── duration_nano (int, nanoseconds)
├── attributes (map of span-level OTel attributes, e.g. db.statement, http.url)
└── matched_filter (0/1 — 1 if this span matched the query-apm-spans filter, 0 if it
only shares a trace with a match; always present, only meaningful from query-apm-spans)
Available scripts
| Script | Purpose | Usage |
|---|---|---|
| print_summary.py | Trace metadata, services, slowest spans, errors | python3 scripts/print_summary.py FILE |
| print_timeline.py | DFS-indented tree from parent_span_id walk | python3 scripts/print_timeline.py FILE |
| extract_span.py | Full row + parent/children for spans matching a name | SPAN="name" python3 scripts/extract_span.py FILE |
| search_spans.py | Find a keyword across name, service_name, IDs | SEARCH="kw" python3 scripts/search_spans.py FILE |
| show_structure.py | Show JSON keys and types without values | python3 scripts/show_structure.py FILE |
Tips
- Always set dateRange on query-apm-spans: queries without a time range are slow. Default is -1h; widen only when needed.
- Always include the _posthogUrl in your response so the user can click through.
- Span-level attributes are in the apm-trace-get / query-apm-spans payload (each span's attributes map). Resource attributes are not; use apm-attributes-list (type resource) and apm-attribute-values-list for those.
- is_root_span is the cheap way to find the trace entry; don't string-match 00000000….
- For aggregates (p95 by operation, slowest children of a span), use apm-spans-aggregate for a flat view or apm-spans-tree for parent→child edges; don't reach for SQL.