find-traces
Analyze OpenTelemetry distributed traces from Axiom. Use when investigating a trace ID, finding traces by criteria (errors, latency, service), or debugging…
npx skills add https://github.com/axiomhq/cli --skill find-traces
Trace Analysis
Analyze OpenTelemetry distributed traces to identify errors, latency issues, and root causes.
Arguments
When invoked with a trace ID (e.g., /find-traces abc123...), the ID is available as $ARGUMENTS.
Trace Dataset Discovery
First, find trace datasets:
axiom dataset list -f json
Look for datasets containing trace data (often named *traces*, *spans*, or otel-*).
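If the dataset list is long, the name hints above can be applied mechanically. The following is a minimal Python sketch, assuming you have already extracted the dataset names into a list; the helper name and the sample names are hypothetical, and real trace datasets may not follow these patterns.

```python
from fnmatch import fnmatch

# Name patterns taken from the hint above; real dataset names may differ.
TRACE_PATTERNS = ("*traces*", "*spans*", "otel-*")

def likely_trace_datasets(names):
    """Return the dataset names that match any of the trace-like patterns."""
    return [n for n in names if any(fnmatch(n.lower(), p) for p in TRACE_PATTERNS)]

# Hypothetical dataset names, for illustration only.
print(likely_trace_datasets(["prod-traces", "nginx-logs", "otel-demo", "app_spans"]))
```

A name match is only a hint; confirm the candidate with the schema check before querying it.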
Schema Discovery
Always verify field names first:
axiom query "['<trace-dataset>'] | getschema" --start-time -1h
Common Operations
Get Trace by ID
axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| sort by _time asc
| limit 100" --start-time -1h -f json
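When the trace ID comes from user input such as $ARGUMENTS, it can help to validate it before splicing it into the query. A minimal Python sketch follows, assuming the ID is a hex string; the function name is hypothetical, and the command only uses the flags shown above. It builds the argument list without running anything.

```python
import re

def trace_query_argv(dataset, trace_id, start="-1h"):
    """Build the argv for the 'Get Trace by ID' query without running it."""
    # Reject anything that is not plain hex so the ID cannot break out of the APL string.
    if not re.fullmatch(r"[0-9a-fA-F]+", trace_id):
        raise ValueError(f"not a hex trace ID: {trace_id!r}")
    apl = (
        f"['{dataset}']\n"
        f"| where trace_id == '{trace_id}'\n"
        "| sort by _time asc\n"
        "| limit 100"
    )
    return ["axiom", "query", apl, "--start-time", start, "-f", "json"]

# Hypothetical dataset name and ID, for illustration only.
print(trace_query_argv("otel-traces", "abc123"))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues. The dataset name is not validated here and is assumed to be trusted.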
Find Error Traces
axiom query "['<dataset>']
| where _time >= ago(1h)
| extend error = coalesce(ensure_field(\"error\", typeof(bool)), false)
| where error == true
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    error_count = countif(error),
    services = make_set(['service.name']),
    root_operation = arg_min(_time, name)
  by trace_id
| sort by start_time desc
| limit 20" --start-time -1h -f json
Find Slow Traces
axiom query "['<dataset>']
| where _time >= ago(1h)
| where duration >= 1000000000
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    services = make_set(['service.name'])
  by trace_id
| sort by total_duration desc
| limit 20" --start-time -1h -f json
Find Traces by Service
axiom query "['<dataset>']
| where _time >= ago(1h)
| where ['service.name'] == '<SERVICE>'
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    error_count = countif(error == true)
  by trace_id
| sort by start_time desc
| limit 20" --start-time -1h -f json
Error Spans in Trace
axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| where error == true
| project _time, ['service.name'], name, duration, ['status.message']" --start-time -1h -f json
Critical Path Analysis
axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| project span_id, parent_span_id, ['service.name'], name, duration, error
| sort by duration desc" --start-time -1h -f json
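The query above returns a flat list sorted by duration; the parent-child links still have to be followed to see which chain of spans dominates the trace. Below is a minimal Python sketch of one heuristic, assuming the rows have been loaded as dicts keyed by the projected field names. It compares siblings by duration only and ignores start times, so treat the result as an approximation of the critical path. The function name and sample spans are hypothetical.

```python
from collections import defaultdict

def slowest_chain(spans):
    """Follow the longest-duration child from the root span downward.

    Heuristic only: parallel children are compared by duration, not by end time.
    """
    ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parent_span_id")
        if parent and parent in ids:
            children[parent].append(s)
        else:
            roots.append(s)  # no parent, or parent missing from the result set
    if not roots:
        return []
    node = max(roots, key=lambda s: s["duration"])
    chain = [node]
    while children[node["span_id"]]:
        node = max(children[node["span_id"]], key=lambda s: s["duration"])
        chain.append(node)
    return chain

# Hypothetical spans, for illustration only (durations in nanoseconds).
spans = [
    {"span_id": "a", "parent_span_id": "", "name": "GET /checkout", "duration": 900},
    {"span_id": "b", "parent_span_id": "a", "name": "auth.Verify", "duration": 100},
    {"span_id": "c", "parent_span_id": "a", "name": "db.Query", "duration": 700},
    {"span_id": "d", "parent_span_id": "c", "name": "pool.Acquire", "duration": 650},
]
print([s["name"] for s in slowest_chain(spans)])
```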
OTel Field Reference
| Field | Bracket? | Description |
|---|---|---|
| trace_id | No | 32-char trace identifier |
| span_id | No | 16-char span identifier |
| parent_span_id | No | Parent span (empty for root) |
| name | No | Operation name |
| duration | No | Duration in nanoseconds |
| kind | No | CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER |
| error | No | Boolean error flag |
| ['service.name'] | Yes | Service identifier |
| ['status.code'] | Yes | OK, ERROR, or nil |
| ['status.message'] | Yes | Error description |
| ['scope.name'] | Yes | Instrumentation library |
Duration Conversion
OTel durations are in nanoseconds:
| Human | Nanoseconds | Filter |
|---|---|---|
| 1 ms | 1,000,000 | duration >= 1000000 |
| 100 ms | 100,000,000 | duration >= 100000000 |
| 1 s | 1,000,000,000 | duration >= 1000000000 |
Convert for display:
| extend duration_ms = duration / 1000000.0
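For reports, the same conversion can be done after the query returns. A minimal Python sketch, assuming the duration arrives as an integer number of nanoseconds as described above; the function name is hypothetical.

```python
def format_duration_ns(ns):
    """Render a nanosecond duration using the largest unit that keeps the value >= 1."""
    for unit, size in (("s", 1_000_000_000), ("ms", 1_000_000), ("us", 1_000)):
        if ns >= size:
            return f"{ns / size:.2f} {unit}"
    return f"{ns} ns"

print(format_duration_ns(1_500_000_000))  # 1.50 s
print(format_duration_ns(250_000_000))    # 250.00 ms
print(format_duration_ns(42))             # 42 ns
```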
Custom Attributes
Non-standard span attributes are stored in the ['attributes.custom'] map:
// Filter by custom attribute
| where ['attributes.custom']['user_id'] == "123"
// Aggregation requires explicit cast
| summarize count() by tostring(['attributes.custom']['tenant'])
Without tostring(), aggregations fail with "grouping by field of type unknown".
Codebase Correlation
When working in a repository that matches the traced service, correlate trace data with source code to identify root causes.
Mapping Trace Data to Code
- Extract package/module path from ['scope.name']
  - Contains the instrumentation library or package path
  - Strip the module prefix to get the local path
  - Example: github.com/org/repo/pkg/auth → pkg/auth
- Find code from operation name
  - The name field often contains function names or HTTP routes
  - Search the codebase for matching handlers, functions, or endpoints
- Trace the call chain
  - Follow parent-child span relationships
  - Map each span to its corresponding code location
  - Identify where errors originate and propagate
Note: Codebase correlation is optional. Proceed with trace-only analysis if code is unavailable or doesn't match the traced services.
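The prefix-stripping step can be sketched in Python as follows. This assumes the scope name is a slash-separated package path and that you know the repository's module prefix (for a Go service, typically the module line in go.mod); the function name is hypothetical.

```python
def local_package_path(scope_name, module_prefix):
    """Strip the repository's module prefix from a scope name.

    Returns None when the scope does not belong to this module.
    """
    prefix = module_prefix.rstrip("/")
    if scope_name == prefix:
        return "."  # the scope is the module root itself
    if scope_name.startswith(prefix + "/"):
        return scope_name[len(prefix) + 1:]
    return None

print(local_package_path("github.com/org/repo/pkg/auth", "github.com/org/repo"))  # pkg/auth
```

A None result is a useful signal that the span comes from a dependency or another service, in which case trace-only analysis applies.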
Output Format
When analyzing a trace, provide:
## Trace Summary
- **Trace ID:** <id>
- **Duration:** <human-readable>
- **Services:** <list>
- **Outcome:** success/failure
## Sequence of Events
1. <Service> - <operation> (<duration>)
2. <Service> - <operation> (<duration>) ⚠️ ERROR
...
## Error Analysis
<What failed, when, why>
## Root Cause
<Deepest error and explanation>
## Codebase Locations (if applicable)
- **Service:** <service.name>
- **Package:** <scope.name>
- **Files:** <specific files to investigate>
## Recommended Actions
1. <Specific action>
2. <What to investigate next>
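The "Sequence of Events" lines can be generated from the span rows rather than written by hand. A minimal Python sketch, assuming the rows are dicts sorted by _time with the keys 'service.name', 'name', 'duration' (nanoseconds), and 'error'; the exact JSON shape returned by the CLI is an assumption here, and the function name and sample spans are hypothetical.

```python
def sequence_of_events(spans):
    """Format spans (already sorted by _time) as numbered 'Sequence of Events' lines."""
    lines = []
    for i, s in enumerate(spans, start=1):
        ms = s["duration"] / 1_000_000  # nanoseconds to milliseconds
        line = f"{i}. {s['service.name']} - {s['name']} ({ms:.1f} ms)"
        if s.get("error"):
            line += " ⚠️ ERROR"
        lines.append(line)
    return lines

# Hypothetical spans, for illustration only.
example = [
    {"service.name": "gateway", "name": "GET /checkout", "duration": 120_000_000, "error": False},
    {"service.name": "payments", "name": "charge", "duration": 95_500_000, "error": True},
]
print("\n".join(sequence_of_events(example)))
```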
When NOT to Use
- Metrics analysis: Traces are for request flow; use logs/metrics skills for aggregated performance data
- Non-OTel data: This skill assumes OpenTelemetry field conventions (trace_id, span_id, etc.)
- Known trace structure: If you already have the query, run it directly without invoking this skill
- Alerting on trace patterns: Use Axiom Monitors for continuous alerting
APL Reference
For query syntax, invoke the axiom-apl skill, which provides trace analysis patterns and duration unit guidance.