find-traces

by axiomhq

Analyzes OpenTelemetry distributed traces from Axiom. Use when investigating a trace ID, searching for traces by criteria (errors, latency, service), or debugging…

npx skills add https://github.com/axiomhq/cli --skill find-traces

Trace Analysis

Analyze OpenTelemetry distributed traces to identify errors, latency issues, and root causes.

Arguments

When invoked with a trace ID (e.g., /find-traces abc123...), the trace ID is available as $ARGUMENTS.

Trace Dataset Discovery

First, find trace datasets:

axiom dataset list -f json

Look for datasets containing trace data (often named *traces*, *spans*, or otel-*).
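The name check above can be scripted. The following is a minimal sketch, assuming the JSON output is an array of objects with a `name` key (that shape is an assumption, so verify it against the real CLI output); `likely_trace_datasets` and `TRACE_PATTERNS` are hypothetical names introduced for illustration.

```python
import json
from fnmatch import fnmatch

# Name patterns taken from the hint above; adjust for your own naming scheme.
TRACE_PATTERNS = ("*traces*", "*spans*", "otel-*")


def likely_trace_datasets(dataset_list_json: str) -> list[str]:
    """Return dataset names that look like they hold trace data.

    Assumption: the input is a JSON array of objects with a "name" key.
    """
    datasets = json.loads(dataset_list_json)
    names = [d["name"] for d in datasets]
    # Lowercase the name before matching so the check is case-insensitive.
    return [n for n in names if any(fnmatch(n.lower(), p) for p in TRACE_PATTERNS)]


# Hypothetical sample data, not real CLI output.
sample = '[{"name": "otel-demo"}, {"name": "nginx-logs"}, {"name": "prod-traces"}]'
print(likely_trace_datasets(sample))
```

A name match is only a hint; the schema check in the next section is what confirms that a dataset actually holds spans.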

Schema Discovery

Always verify field names first:

axiom query "['<trace-dataset>'] | getschema" --start-time -1h

Common Operations

Get Trace by ID

axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| sort by _time asc
| limit 100" --start-time -1h -f json

Find Error Traces

axiom query "['<dataset>']
| where _time >= ago(1h)
| extend error = coalesce(ensure_field(\"error\", typeof(bool)), false)
| where error == true
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    error_count = countif(error),
    services = make_set(['service.name']),
    root_operation = arg_min(_time, name)
  by trace_id
| sort by start_time desc
| limit 20" --start-time -1h -f json

Find Slow Traces

axiom query "['<dataset>']
| where _time >= ago(1h)
| where duration >= 1000000000
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    services = make_set(['service.name'])
  by trace_id
| sort by total_duration desc
| limit 20" --start-time -1h -f json

Find Traces by Service

axiom query "['<dataset>']
| where _time >= ago(1h)
| where ['service.name'] == '<SERVICE>'
| summarize
    start_time = min(_time),
    total_duration = max(duration),
    span_count = count(),
    error_count = countif(error == true)
  by trace_id
| sort by start_time desc
| limit 20" --start-time -1h -f json

Error Spans in Trace

axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| where error == true
| project _time, ['service.name'], name, duration, ['status.message']" --start-time -1h -f json

Critical Path Analysis

axiom query "['<dataset>']
| where trace_id == '<TRACE_ID>'
| project span_id, parent_span_id, ['service.name'], name, duration, error
| sort by duration desc" --start-time -1h -f json
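The query above returns a flat list of spans; turning it into a path requires following the parent-child links. The sketch below is one simple heuristic, not a definitive critical-path algorithm: starting from the root, it follows the slowest child at each level. It assumes each row has been loaded as a dict with `span_id`, `parent_span_id`, and `duration` keys, and it ignores parallel children and start offsets. `critical_path` is a hypothetical helper name.

```python
from collections import defaultdict


def critical_path(spans: list[dict]) -> list[dict]:
    """Follow the slowest child at each level, starting from the root span.

    Heuristic only: ignores concurrency between sibling spans.
    """
    ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parent_span_id") or ""
        if parent and parent in ids:
            children[parent].append(s)
        else:
            # No parent in this result set: treat the span as a root.
            roots.append(s)
    if not roots:
        return []
    current = max(roots, key=lambda s: s["duration"])
    path = [current]
    while children.get(current["span_id"]):
        current = max(children[current["span_id"]], key=lambda s: s["duration"])
        path.append(current)
    return path


# Hypothetical spans for illustration (durations in nanoseconds).
example_spans = [
    {"span_id": "a", "parent_span_id": "", "name": "GET /checkout", "duration": 900},
    {"span_id": "b", "parent_span_id": "a", "name": "auth", "duration": 100},
    {"span_id": "c", "parent_span_id": "a", "name": "charge", "duration": 700},
    {"span_id": "d", "parent_span_id": "c", "name": "db.query", "duration": 650},
]
print([s["name"] for s in critical_path(example_spans)])
```

If the trace has more than 100 spans or extends past the query window, some parents may be missing from the result, in which case orphaned spans are treated as roots.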

OTel Field Reference

Field	Bracket?	Description
`trace_id`	No	32-char trace identifier
`span_id`	No	16-char span identifier
`parent_span_id`	No	Parent span (empty for root)
`name`	No	Operation name
`duration`	No	Duration in nanoseconds
`kind`	No	CLIENT, SERVER, INTERNAL, PRODUCER, CONSUMER
`error`	No	Boolean error flag
`['service.name']`	Yes	Service identifier
`['status.code']`	Yes	OK, ERROR, or nil
`['status.message']`	Yes	Error description
`['scope.name']`	Yes	Instrumentation library

Duration Conversion

OTel durations are in nanoseconds:

Human	Nanoseconds	Filter
1 ms	1,000,000	`duration >= 1000000`
100 ms	100,000,000	`duration >= 100000000`
1 s	1,000,000,000	`duration >= 1000000000`

Convert for display:

| extend duration_ms = duration / 1000000.0
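For the human-readable durations used when reporting results, the same conversion can be done client-side. This is a minimal sketch; `format_duration` is a hypothetical helper, and the two-decimal precision is an arbitrary choice.

```python
def format_duration(ns: int) -> str:
    """Render a nanosecond duration using the largest unit that fits."""
    if ns >= 1_000_000_000:
        return f"{ns / 1_000_000_000:.2f} s"
    if ns >= 1_000_000:
        return f"{ns / 1_000_000:.2f} ms"
    if ns >= 1_000:
        return f"{ns / 1_000:.2f} us"
    return f"{ns} ns"


print(format_duration(1_500_000_000))  # 1.50 s
print(format_duration(100_000_000))    # 100.00 ms
```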

Custom Attributes

Non-standard span attributes are stored in the attributes.custom map:

// Filter by custom attribute
| where ['attributes.custom']['user_id'] == "123"

// Aggregation requires explicit cast
| summarize count() by tostring(['attributes.custom']['tenant'])

Without tostring(), aggregations fail with "grouping by field of type unknown".

Codebase Correlation

When working in a repository that matches the traced service, correlate trace data with source code to identify root causes.

Mapping Trace Data to Code

1. Extract package/module path from ['scope.name']
   - Contains the instrumentation library or package path
   - Strip the module prefix to get the local path
   - Example: github.com/org/repo/pkg/auth → pkg/auth
2. Find code from operation name
   - The name field often contains function names or HTTP routes
   - Search the codebase for matching handlers, functions, or endpoints
3. Trace the call chain
   - Follow parent-child span relationships
   - Map each span to its corresponding code location
   - Identify where errors originate and propagate
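Two of the steps above lend themselves to a short sketch: stripping the module prefix from `['scope.name']`, and walking parent links to find the error span furthest from the root. Both helpers are hypothetical, and the module prefix is something you supply (for example, from the repository's module declaration); spans are assumed to be dicts with `span_id`, `parent_span_id`, and `error` keys.

```python
from typing import Optional


def local_package_path(scope_name: str, module_prefix: str) -> str:
    """Strip the module prefix from a scope name to get a repo-relative path."""
    prefix = module_prefix.rstrip("/")
    if scope_name == prefix:
        return "."
    if scope_name.startswith(prefix + "/"):
        return scope_name[len(prefix) + 1:]
    # Not part of this repository; leave the scope name unchanged.
    return scope_name


def deepest_error_span(spans: list[dict]) -> Optional[dict]:
    """Return the error span furthest from the root, or None if none failed."""
    by_id = {s["span_id"]: s for s in spans}

    def depth(span: dict) -> int:
        d, seen = 0, set()
        # Walk up the parent chain; `seen` guards against malformed cycles.
        while span.get("parent_span_id") in by_id and span["span_id"] not in seen:
            seen.add(span["span_id"])
            span = by_id[span["parent_span_id"]]
            d += 1
        return d

    errors = [s for s in spans if s.get("error")]
    return max(errors, key=depth, default=None)


print(local_package_path("github.com/org/repo/pkg/auth", "github.com/org/repo"))
```

The deepest error span is only a starting point for investigation: an error can originate in a child and propagate upward, but a parent can also fail on its own after its children succeed.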

Note: Codebase correlation is optional. Proceed with trace-only analysis if code is unavailable or doesn't match the traced services.

Output Format

When analyzing a trace, provide:

## Trace Summary
- **Trace ID:** <id>
- **Duration:** <human-readable>
- **Services:** <list>
- **Outcome:** success/failure

## Sequence of Events
1. <Service> - <operation> (<duration>)
2. <Service> - <operation> (<duration>) ⚠️ ERROR
...

## Error Analysis
<What failed, when, why>

## Root Cause
<Deepest error and explanation>

## Codebase Locations (if applicable)
- **Service:** <service.name>
- **Package:** <scope.name>
- **Files:** <specific files to investigate>

## Recommended Actions
1. <Specific action>
2. <What to investigate next>

When NOT to Use

- Metrics analysis: Traces are for request flow; use logs/metrics skills for aggregated performance data
- Non-OTel data: This skill assumes OpenTelemetry field conventions (trace_id, span_id, etc.)
- Known trace structure: If you already have the query, run it directly without invoking this skill
- Alerting on trace patterns: Use Axiom Monitors for continuous alerting