controlling-costs

por axiomhq

Analiza los patrones de consulta de Axiom para encontrar datos no utilizados, luego crea paneles y monitores para la optimización de costos. Úsalo cuando se te pida reducir costos de Axiom, encontrar datos no utilizados…

npx skills add https://github.com/axiomhq/skills --skill controlling-costs

Descargar ZIP GitHub

Axiom Cost Control

Dashboards, monitors, and waste identification for Axiom usage optimization.

Before You Start

Load required skills:
```
skill: axiom-sre
skill: building-dashboards
```
Building-dashboards provides: dashboard-list, dashboard-get, dashboard-create, dashboard-update, dashboard-delete
Find the audit dataset. Try axiom-audit first:
```
['axiom-audit']
| where _time > ago(1h)
| summarize count() by action
| where action in ('usageCalculated', 'runAPLQueryCost')
```
- If not found → ask user. Common names: axiom-audit-logs-view, audit-logs
- If found but no usageCalculated events → wrong dataset, ask user
Verify axiom-history access (required for Phase 4):
```
['axiom-history'] | where _time > ago(1h) | take 1
```
If not found, Phase 4 optimization will not work.
Confirm with user:
- Deployment name?
- Audit dataset name?
- Contract limit in TB/day? (required for Phase 3 monitors)
Replace <deployment> and <audit-dataset> in all commands below.

Tips:

Run any script with -h for full usage
Do NOT pipe script output to head or tail — causes SIGPIPE errors
Requires jq for JSON parsing
Use axiom-sre's axiom-query for ad-hoc APL, not direct CLI

Which Phases to Run

User request	Run these phases
"reduce costs" / "find waste"	0 → 1 → 4
"set up cost control"	0 → 1 → 2 → 3
"deploy dashboard"	0 → 2
"create monitors"	0 → 3
"check for drift"	0 only

Phase 0: Check Existing Setup

# Existing dashboard?
dashboard-list <deployment> | grep -i cost

# Existing monitors?
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'

If found, fetch with dashboard-get and compare to templates/dashboard.json for drift.

Phase 1: Discovery

scripts/baseline-stats -d <deployment> -a <audit-dataset>

Captures daily ingest stats and produces the Analysis Queue (needed for Phase 4).

Phase 2: Dashboard

scripts/deploy-dashboard -d <deployment> -a <audit-dataset>

Creates dashboard with: ingest trends, burn rate, projections, waste candidates, top users. See reference/dashboard-panels.md for details.

Phase 3: Monitors

Contract is required. You must have the contract limit from preflight step 4.

Step 1: List available notifiers

scripts/list-notifiers -d <deployment>

Present the list to the user and ask which notifier they want for cost alerts. If they don't want notifications, proceed without -n.

Step 2: Create monitors

scripts/create-monitors -d <deployment> -a <audit-dataset> -c <contract_tb> [-n <notifier_id>]

Creates 3 monitors:

Total Ingest Guard — alerts when daily ingest >1.2x contract OR 7-day avg grows >15% vs baseline
Per-Dataset Spike — robust z-score detection, alerts per dataset with attribution
Query Cost Spike — hardened z-score with 30d baseline, 5d exclusion gap, persistence-based gating (median_z > 3, p25_z > 2.5)

The spike monitors use notifyByGroup: true so each dataset triggers a separate alert.

See reference/monitor-strategy.md for threshold derivation.

Phase 4: Optimization

Get the Analysis Queue

Run scripts/baseline-stats if not already done. It outputs a prioritized list:

Priority	Meaning
P0⛔	Top 3 by ingest OR >10% of total — MANDATORY
P1	Never queried — strong drop candidate
P2	Rarely queried (Work/GB < 100) — likely waste

Work/GB = query cost (GB·ms) / ingest (GB). Lower = less value from data.

Analyze datasets in order

Work top-to-bottom. For each dataset:

Step 1: Column analysis

scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset>

If 0 queries → recommend DROP, move to next.

Step 2: Field value analysis

Pick a field from suggested list (usually app, service, or kubernetes.labels.app):

scripts/analyze-query-coverage -d <deployment> -D <dataset> -a <audit-dataset> -f <field>

Note values with high volume but never queried (⚠️ markers).

Step 3: Handle empty values

If (empty) has >5% volume, you MUST drill down with alternative field (e.g., kubernetes.namespace_name).

Step 4: Record recommendation

For each dataset, note: name, ingest volume, Work/GB, top unqueried values, action (DROP/SAMPLE/KEEP), estimated savings.

Done when

All P0⛔ and P1 datasets analyzed. Then compile report using reference/analysis-report-template.md.

Cleanup

# Delete monitors
axiom-api <deployment> GET "/v2/monitors" | jq -r '.[] | select(.name | startswith("Cost Control:")) | "\(.id)\t\(.name)"'
axiom-api <deployment> DELETE "/v2/monitors/<id>"

# Delete dashboard
dashboard-list <deployment> | grep -i cost
dashboard-delete <deployment> <id>

Note: Running create-monitors twice creates duplicates. Delete existing monitors first if re-deploying.

Reference

Audit Dataset Fields

Field	Description
`action`	`usageCalculated` or `runAPLQueryCost`
`properties.hourly_ingest_bytes`	Hourly ingest in bytes
`properties.hourly_billable_query_gbms`	Hourly query cost
`properties.dataset`	Dataset name
`resource.id`	Org ID
`actor.email`	User email

Common Fields for Value Analysis

Dataset type	Primary field	Alternatives
Kubernetes logs	`kubernetes.labels.app`	`kubernetes.namespace_name`, `kubernetes.container_name`
Application logs	`app` or `service`	`level`, `logger`, `component`
Infrastructure	`host`	`region`, `instance`
Traces	`service.name`	`span.kind`, `http.route`

Units & Conversions

Scripts use TB/day
Dashboard filter uses GB/month

Contract	TB/day	GB/month
5 PB/month	167	5,000,000
10 PB/month	333	10,000,000
15 PB/month	500	15,000,000

Optimization Actions

Signal	Action
Work/GB = 0	Drop or stop ingesting
High-volume unqueried values	Sample or reduce log level
Empty values from system namespaces	Filter at ingest or accept
WoW spike	Check recent deploys