authoring-log-alerts by posthog

npx skills add https://github.com/posthog/skills --skill authoring-log-alerts

Authoring log alerts

Authoring an alert is a measurement problem, not a guessing problem. You are not trying to be exhaustive — you are trying to land thresholds that fire 0–3 times per week on real production patterns, on services that matter.

When to use this skill

  • The user asks to "set up alerts" / "suggest alerts" for their project.
  • The user wants to evaluate whether a service is producing alertable signal.
  • The user has just enabled log alerting and wants a starter set.

When not to use this skill

  • Tuning an alert that already exists — that's a different job (use posthog:logs-alerts-events-list to inspect fire/resolve cadence and posthog:logs-alerts-partial-update to adjust).
  • Investigating an active incident — pull rows with posthog:query-logs, don't author an alert mid-incident.

Tools

Tool | Job | Where it fits
posthog:logs-services | Top-25 services in window with log_count, error_count, error_rate, sparkline. | Step 1 — triage.
posthog:logs-attributes-list / posthog:logs-attribute-values-list | Discover keys/values for narrower filters. | Step 2, optional.
posthog:logs-count-ranges | Adaptive time-bucketed counts for a filter. | Step 3 — baseline.
posthog:logs-alerts-simulate-create | Replay a draft config against -7d history with full state machine. | Step 4 — validate.
posthog:logs-alerts-create | Persist the alert. | Step 5 — ship.
posthog:logs-alerts-destinations-create | Wire the alert to Slack or webhook. | Step 5 — ship.

Do not call posthog:query-logs during authoring. You need distributions, not rows. Reserve posthog:query-logs for the very end if the user asks "show me a sample of what would have fired" — limit: 10 is plenty.

Workflow

1. Triage — pick candidate services

Call posthog:logs-services for the last 24h with no filters. The response is capped at 25 services and includes a sparkline, so it is small and bounded.

A service is a candidate when both are true:

  • log_count is non-trivial (≥ ~1k in 24h — quieter services produce too little signal to alert on).
  • error_rate is non-zero, or the user has named the service explicitly.

Skip services with high volume but error_rate == 0 unless the user wants a volume-shape alert (e.g. "warn me if api-gateway suddenly stops producing logs"). Volume-floor alerts use threshold_operator: below and need different reasoning — see references/volume-floor-alerts.md.

If the user names a service, treat it as a candidate even without error signal.
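As a concrete illustration (the service names below are made up, and any response keys beyond log_count, error_count, and error_rate are assumptions about the tool's exact output shape), a posthog:logs-services response might contain rows like:

{
  "services": [
    { "service": "checkout", "log_count": 48210, "error_count": 310, "error_rate": 0.0064 },
    { "service": "cron-runner", "log_count": 420, "error_count": 7, "error_rate": 0.0167 },
    { "service": "api-gateway", "log_count": 912000, "error_count": 0, "error_rate": 0.0 }
  ]
}

Here checkout passes both criteria, cron-runner fails the volume bar despite a higher error rate, and api-gateway only matters if the user wants a volume-floor alert.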

2. (Optional) Narrow the filter

If a service has many error sub-types, an alert on "all errors" is usually too broad. Use posthog:logs-attributes-list (try attribute_type: log) and posthog:logs-attribute-values-list to find a discriminator — common ones are http.status_code, error.type, k8s.container.name. Add the narrowing filter to your draft.

Keep it simple: one severity filter + one or two attribute filters is plenty. Multi-clause filters are harder to reason about and rarely improve precision.
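A minimal sketch of a narrowed draft filter. severityLevels and serviceNames follow the required shape described under "Filter shape" below; the filterGroup layout here (an AND group of key/value/operator property filters) is an assumption about the property-filter-group format, so verify it against the actual schema before shipping.

{
  "severityLevels": ["error"],
  "serviceNames": ["api-gateway"],
  "filterGroup": {
    "type": "AND",
    "values": [
      { "key": "http.status_code", "value": "500", "operator": "exact" }
    ]
  }
}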

3. Baseline — characterise the candidate over 7 days

Call posthog:logs-count-ranges with the candidate's filters, dateRange: { date_from: "-7d" }, and targetBuckets: 24 (one bucket ≈ 7h). The response gives you bucket counts.
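A sketch of that call, assuming the candidate's filters travel in a filters field shaped like the one the alert endpoints take:

{
  "filters": {
    "severityLevels": ["error", "fatal"],
    "serviceNames": ["api-gateway"]
  },
  "dateRange": { "date_from": "-7d" },
  "targetBuckets": 24
}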

Do not eyeball the percentiles or scale the threshold to the alert window manually. Pipe the count-ranges response into the helper script:

echo '<count-ranges JSON>' | python3 scripts/baseline_stats.py --window-minutes 5

The script returns:

{
  "n_buckets": 12,
  "bucket_minutes": 420.0,
  "alert_window_minutes": 5,
  "stats": { "p50": 12.0, "p95": 71.25, "p99": 126.25, "max": 140 },
  "suggested_threshold_count": 5,
  "rationale": "max(p99=126.25, median*3=36.0, floor=5) scaled from 420m bucket to 5m window",
  "health": []
}
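Reading the rationale in the example above: a p99 of 126.25 events per 420-minute bucket works out to roughly 126.25 × 5 / 420 ≈ 1.5 events per 5-minute window, and median × 3 = 36 scales to about 0.4, so the floor of 5 wins and becomes the suggestion. The script does this scaling for you; the arithmetic is spelled out here only to explain the output.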

Use suggested_threshold_count as your starting threshold. Read health:

health flag | What it means | What to do
sparse:N_of_M_buckets | Too few non-empty buckets for a 7d baseline. | Widen filter, extend to -30d, or skip.
empty | All buckets are zero. | Skip — no signal.
spiky | max is 10×+ p95. | Count-threshold alerts work well. Proceed.
flat | p95 ≈ p50. | Be cautious — either no incidents in lookback, or the metric is too smooth. Try a longer lookback or skip.
[] (empty) | Healthy distribution. | Proceed.

4. Draft and simulate

Pick a starter draft from these defaults — see references/threshold-defaults.md for the reasoning:

Setting | Default | Notes
threshold_count | suggested_threshold_count from the script | Already scaled to the alert window.
threshold_operator | above | Use below only for volume-floor alerts.
window_minutes | 5 | Allowed: 5, 10, 15, 30, 60. Must match what you passed to the script.
evaluation_periods | 3 | M in N-of-M.
datapoints_to_alarm | 2 | N in N-of-M. 2-of-3 reduces flap from a single noisy bucket.
cooldown_minutes | 30 | Minimum time between repeat fires.

Call posthog:logs-alerts-simulate-create with these settings and date_from: "-7d". The response gives you fire_count and resolve_count.
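A sketch of the simulate payload, assuming the threshold settings sit flat alongside the filters and the date range (the exact nesting may differ in the tool schema); the filters reuse the api-gateway example from the filter-shape section below, and the counts come from the baseline script:

{
  "filters": {
    "severityLevels": ["error", "fatal"],
    "serviceNames": ["api-gateway"]
  },
  "threshold_count": 5,
  "threshold_operator": "above",
  "window_minutes": 5,
  "evaluation_periods": 3,
  "datapoints_to_alarm": 2,
  "cooldown_minutes": 30,
  "date_from": "-7d"
}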

5. Iterate — three rounds, then ship or skip

Target: fire_count between 0 and ~3 over -7d. If outside the band:

Outcome | Adjustment
fire_count = 0 over 7d and the baseline was spiky | Lower threshold_count toward stats.p95 from the script, or drop to 1-of-2.
fire_count = 0 and the baseline was flat | The service has no alertable signal. Skip it; log why.
fire_count > 5 | Raise threshold_count toward stats.max from the script, or move to 3-of-5 for a smoother window.
fire_count is fine but resolve_count never matches fire_count | Cooldown is too long, or the underlying state is genuinely sticky. Acceptable for now.

When adjusting the threshold, read values from the script's stats block — never recompute percentiles by hand.

Cap iteration at 3 simulate calls per candidate. If you can't land in the band in 3 rounds, the metric is wrong — either the filter is too broad, the window is wrong, or the service genuinely doesn't have a threshold-shape signal. Note it and move on.

6. Ship — create + attach destination

Once a draft simulates cleanly:

  1. Call posthog:logs-alerts-create with the validated config. Use a name like <service> error rate (auto) so the user can see at a glance which alerts came from this skill.
  2. Call posthog:logs-alerts-destinations-create to wire it to a notification target. An alert with no destination is silent. Always confirm the channel name or webhook URL with the user before attaching — never wire an auto-generated alert to a production channel without explicit confirmation. If the user is unsure, suggest a low-traffic testing channel for the first few alerts.

If the user wants alerts created in enabled: false state for review-then-flip, pass enabled: false to -create and tell them how many drafts you produced.
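A sketch of the final posthog:logs-alerts-create payload, under the same flat-layout assumption as the simulate sketch above, shown with enabled: false for the review-then-flip flow:

{
  "name": "api-gateway error rate (auto)",
  "enabled": false,
  "filters": {
    "severityLevels": ["error", "fatal"],
    "serviceNames": ["api-gateway"]
  },
  "threshold_count": 5,
  "threshold_operator": "above",
  "window_minutes": 5,
  "evaluation_periods": 3,
  "datapoints_to_alarm": 2,
  "cooldown_minutes": 30
}

Attach the destination with posthog:logs-alerts-destinations-create only after the user has confirmed the target channel or webhook.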

Filter shape — required

The filters field on posthog:logs-alerts-create takes a subset of LogsViewerFilters and must contain at least one of:

  • severityLevels — list of ["trace","debug","info","warn","error","fatal"]
  • serviceNames — list of service name strings
  • filterGroup — property filter group

The same shape goes into posthog:logs-alerts-simulate-create's filters field. Match the simulate filters to the alert filters exactly — otherwise the simulation is testing a different alert than the one you ship.

Example minimum:

{
  "severityLevels": ["error", "fatal"],
  "serviceNames": ["api-gateway"]
}

Token-economy rules

  • One posthog:logs-services call at the start, not per-candidate.
  • One posthog:logs-count-ranges call per candidate at targetBuckets: 24. Don't go above 30 during authoring.
  • ≤ 3 posthog:logs-alerts-simulate-create calls per candidate.
  • Zero posthog:query-logs calls during the authoring loop.
  • Prefer reporting a small set of well-validated alerts over a long list of unvalidated drafts.

Output

Report what you did, in this shape:

  • For each shipped alert: name, filters, threshold, simulated fire_count over 7d, destination.
  • For each skipped candidate: service name + why (flat baseline, can't land threshold, low volume).
  • Total simulate calls made, total alerts created.

The user should be able to read this and decide whether to disable any drafts before they go live.
