npx skills add https://github.com/posthog/skills --skill authoring-log-alerts

Authoring log alerts

Authoring an alert is a measurement problem, not a guessing problem. You are not trying to be exhaustive — you are trying to land thresholds that fire 0–3 times per week on real production patterns, on services that matter.

When to use this skill

  • The user asks to "set up alerts" / "suggest alerts" for their project.
  • The user wants to evaluate whether a service is producing alertable signal.
  • The user has just enabled log alerting and wants a starter set.

When not to use this skill

  • Tuning an alert that already exists — that's a different job (use posthog:logs-alerts-events-list to inspect fire/resolve cadence and posthog:logs-alerts-partial-update to adjust).
  • Investigating an active incident — pull rows with posthog:query-logs, don't author an alert mid-incident.

Tools

| Tool | Job | Where it fits |
| --- | --- | --- |
| posthog:logs-services | Top-25 services in window with log_count, error_count, error_rate, sparkline. | Step 1 — triage. |
| posthog:logs-attributes-list / posthog:logs-attribute-values-list | Discover keys/values for narrower filters. | Step 2, optional. |
| posthog:logs-count-ranges | Adaptive time-bucketed counts for a filter. | Step 3 — baseline. |
| posthog:logs-alerts-simulate-create | Replay a draft config against -7d history with full state machine. | Step 4 — validate. |
| posthog:logs-alerts-create | Persist the alert. | Step 5 — ship. |
| posthog:logs-alerts-destinations-create | Wire the alert to Slack or webhook. | Step 5 — ship. |

Do not call posthog:query-logs during authoring. You need distributions, not rows. Reserve posthog:query-logs for the very end if the user asks "show me a sample of what would have fired" — limit: 10 is plenty.

Workflow

1. Triage — pick candidate services

Call posthog:logs-services for the last 24h with no filters. The response is capped at 25 services and includes a sparkline, so it is small and bounded.

A service is a candidate when both are true:

  • log_count is non-trivial (≥ ~1k in 24h — quieter services produce too little signal to alert on).
  • error_rate is non-zero, or the user has named the service explicitly.

Skip services with high volume but error_rate == 0 unless the user wants a volume-shape alert (e.g. "warn me if api-gateway suddenly stops producing logs"). Volume-floor alerts use threshold_operator: below and need different reasoning — see references/volume-floor-alerts.md.

If the user names a service, treat it as a candidate even without error signal.
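
In code, the selection rule above amounts to the following sketch. The response shape (a list of objects with name, log_count, and error_rate fields) is an assumption inferred from the tool description, not a documented schema:

# Sketch of the triage rule. Field names are assumptions inferred from
# the posthog:logs-services description above.
MIN_LOG_COUNT = 1_000  # "non-trivial" volume over 24h

def pick_candidates(services: list[dict], named_by_user: set[str]) -> list[dict]:
    candidates = []
    for svc in services:
        busy = svc["log_count"] >= MIN_LOG_COUNT
        erroring = svc["error_rate"] > 0
        named = svc["name"] in named_by_user  # naming waives the error signal
        if busy and (erroring or named):
            candidates.append(svc)
    return candidates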

2. (Optional) Narrow the filter

If a service has many error sub-types, an alert on "all errors" is usually too broad. Use posthog:logs-attributes-list (try attribute_type: log) and posthog:logs-attribute-values-list to find a discriminator — common ones are http.status_code, error.type, k8s.container.name. Add the narrowing filter to your draft.

Keep it simple: one severity filter + one or two attribute filters is plenty. Multi-clause filters are harder to reason about and rarely improve precision.
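
As a sketch, the discovery calls look like this. attribute_type comes from the text above; call_tool is a hypothetical stand-in for however your client invokes MCP tools, and the remaining argument names are assumptions rather than a documented schema:

def call_tool(name: str, args: dict) -> dict:
    """Hypothetical stand-in for your MCP client's tool invocation."""
    raise NotImplementedError

keys = call_tool("posthog:logs-attributes-list", {"attribute_type": "log"})
values = call_tool(
    "posthog:logs-attribute-values-list",
    {"attribute": "http.status_code"},  # one of the common discriminators above
)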

3. Baseline — characterise the candidate over 7 days

Call posthog:logs-count-ranges with the candidate's filters, dateRange: { date_from: "-7d" }, and targetBuckets: 24 (one bucket ≈ 7h). The response gives you bucket counts.
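
In call form (filters is assumed to take the same shape as the alert filters described later; dateRange and targetBuckets are quoted from this step):

counts = call_tool("posthog:logs-count-ranges", {
    "filters": {"severityLevels": ["error", "fatal"], "serviceNames": ["api-gateway"]},
    "dateRange": {"date_from": "-7d"},
    "targetBuckets": 24,  # one bucket is roughly 7h over a 7d lookback
})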

Do not eyeball the percentiles or scale the threshold to the alert window manually. Pipe the count-ranges response into the helper script:

echo '<count-ranges JSON>' | python3 scripts/baseline_stats.py --window-minutes 5

The script returns:

{
  "n_buckets": 12,
  "bucket_minutes": 420.0,
  "alert_window_minutes": 5,
  "stats": { "p50": 12.0, "p95": 71.25, "p99": 126.25, "max": 140 },
  "suggested_threshold_count": 5,
  "rationale": "max(p99=126.25, median*3=36.0, floor=5) scaled from 420m bucket to 5m window",
  "health": []
}
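
For intuition, the rationale string decodes to roughly this computation. This is inferred from the example output; the script remains the source of truth, and the rounding behavior is an assumption:

import math

# Values from the example stats block above.
p50, p99 = 12.0, 126.25
floor = 5
bucket_minutes, window_minutes = 420.0, 5

raw = max(p99, p50 * 3, floor)                    # 126.25
scaled = raw * (window_minutes / bucket_minutes)  # ~1.50 events per 5m window
suggested = max(floor, math.ceil(scaled))         # the floor wins here: 5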

Use suggested_threshold_count as your starting threshold. Read health:

| health flag | What it means | What to do |
| --- | --- | --- |
| sparse:N_of_M_buckets | Too few non-empty buckets for a 7d baseline. | Widen filter, extend to -30d, or skip. |
| empty | All buckets are zero. | Skip — no signal. |
| spiky | max is 10×+ p95. | Count-threshold alerts work well. Proceed. |
| flat | p95 ≈ p50. | Be cautious — either no incidents in lookback, or the metric is too smooth. Try a longer lookback or skip. |
| [] (empty) | Healthy distribution. | Proceed. |

4. Draft and simulate

Pick a starter draft from these defaults — see references/threshold-defaults.md for the reasoning:

| Setting | Default | Notes |
| --- | --- | --- |
| threshold_count | suggested_threshold_count from the script | Already scaled to the alert window. |
| threshold_operator | above | Use below only for volume-floor alerts. |
| window_minutes | 5 | Allowed: 5, 10, 15, 30, 60. Must match what you passed to the script. |
| evaluation_periods | 3 | M in N-of-M. |
| datapoints_to_alarm | 2 | N in N-of-M. 2-of-3 reduces flap from a single noisy bucket. |
| cooldown_minutes | 30 | Minimum time between repeat fires. |

Call posthog:logs-alerts-simulate-create with these settings and date_from: "-7d". The response gives you fire_count and resolve_count.
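
Put together, a draft simulate call looks roughly like this. Field names come from the defaults table and the filter-shape section below; the exact tool argument schema is an assumption, and call_tool is the hypothetical helper from step 2:

draft = {
    "filters": {
        "severityLevels": ["error", "fatal"],
        "serviceNames": ["api-gateway"],
    },
    "threshold_count": 5,        # suggested_threshold_count from the script
    "threshold_operator": "above",
    "window_minutes": 5,         # must match --window-minutes passed to the script
    "evaluation_periods": 3,
    "datapoints_to_alarm": 2,
    "cooldown_minutes": 30,
    "date_from": "-7d",
}
result = call_tool("posthog:logs-alerts-simulate-create", draft)
print(result["fire_count"], result["resolve_count"])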

5. Iterate — three rounds, then ship or skip

Target: fire_count between 0 and ~3 over -7d. If outside the band:

| Outcome | Adjustment |
| --- | --- |
| fire_count = 0 over 7d and the baseline was spiky | Lower threshold_count toward stats.p95 from the script, or drop to 1-of-2. |
| fire_count = 0 and the baseline was flat | The service has no alertable signal. Skip it; log why. |
| fire_count > 5 | Raise threshold_count toward stats.max from the script, or move to 3-of-5 for a smoother window. |
| fire_count is fine but resolve_count never matches fire_count | Cooldown is too long, or the underlying state is genuinely sticky. Acceptable for now. |

When adjusting the threshold, read values from the script's stats block — never recompute percentiles by hand.

Cap iteration at 3 simulate calls per candidate. If you can't land in the band in 3 rounds, the metric is wrong — either the filter is too broad, the window is wrong, or the service genuinely doesn't have a threshold-shape signal. Note it and move on.
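
The whole iterate step fits in a small loop. This is a sketch reusing the draft and the hypothetical call_tool helper from the previous step; the adjustment itself is left as comments because the right move depends on the baseline shape, per the table above:

MAX_ROUNDS = 3  # hard cap from the rule above

for _ in range(MAX_ROUNDS):
    result = call_tool("posthog:logs-alerts-simulate-create", draft)
    if result["fire_count"] <= 3:
        break  # in the 0-3 band: ship (unless it is 0 fires on a spiky baseline)
    # Outside the band: adjust per the table, e.g. raise draft["threshold_count"]
    # toward stats["max"] when fire_count > 5, then re-simulate.
else:
    # Could not land the band in three rounds: the metric is wrong.
    # Note the candidate as skipped and move on.
    ...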

6. Ship — create + attach destination

Once a draft simulates cleanly:

  1. Call posthog:logs-alerts-create with the validated config. Use a name like <service> error rate (auto) so the user can see at a glance which alerts came from this skill.
  2. Call posthog:logs-alerts-destinations-create to wire it to a notification target. An alert with no destination is silent. Always confirm the channel name or webhook URL with the user before attaching — never wire an auto-generated alert to a production channel without explicit confirmation. If the user is unsure, suggest a low-traffic testing channel for the first few alerts.

If the user wants alerts created in enabled: false state for review-then-flip, pass enabled: false to posthog:logs-alerts-create and tell them how many drafts you produced.
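
A minimal ship sequence, again with the hypothetical call_tool helper. The destination argument names are assumptions to check against the real tool schema; the user confirmation is the part that matters:

config = {k: v for k, v in draft.items() if k != "date_from"}  # drop the sim-only field
config["name"] = "api-gateway error rate (auto)"
config["enabled"] = False  # optional review-then-flip state

alert = call_tool("posthog:logs-alerts-create", config)

# Only after the user has explicitly confirmed the target channel:
call_tool("posthog:logs-alerts-destinations-create", {
    "alert_id": alert["id"],       # field names here are assumptions
    "channel": "#alerts-testing",  # suggest a low-traffic channel for new drafts
})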

Filter shape — required

The filters field on posthog:logs-alerts-create takes a subset of LogsViewerFilters and must contain at least one of:

  • severityLevels — list of ["trace","debug","info","warn","error","fatal"]
  • serviceNames — list of service name strings
  • filterGroup — property filter group

The same shape goes into posthog:logs-alerts-simulate-create's filters field. Match the simulate filters to the alert filters exactly — otherwise the simulation is testing a different alert than the one you ship.

Example minimum:

{
  "severityLevels": ["error", "fatal"],
  "serviceNames": ["api-gateway"]
}

Token-economy rules

  • One posthog:logs-services call at the start, not per-candidate.
  • One posthog:logs-count-ranges call per candidate at targetBuckets: 24. Don't go above 30 during authoring.
  • ≤ 3 posthog:logs-alerts-simulate-create calls per candidate.
  • Zero posthog:query-logs calls during the authoring loop.
  • Prefer reporting a small set of well-validated alerts over a long list of unvalidated drafts.

Output

Report what you did, in this shape:

  • For each shipped alert: name, filters, threshold, simulated fire_count over 7d, destination.
  • For each skipped candidate: service name + why (flat baseline, can't land threshold, low volume).
  • Total simulate calls made, total alerts created.

The user should be able to read this and decide whether to disable any drafts before they go live.
