monitoring-capture-service (author: posthog)
Install: npx skills add https://github.com/posthog/posthog --skill monitoring-capture-service
Monitoring the capture service with Grafana MCP
The capture service (rust/capture/) is PostHog's Rust HTTP ingestion endpoint.
It receives events from SDKs, applies quota/rate limits, and produces to Kafka.
Five deployments run the same binary with different CAPTURE_MODE configs, each in its own K8s namespace.
This skill teaches how to discover live metrics using the Grafana MCP tools rather than memorizing metric names that change as the code evolves.
Environment context
The Grafana MCP is connected to a single Grafana instance scoped to one environment. If the user hasn't specified, ask which environment they want to investigate:
- prod-us — US production (us-east-1)
- prod-eu — EU production (eu-central-1)
Most capture app metrics (e.g. capture_*, http_requests_*, envoy_cluster_*) are
environment-specific by virtue of which Grafana you're connected to — they don't carry
an environment label. MSK and CloudWatch metrics do carry environment labels but
are still scoped to the connected Grafana's AWS account.
Cross-environment comparison requires switching Grafana instances (not possible in one session).
Observability landscape
Capture spans seven telemetry domains. Each has a Grafana datasource and a discovery entry point.
| Domain | Datasource UID | Discovery tool | Scope filter |
|---|---|---|---|
| App metrics (VictoriaMetrics) | victoriametrics | list_prometheus_metric_names | regex: "capture_.*" |
| App metrics (realtime) | victoriametrics-realtime | same | same (lower retention, higher resolution) |
| Logs | P44D702D3E93867EC (Loki-logs) | list_loki_label_names | app=~"capture.*" |
| Profiling | pyroscope | list_pyroscope_profile_types | service_name="capture-analytics/capture-analytics" |
| Dashboards | n/a | search_dashboards | query "capture" or "ingestion" |
| CloudWatch (ElastiCache, MSK) | P034F075C744B399F | query_prometheus | environment="prod-us" |
| CloudWatch Root (prod-us only) | PAAE47F430CFD1449 | same | root account AWS metrics (does NOT exist in prod-eu) |
Stable waypoints
These facts change infrequently and are hard to discover dynamically.
Deployment roles
Each capture variant runs as a separate K8s deployment in its own namespace.
The primary scope labels are namespace and container (not role — that label
contains pod names and is not useful for filtering).
| Deployment | Namespace | capture_mode | Pipeline | Notes |
|---|---|---|---|---|
| capture-analytics | capture-analytics | events | Main events | Highest volume; routes /e, /i/v0/e, etc. |
| capture-ai | capture-ai | events | AI/LLM events | Routes /i/v0/ai; OTel on port 4318 |
| capture-replay | capture-replay | recordings | Session recordings | Routes /s/; CAPTURE_MODE=recordings |
| capture-mirrored | capture-mirrored | events | Mirror/canary | Not always running; same metrics as analytics |
| capture-logs | capture-logs | — | Log ingestion | OTel logs on port 4318 |
All variants share the same Rust binary (ghcr.io/posthog/posthog/capture).
Scope capture metrics with namespace=~"capture-.*" or container=~"capture-.*".
For a single pipeline, scope by namespace (e.g., namespace="capture-analytics").
Envoy cluster naming
Envoy metrics use envoy_cluster_name to identify the upstream backend.
Pattern: posthog_{deployment}_{port}.
Capture-related clusters:
posthog_capture-analytics_3000, posthog_capture-ai_3000, posthog_capture-replay_3000,
posthog_capture-mirrored_3000, posthog_capture-logs_4318, posthog_capture-logs-canary_4318.
capture-replay also has a proxy-as-a-service cluster used in KEDA autoscaling:
proxy-as-a-service_capture-replay_3000.
Scope with: envoy_cluster_name=~"posthog_capture-.*".
Redis instance topology
Capture depends on up to three logical Redis instances,
plus one external instance at the Envoy layer (not in the capture binary).
None emit capture_redis_* metrics — Redis health is inferred from capture-side
metrics and CloudWatch ElastiCache metrics.
1. Primary Redis (REDIS_URL env var)
   - ElastiCache: posthog-solo (prod-us) or posthog-prod-redis-encripted (prod-eu; sic, the typo is in the actual cluster name)
   - Both envs use a read-only endpoint (REDIS_READER_URL) for the token cache
   - Backs: billing/quota limits (CaptureQuotaLimiter), session replay overflow limiter
   - Capture metrics: capture_billing_limits_loaded_tokens (by cache_key), capture_quota_limit_exceeded (by resource)
   - Quota resources: events, exceptions, llm_events, recordings, survey_responses
   - Cache keys: @posthog/quota-limits/{resource}, @posthog/capture-overflow/replay
2. Global Rate Limiter Redis (GLOBAL_RATE_LIMIT_REDIS_URL, optional)
   - ElastiCache: capture-globalratelimit-{env}-redis (prod-us, prod-eu; not dev)
   - Backs: per-(token, distinct_id) sliding-window rate limiter
   - Falls back to primary Redis when URL is unset
   - Optional read replica: GLOBAL_RATE_LIMIT_REDIS_READER_URL
   - Toggle: GLOBAL_RATE_LIMIT_ENABLED (may be off in some envs during rollout)
   - Metrics: global_rate_limiter_* (direct), capture_events_rerouted_overflow{reason="rate_limited"} (proxy signal)
   - CloudWatch cluster id: capture-globalratelimit-prod-redis
3. Event Restrictions Redis (EVENT_RESTRICTIONS_REDIS_URL)
   - ElastiCache: ingestion-prod-redis (separate writable cluster in both envs)
   - Stores Django-synced ingestion restriction configs
   - Falls back to primary Redis when URL is unset
   - Capture metrics: capture_event_restrictions_redis_fetch (labels: restriction_type, result in success/not_found/error/parse_error), capture_event_restrictions_stale, capture_event_restrictions_loaded_count
4. Contour Rate Limit Redis (ratelimit-{env}-redis) — NOT in capture binary
   - Per-IP DoS protection at the Envoy ingress layer, in front of capture
   - Metrics: ratelimit_service_* (label: domain="posthog")
Metric prefixes
Every prefix here can be discovered live with list_prometheus_metric_names
using datasourceUid: "victoriametrics" and regex: "<prefix>.*".
| Prefix | Domain | Scope label |
|---|---|---|
| capture_* | App metrics (~80 metrics) | namespace, container |
| http_requests_* | HTTP layer (shared) | namespace=~"capture-.*" |
| capture_kafka_* | Kafka producer (17 metrics) | namespace, container |
| capture_billing_* | Billing/quota tokens loaded | namespace, cache_key |
| capture_event_restrictions_* | Event restrictions (6 metrics) | namespace, restriction_type |
| capture_ai_otel_* | AI/OTel capture (12 metrics) | namespace="capture-ai" |
| envoy_cluster_* | L7 proxy | envoy_cluster_name=~"posthog_capture-.*" |
| aws_msk_* | MSK broker-side (JMX) | environment="prod-us" or "prod-eu" |
| ratelimit_service_* | Contour rate limit | domain="posthog" |
| overflow_redirect_* | Node.js ingestion overflow (downstream) | ingestion_pipeline |
| kube_* / container_* | K8s resources | namespace=~"capture-.*", pod=~"capture-.*" |
Kafka topics
Topics capture produces to (discover live via topic label on capture_kafka_produce_avg_batch_size_bytes).
Partition counts are encoded in topic names and differ by env (EU generally has fewer partitions).
Capture writes to two different Kafka backing systems depending on the pipeline:
- MSK ingestion cluster — analytics events (main, overflow, historical, turbo), heatmaps, error tracking, client warnings
- WarpStream — session replay (warpstream-replay-v2 VC), logs (warpstream-logs VC), traces (warpstream-traces VC)
| Topic (prod-us / prod-eu) | Backing | Pipeline |
|---|---|---|
| ingestion-analytics-main-1024 / -512 | MSK | Main events |
| ingestion-analytics-overflow-128 | MSK | Overflow (rate-limited / high-volume tokens) |
| ingestion-analytics-historical-128 | MSK | Historical backfill events |
| ingestion-analytics-turbo-1024 | MSK | General turbo (prod-us only) |
| ingestion-heatmaps-main-128 | MSK | Heatmaps |
| ingestion-errortracking-main-128 | MSK | Error tracking |
| ingestion-errortracking-overflow-32 | MSK | Error tracking overflow |
| ingestion-clientwarnings-main-16 / -32 | MSK | Client warnings |
| ingestion-sessionreplay-main-512 / -256 | WarpStream | Session replay |
| ingestion-sessionreplay-overflow-64 / -32 | WarpStream | Session replay overflow |
| ingestion-logs | WarpStream | Log ingestion |
| ingestion-traces | WarpStream | Traces ingestion |
| ingestion-analytics-main-dlq (+ per-pipeline DLQ topics) | MSK | Dead letter queues |
Pyroscope services
| Service name | Deployment |
|---|---|
| capture-analytics/capture-analytics | Main capture |
| capture-ai/capture-ai | AI capture |
| capture-replay/capture-replay | Replay capture |
| capture-mirrored/capture-mirrored | Mirror/canary (when running) |
| capture-logs/capture-logs | Logs capture |
Profile types: process_cpu:cpu:nanoseconds:cpu:nanoseconds,
wall:wall:nanoseconds:wall:nanoseconds,
memory:inuse_space:bytes:inuse_space:bytes,
memory:inuse_objects:count:inuse_space:bytes.
Grafana dashboards
| UID | Title |
|---|---|
| capture | Capture (golden-chart backend overview) |
| ddfkdj56ds11xce | Capture Golden (Synced folder) |
| dffkdlee8ub5s0a | Ingestion - Capture Golden |
| capture-3000-envoy-codes | Capture 3000 — Envoy Response Code Investigation |
| ingestion-general | Cross-service ingestion overview |
| ingestion-analytics | Ingestion - Analytics (per-pipeline breakdown) |
| ingestion-reliability | Error rates and reliability signals |
| ingestion-pipeline-performance | End-to-end pipeline latency |
| b2348f37-f276-498e-b72e-7cc2b5ec1455 | New capture (legacy) |
| contour | Envoy L7 proxy (set envoy_cluster_name=posthog_capture-analytics_3000) |
| ingestion-session-recordings | Session Replay ingestion |
Discovery workflows
Prometheus / VictoriaMetrics
- list_prometheus_metric_names — datasourceUid: "victoriametrics", regex: "capture_.*" to enumerate app metrics
- Pick a metric, then list_prometheus_label_names scoped to it — see available dimensions
- list_prometheus_label_values — discover actual values for a label (e.g. labelName: "cause" on capture_events_dropped_total)
- query_prometheus with PromQL — always scope by namespace (or container) and set a time range
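A typical chained discovery can be written out as plain argument payloads. Tool names come from the list above; the exact parameter schemas are assumptions and should be confirmed against the live MCP tool definitions:

```python
# Sketch of a chained discovery workflow as argument payloads for the MCP
# tools above. Parameter names mirror this document; exact schemas may differ.
step1 = {"tool": "list_prometheus_metric_names",
         "datasourceUid": "victoriametrics", "regex": "capture_.*"}
step2 = {"tool": "list_prometheus_label_names",
         "datasourceUid": "victoriametrics",
         "matchers": '{__name__="capture_events_dropped_total"}'}
step3 = {"tool": "list_prometheus_label_values",
         "datasourceUid": "victoriametrics", "labelName": "cause"}
step4 = {"tool": "query_prometheus",
         "datasourceUid": "victoriametrics",
         "expr": 'sum by (cause) (rate(capture_events_dropped_total{namespace="capture-analytics"}[5m]))'}
workflow = [step1, step2, step3, step4]
print([s["tool"] for s in workflow])
```

Note that every step targets the same datasourceUid, and the final query is scoped by namespace as the list above requires.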
Loki (logs)
- list_loki_label_names — datasourceUid: "P44D702D3E93867EC" (Loki-logs; do NOT use primary Loki P8E80F9AEF21F6940, which 502s intermittently)
- list_loki_label_values for app or namespace — find capture containers
- query_loki_logs — e.g. {app=~"capture.*"} |= "error"
Pyroscope (profiling)
- list_pyroscope_profile_types — data_source_uid: "pyroscope"
- fetch_pyroscope_profile — matchers: '{service_name="capture-analytics/capture-analytics"}', profile_type: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"
Dashboards
- search_dashboards — query "capture" or "ingestion"
- get_dashboard_by_uid — use a known UID (e.g. "capture") to get panel details
- get_dashboard_panel_queries — extract PromQL from existing panels
Redis / ElastiCache
- Capture-side: discover capture_billing_*, capture_event_restrictions_*, capture_quota_* metrics in VictoriaMetrics
- Infrastructure: CloudWatch datasource P034F075C744B399F for ElastiCache (CPU, memory, connections, latency). Cluster IDs: capture-globalratelimit-prod-redis, posthog-solo (prod-us primary)
Key metric domains
Categories of what to look for. Discover specific metrics live using the prefixes above.
HTTP layer — request rate, latency distribution (p50/p99), active connections,
error rates by status code. Metrics: http_requests_* scoped by namespace, capture_active_connections.
Event lifecycle — the funnel from received to ingested to dropped/rerouted.
capture_events_received_total -> capture_events_ingested_total -> capture_events_dropped_total.
The cause label on drops has 20+ values (discover live). Key additions since the golden-chart
migration: event_restriction_drop, event_too_big, otel_quota_drop, oversize_event,
ai_opt_in, gathering, invalid_session, no_session_id, no_snapshot.
Rerouting: capture_events_rerouted_overflow with reason label
(rate_limited, force_limited, event_restriction).
Also: capture_events_rerouted_custom_topic for topic-redirect restrictions.
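The funnel arithmetic is simple to sanity-check. Sample values below are made up; in practice, pull the counters named above via query_prometheus over the same window:

```python
# Made-up counter deltas over one window; real values come from the
# capture_events_* counters named above, queried over the same range.
received = 1_000_000
ingested = 987_000
dropped = 9_000
rerouted = 4_000  # overflow + custom-topic reroutes combined

accounted = ingested + dropped + rerouted
print(f"accounted for: {accounted} of {received}")  # a gap hints at an unobserved path
print(f"drop rate: {dropped / received:.2%}")       # drop rate: 0.90%
```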
Kafka producer — broker connectivity (capture_kafka_any_brokers_down,
capture_kafka_broker_connected), queue saturation (_queue_depth / _queue_depth_limit),
produce RTT (capture_kafka_produce_rtt_latency_us by quantile and broker),
delivery errors (capture_kafka_produce_errors_total).
Billing and quota — capture_billing_limits_loaded_tokens by cache_key,
capture_quota_limit_exceeded by resource (events, exceptions, llm_events, recordings, survey_responses).
Event restrictions — capture_event_restrictions_* for Redis fetch health,
staleness, loaded count, applied restrictions by restriction_type
(drop_event, force_overflow, redirect_to_topic, skip_person_processing).
Envoy proxy — upstream latency, response codes (2xx/4xx/5xx), connection health,
circuit breakers (_open gauges), backend membership (healthy vs total),
timeouts, retries. Always filter: envoy_cluster_name=~"posthog_capture-.*".
For capture-replay, also check proxy-as-a-service_capture-replay_3000.
Contour rate limit — ratelimit_service_* for per-IP DoS protection.
ratelimit_service_rate_limit_over_limit = actively rate-limited IPs.
MSK broker-side — aws_msk_* JMX metrics for capture-analytics, capture-ai, and other MSK-backed topics.
Key signals: throttle time, network processor idle %, memory pool depletion, request queue size.
Both envs have a dedicated ingestion MSK cluster separate from the events cluster
(prod-us: c21; prod-eu: posthog-prod-eu-ingestion-2026-05-04).
WarpStream — warpstream_agent_* metrics for capture-replay, capture-logs, and traces.
These pipelines produce to in-cluster WarpStream agents, not MSK.
Key signals: warpstream_agent_control_plane_operation_counter (by operation),
warpstream_agent_file_cache_client_fetch_local_or_remote_counter (cache hit/miss ratio).
Dashboards: warpstream (Agent Overview), dbfj5c31spa1ogf (MSK vs WarpStream — Active Produce Topics).
US-only personal dashboards (not synced to EU): ws-coarse-lag-explore (Coarse Lag), 8e93b023-… (CH Consumer Lag).
Per-VC KMinion instances: kminion-warpstream-replay, kminion-warpstream-logs, kminion-warpstream-traces.
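The cache hit/miss ratio is plain counter arithmetic. Sample values below are made up; split the real warpstream_agent_file_cache_client_fetch_local_or_remote_counter by its local/remote result label (discover the exact label name live) and plug in rates:

```python
# Illustrative hit-ratio arithmetic for the WarpStream fetch counter above.
# Values are made up; use per-label rate() results from the real counter.
local_fetches = 920_000.0   # served from the agent's local file cache
remote_fetches = 80_000.0   # had to fetch from remote/object storage
hit_ratio = local_fetches / (local_fetches + remote_fetches)
print(f"cache hit ratio: {hit_ratio:.1%}")  # cache hit ratio: 92.0%
```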
K8s resources — container_* and kube_* for CPU, memory, restarts, HPA state.
Scope: namespace=~"capture-.*", pod=~"capture-.*".
Investigation playbooks
See references/investigation-playbooks.md for step-by-step workflows for common questions: health checks, event loss, latency, Kafka backpressure, rate limiting, Redis, and cross-env comparison.