monitoring-capture-service (author: posthog)


npx skills add https://github.com/posthog/posthog --skill monitoring-capture-service

Monitoring the capture service with Grafana MCP

The capture service (rust/capture/) is PostHog's Rust HTTP ingestion endpoint. It receives events from SDKs, applies quota/rate limits, and produces to Kafka. Five deployments run the same binary with different CAPTURE_MODE configs, each in its own K8s namespace.

This skill teaches how to discover live metrics using the Grafana MCP tools rather than memorizing metric names that change as the code evolves.

Environment context

The Grafana MCP is connected to a single Grafana instance scoped to one environment. If the user hasn't specified, ask which environment they want to investigate:

  • prod-us — US production (us-east-1)
  • prod-eu — EU production (eu-central-1)

Most capture app metrics (e.g. capture_*, http_requests_*, envoy_cluster_*) are environment-specific by virtue of which Grafana you're connected to — they don't carry an environment label. MSK and CloudWatch metrics do carry environment labels but are still scoped to the connected Grafana's AWS account.

Cross-environment comparison requires switching Grafana instances (not possible in one session).

Observability landscape

Capture spans seven telemetry domains. Each has a Grafana datasource and a discovery entry point.

| Domain | Datasource UID | Discovery tool | Scope filter |
|---|---|---|---|
| App metrics (VictoriaMetrics) | victoriametrics | list_prometheus_metric_names | regex: "capture_.*" |
| App metrics (realtime) | victoriametrics-realtime | same | same (lower retention, higher resolution) |
| Logs | P44D702D3E93867EC (Loki-logs) | list_loki_label_names | app=~"capture.*" |
| Profiling | pyroscope | list_pyroscope_profile_types | service_name="capture-analytics/capture-analytics" |
| Dashboards | n/a | search_dashboards | query "capture" or "ingestion" |
| CloudWatch (ElastiCache, MSK) | P034F075C744B399F | query_prometheus | environment="prod-us" |
| CloudWatch Root (prod-us only) | PAAE47F430CFD1449 | same | root account AWS metrics (does NOT exist in prod-eu) |

Stable waypoints

These facts change infrequently and are hard to discover dynamically.

Deployment roles

Each capture variant runs as a separate K8s deployment in its own namespace. The primary scope labels are namespace and container (not role — that label contains pod names and is not useful for filtering).

| Deployment | Namespace | capture_mode | Pipeline | Notes |
|---|---|---|---|---|
| capture-analytics | capture-analytics | events | Main events | Highest volume; routes /e, /i/v0/e, etc. |
| capture-ai | capture-ai | events | AI/LLM events | Routes /i/v0/ai; OTel on port 4318 |
| capture-replay | capture-replay | recordings | Session recordings | Routes /s/; CAPTURE_MODE=recordings |
| capture-mirrored | capture-mirrored | events | Mirror/canary | Not always running; same metrics as analytics |
| capture-logs | capture-logs | | Log ingestion | OTel logs on port 4318 |

All variants share the same Rust binary (ghcr.io/posthog/posthog/capture).

Scope capture metrics with namespace=~"capture-.*" or container=~"capture-.*". For a single pipeline, scope by namespace (e.g., namespace="capture-analytics").
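As a concrete sketch of that scoping pattern, a per-pipeline throughput query over the received-events counter (documented under "Key metric domains" below) might look like this; confirm the metric name via discovery before relying on it:

```promql
# Per-pipeline event throughput, grouped by the namespace scope label.
# capture_events_received_total is the received-events counter; verify it
# exists with list_prometheus_metric_names before querying.
sum by (namespace) (
  rate(capture_events_received_total{namespace=~"capture-.*"}[5m])
)
```

Swapping the regex for a single value (namespace="capture-analytics") narrows the same query to one pipeline.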

Envoy cluster naming

Envoy metrics use envoy_cluster_name to identify the upstream backend. Pattern: posthog_{deployment}_{port}.

Capture-related clusters: posthog_capture-analytics_3000, posthog_capture-ai_3000, posthog_capture-replay_3000, posthog_capture-mirrored_3000, posthog_capture-logs_4318, posthog_capture-logs-canary_4318.

capture-replay also has a proxy-as-a-service cluster used in KEDA autoscaling: proxy-as-a-service_capture-replay_3000.

Scope with: envoy_cluster_name=~"posthog_capture-.*".
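For example, the standard Envoy response-code-class counter can give a 5xx share per capture cluster. This is a sketch: envoy_cluster_upstream_rq_xx with the envoy_response_code_class label is the usual Envoy Prometheus shape, but confirm the exact names via discovery in this Grafana:

```promql
# Fraction of upstream responses that are 5xx, per capture-facing cluster
sum by (envoy_cluster_name) (
  rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name=~"posthog_capture-.*", envoy_response_code_class="5"}[5m])
)
/
sum by (envoy_cluster_name) (
  rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name=~"posthog_capture-.*"}[5m])
)
```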

Redis instance topology

Capture depends on up to three logical Redis instances, plus one external instance at the Envoy layer (not in the capture binary). None emit capture_redis_* metrics — Redis health is inferred from capture-side metrics and CloudWatch ElastiCache metrics.

1. Primary Redis (REDIS_URL env var)

  • ElastiCache: posthog-solo (prod-us) or posthog-prod-redis-encripted (prod-eu; sic — typo in actual cluster name)
  • Both envs use a read-only endpoint for the token cache (REDIS_READER_URL)
  • Backs: billing/quota limits (CaptureQuotaLimiter), session replay overflow limiter
  • Capture metrics: capture_billing_limits_loaded_tokens (by cache_key), capture_quota_limit_exceeded (by resource)
  • Quota resources: events, exceptions, llm_events, recordings, survey_responses
  • Cache keys: @posthog/quota-limits/{resource}, @posthog/capture-overflow/replay

2. Global Rate Limiter Redis (GLOBAL_RATE_LIMIT_REDIS_URL, optional)

  • ElastiCache: capture-globalratelimit-{env}-redis (prod-us, prod-eu; not dev)
  • Backs: per-(token, distinct_id) sliding-window rate limiter
  • Falls back to primary Redis when URL is unset
  • Optional read replica: GLOBAL_RATE_LIMIT_REDIS_READER_URL
  • Toggle: GLOBAL_RATE_LIMIT_ENABLED (may be off in some envs during rollout)
  • Metrics: global_rate_limiter_* (direct), capture_events_rerouted_overflow{reason="rate_limited"} (proxy signal)
  • CloudWatch cluster id: capture-globalratelimit-prod-redis

3. Event Restrictions Redis (EVENT_RESTRICTIONS_REDIS_URL)

  • ElastiCache: ingestion-prod-redis (separate writable cluster in both envs)
  • Stores Django-synced ingestion restriction configs
  • Falls back to primary Redis when URL is unset
  • Capture metrics: capture_event_restrictions_redis_fetch (labels: restriction_type, result in success/not_found/error/parse_error), capture_event_restrictions_stale, capture_event_restrictions_loaded_count

4. Contour Rate Limit Redis (ratelimit-{env}-redis) — NOT in capture binary

  • Per-IP DoS protection at the Envoy ingress layer, in front of capture
  • Metrics: ratelimit_service_* (label: domain="posthog")
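Since Redis health is inferred from capture-side metrics, a useful sketch is the error share of event-restriction fetches, using the result label documented in item 3 above (assuming the fetch metric is a counter; confirm with discovery):

```promql
# Fraction of event-restriction Redis fetches ending in error or parse_error,
# per namespace. result values: success / not_found / error / parse_error.
sum by (namespace) (
  rate(capture_event_restrictions_redis_fetch{result=~"error|parse_error"}[5m])
)
/
sum by (namespace) (
  rate(capture_event_restrictions_redis_fetch[5m])
)
```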

Metric prefixes

Every prefix here can be discovered live with list_prometheus_metric_names using datasourceUid: "victoriametrics" and regex: "<prefix>.*".

| Prefix | Domain | Scope label |
|---|---|---|
| capture_* | App metrics (~80 metrics) | namespace, container |
| http_requests_* | HTTP layer (shared) | namespace=~"capture-.*" |
| capture_kafka_* | Kafka producer (17 metrics) | namespace, container |
| capture_billing_* | Billing/quota tokens loaded | namespace, cache_key |
| capture_event_restrictions_* | Event restrictions (6 metrics) | namespace, restriction_type |
| capture_ai_otel_* | AI/OTel capture (12 metrics) | namespace="capture-ai" |
| envoy_cluster_* | L7 proxy | envoy_cluster_name=~"posthog_capture-.*" |
| aws_msk_* | MSK broker-side (JMX) | environment="prod-us" or "prod-eu" |
| ratelimit_service_* | Contour rate limit | domain="posthog" |
| overflow_redirect_* | Node.js ingestion overflow (downstream) | ingestion_pipeline |
| kube_* / container_* | K8s resources | namespace=~"capture-.*", pod=~"capture-.*" |

Kafka topics

Topics capture produces to (discover live via topic label on capture_kafka_produce_avg_batch_size_bytes). Partition counts are encoded in topic names and differ by env (EU generally has fewer partitions).

Capture writes to two different Kafka backing systems depending on the pipeline:

  • MSK ingestion cluster — analytics events (main, overflow, historical, turbo), heatmaps, error tracking, client warnings
  • WarpStream — session replay (warpstream-replay-v2 VC), logs (warpstream-logs VC), traces (warpstream-traces VC)
| Topic (prod-us / prod-eu) | Backing | Pipeline |
|---|---|---|
| ingestion-analytics-main-1024 / -512 | MSK | Main events |
| ingestion-analytics-overflow-128 | MSK | Overflow (rate-limited / high-volume tokens) |
| ingestion-analytics-historical-128 | MSK | Historical backfill events |
| ingestion-analytics-turbo-1024 | MSK | General turbo (prod-us only) |
| ingestion-heatmaps-main-128 | MSK | Heatmaps |
| ingestion-errortracking-main-128 | MSK | Error tracking |
| ingestion-errortracking-overflow-32 | MSK | Error tracking overflow |
| ingestion-clientwarnings-main-16 / -32 | MSK | Client warnings |
| ingestion-sessionreplay-main-512 / -256 | WarpStream | Session replay |
| ingestion-sessionreplay-overflow-64 / -32 | WarpStream | Session replay overflow |
| ingestion-logs | WarpStream | Log ingestion |
| ingestion-traces | WarpStream | Traces ingestion |
| ingestion-analytics-main-dlq (+ per-pipeline DLQ topics) | MSK | Dead letter queues |
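The topic label mentioned above doubles as a discovery mechanism. A sketch that both enumerates the topics capture is currently writing to and shows their batch sizes:

```promql
# Average produce batch size per topic. Because the series only exists for
# topics actually being produced to, this also enumerates live topics.
avg by (topic) (
  capture_kafka_produce_avg_batch_size_bytes{namespace=~"capture-.*"}
)
```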

Pyroscope services

| Service name | Deployment |
|---|---|
| capture-analytics/capture-analytics | Main capture |
| capture-ai/capture-ai | AI capture |
| capture-replay/capture-replay | Replay capture |
| capture-mirrored/capture-mirrored | Mirror/canary (when running) |
| capture-logs/capture-logs | Logs capture |

Profile types: process_cpu:cpu:nanoseconds:cpu:nanoseconds, wall:wall:nanoseconds:wall:nanoseconds, memory:inuse_space:bytes:inuse_space:bytes, memory:inuse_objects:count:inuse_space:bytes.

Grafana dashboards

| UID | Title |
|---|---|
| capture | Capture (golden-chart backend overview) |
| ddfkdj56ds11xce | Capture Golden (Synced folder) |
| dffkdlee8ub5s0a | Ingestion - Capture Golden |
| capture-3000-envoy-codes | Capture 3000 — Envoy Response Code Investigation |
| ingestion-general | Cross-service ingestion overview |
| ingestion-analytics | Ingestion - Analytics (per-pipeline breakdown) |
| ingestion-reliability | Error rates and reliability signals |
| ingestion-pipeline-performance | End-to-end pipeline latency |
| b2348f37-f276-498e-b72e-7cc2b5ec1455 | New capture (legacy) |
| contour | Envoy L7 proxy (set envoy_cluster_name=posthog_capture-analytics_3000) |
| ingestion-session-recordings | Session Replay ingestion |

Discovery workflows

Prometheus / VictoriaMetrics

  1. list_prometheus_metric_names with datasourceUid: "victoriametrics" and regex: "capture_.*" to enumerate app metrics
  2. Pick a metric, then list_prometheus_label_names scoped to it — see available dimensions
  3. list_prometheus_label_values — discover actual values for a label (e.g. labelName: "cause" on capture_events_dropped_total)
  4. query_prometheus with PromQL — always scope by namespace (or container) and set a time range
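Putting step 4 together with the discovery steps, a sketch of a final scoped query (run the label-value discovery in step 3 first to see which cause values exist):

```promql
# Dropped events by cause for the analytics pipeline, scoped by namespace
# per step 4. The cause values should be discovered live, not assumed.
sum by (cause) (
  rate(capture_events_dropped_total{namespace="capture-analytics"}[10m])
)
```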

Loki (logs)

  1. list_loki_label_names with datasourceUid: "P44D702D3E93867EC" (Loki-logs; do NOT use the primary Loki datasource P8E80F9AEF21F6940, which returns intermittent 502s)
  2. list_loki_label_values for app or namespace — find capture containers
  3. query_loki_logs — e.g. {app=~"capture.*"} |= "error"
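Beyond raw log lines, LogQL can aggregate matches into a metric. A sketch of an error-rate view per capture app, built from the same selector as step 3:

```logql
# Count of log lines containing "error" per capture app, in 5m windows
sum by (app) (
  count_over_time({app=~"capture.*"} |= "error" [5m])
)
```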

Pyroscope (profiling)

  1. list_pyroscope_profile_types with data_source_uid: "pyroscope"
  2. fetch_pyroscope_profile with matchers: '{service_name="capture-analytics/capture-analytics"}' and profile_type: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"

Dashboards

  1. search_dashboards — query "capture" or "ingestion"
  2. get_dashboard_by_uid — use a known UID (e.g. "capture") to get panel details
  3. get_dashboard_panel_queries — extract PromQL from existing panels

Redis / ElastiCache

  • Capture-side: discover capture_billing_*, capture_event_restrictions_*, capture_quota_* metrics in VictoriaMetrics
  • Infrastructure: CloudWatch datasource P034F075C744B399F for ElastiCache (CPU, memory, connections, latency). Cluster IDs: capture-globalratelimit-prod-redis, posthog-solo (prod-us primary)

Key metric domains

Categories of what to look for. Discover specific metrics live using the prefixes above.

HTTP layer — request rate, latency distribution (p50/p99), active connections, error rates by status code. Metrics: http_requests_* scoped by namespace, capture_active_connections.

Event lifecycle — the funnel from received to ingested to dropped/rerouted. capture_events_received_total -> capture_events_ingested_total -> capture_events_dropped_total. The cause label on drops has 20+ values (discover live). Key additions since the golden-chart migration: event_restriction_drop, event_too_big, otel_quota_drop, oversize_event, ai_opt_in, gathering, invalid_session, no_session_id, no_snapshot. Rerouting: capture_events_rerouted_overflow with reason label (rate_limited, force_limited, event_restriction). Also: capture_events_rerouted_custom_topic for topic-redirect restrictions.
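A quick way to watch that funnel as one number is the ingested-to-received ratio; a sustained value below ~1 means events are being dropped or rerouted somewhere in between (sketch, using the counters named above):

```promql
# Funnel health: fraction of received events that make it to ingested.
# Investigate capture_events_dropped_total / rerouted metrics when it dips.
sum(rate(capture_events_ingested_total{namespace=~"capture-.*"}[5m]))
/
sum(rate(capture_events_received_total{namespace=~"capture-.*"}[5m]))
```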

Kafka producer — broker connectivity (capture_kafka_any_brokers_down, capture_kafka_broker_connected), queue saturation (_queue_depth / _queue_depth_limit), produce RTT (capture_kafka_produce_rtt_latency_us by quantile and broker), delivery errors (capture_kafka_produce_errors_total).
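Queue saturation is most readable as a ratio of depth to its limit. The full metric names below are assumptions expanded from the _queue_depth / _queue_depth_limit suffixes above; confirm them with list_prometheus_metric_names (regex: "capture_kafka_.*") before use:

```promql
# Producer queue saturation per pipeline; values approaching 1 indicate
# backpressure. Metric names are assumed, not confirmed; discover them first.
max by (namespace) (
  capture_kafka_produce_queue_depth{namespace=~"capture-.*"}
  /
  capture_kafka_produce_queue_depth_limit{namespace=~"capture-.*"}
)
```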

Billing and quota — capture_billing_limits_loaded_tokens by cache_key, capture_quota_limit_exceeded by resource (events, exceptions, llm_events, recordings, survey_responses).

Event restrictions — capture_event_restrictions_* for Redis fetch health, staleness, loaded count, and applied restrictions by restriction_type (drop_event, force_overflow, redirect_to_topic, skip_person_processing).

Envoy proxy — upstream latency, response codes (2xx/4xx/5xx), connection health, circuit breakers (_open gauges), backend membership (healthy vs total), timeouts, retries. Always filter: envoy_cluster_name=~"posthog_capture-.*". For capture-replay, also check proxy-as-a-service_capture-replay_3000.

Contour rate limit — ratelimit_service_* for per-IP DoS protection. ratelimit_service_rate_limit_over_limit = actively rate-limited IPs.

MSK broker-side — aws_msk_* JMX metrics for capture-analytics, capture-ai, and other MSK-backed topics. Key signals: throttle time, network processor idle %, memory pool depletion, request queue size. Both envs have a dedicated ingestion MSK cluster separate from the events cluster (prod-us: c21; prod-eu: posthog-prod-eu-ingestion-2026-05-04).

WarpStream — warpstream_agent_* metrics for capture-replay, capture-logs, and traces. These pipelines produce to in-cluster WarpStream agents, not MSK. Key signals: warpstream_agent_control_plane_operation_counter (by operation), warpstream_agent_file_cache_client_fetch_local_or_remote_counter (cache hit/miss ratio). Dashboards: warpstream (Agent Overview), dbfj5c31spa1ogf (MSK vs WarpStream — Active Produce Topics). US-only personal dashboards (not synced to EU): ws-coarse-lag-explore (Coarse Lag), 8e93b023-… (CH Consumer Lag). Per-VC KMinion instances: kminion-warpstream-replay, kminion-warpstream-logs, kminion-warpstream-traces.

K8s resources — container_* and kube_* for CPU, memory, restarts, HPA state. Scope: namespace=~"capture-.*", pod=~"capture-.*".

Investigation playbooks

See references/investigation-playbooks.md for step-by-step workflows for common questions: health checks, event loss, latency, Kafka backpressure, rate limiting, Redis, and cross-env comparison.
