# Monitoring the ingestion pipeline with Grafana MCP
The ingestion pipeline (nodejs/) is PostHog's Node.js event processing layer.
It consumes events from Kafka (produced by the capture service), runs them through
processing steps (person resolution, group assignment, property overrides, etc.),
and produces enriched events to ClickHouse-bound Kafka topics.
A single codebase is deployed as many K8s Deployments via the posthog-app
Helm chart (golden-chart migration). Each deployment sets PLUGIN_SERVER_MODE
and is distinguished in metrics by two default Prometheus labels:
- `ingestion_pipeline`: values `analytics`, `heatmaps`, `clientwarnings`, `errortracking`
- `ingestion_lane`: values `main`, `overflow`, `historical`, `async`, `turbo`
The `app` label (set by K8s pod labels) matches the deployment name and is the most
universal scope filter across all telemetry domains.
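As a sketch, these labels compose into PromQL like the query below. The metric name `ingestion_event_dropped_total` is taken from the discovery examples later in this doc; the 5m window is illustrative, and both should be confirmed with live discovery before use.

```promql
# Drop rate per lane within the analytics pipeline.
# Verify the metric name with list_prometheus_metric_names first.
sum by (ingestion_lane) (
  rate(ingestion_event_dropped_total{ingestion_pipeline="analytics"}[5m])
)
```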
This skill teaches how to discover live metrics using the Grafana MCP tools rather than memorizing metric names that change as the code evolves.
## Environment context
The Grafana MCP is connected to a single Grafana instance scoped to one environment. If the user hasn't specified, ask which environment they want to investigate:
- prod-us — US production (us-east-1)
- prod-eu — EU production (eu-central-1)
Most ingestion app metrics are environment-specific by virtue of which Grafana
you're connected to — they don't carry an environment label. CloudWatch metrics
are also scoped to the connected Grafana's AWS account.
Cross-environment comparison requires switching Grafana instances (not possible in one session).
All datasource UIDs, dashboard UIDs, ingestion_pipeline/ingestion_lane label values,
and ClickHouse type label values are identical across prod-us and prod-eu.
Key differences:
- The `ingestion-analytics-turbo` deployment exists only in prod-us.
- Both envs use a dedicated ingestion MSK cluster separate from the events cluster; consumers point at `msk-ingestion`, not `events` (prod-us: c21; prod-eu: `posthog-prod-eu-ingestion-2026-05-04`).
- CloudWatch cluster IDs differ by region suffix (see topology sections below).
## Observability landscape
Six telemetry domains, all validated identical across prod-us and prod-eu:
| Domain | Datasource UID | Discovery tool | Scope filter |
|---|---|---|---|
| App metrics (VictoriaMetrics) | victoriametrics | list_prometheus_metric_names | See metric prefixes below |
| App metrics (realtime) | victoriametrics-realtime | same | same (lower retention, higher resolution) |
| Logs | P44D702D3E93867EC (Loki-logs) | list_loki_label_names | app=~"ingestion-.*" |
| Profiling | pyroscope | list_pyroscope_profile_types | See Pyroscope services below |
| CloudWatch (ElastiCache, MSK, RDS) | P034F075C744B399F | query_prometheus | env-specific cluster IDs |
| Dashboards | n/a | search_dashboards | query "ingestion" or deployment name |
Datasource notes:
- Do NOT use primary Loki (`P8E80F9AEF21F6940`); it returns 502 intermittently in both envs. Always use Loki-logs (`P44D702D3E93867EC`).
- Do NOT use CloudWatch Root (`PAAE47F430CFD1449`); it exists only in prod-us.
## Stable waypoints
These facts change infrequently and are hard to discover dynamically.
### Deployment roles
The table covers all `PLUGIN_SERVER_MODE=ingestion-v2` deployments (the "analytics ingestion" family), plus the specialized modes.
Each golden-chart deployment runs in its own namespace matching the deployment name.
| Deployment name | Mode | Pipeline | Lane | Consumer group | Consume topic (EU example) |
|---|---|---|---|---|---|
ingestion-analytics-main | ingestion-v2 | analytics | main | ingestion-analytics-main | ingestion-analytics-main-512 |
ingestion-analytics-overflow | ingestion-v2 | analytics | overflow | ingestion-analytics-overflow | ingestion-analytics-overflow-128 |
ingestion-analytics-historical | ingestion-v2 | analytics | historical | ingestion-analytics-historical | ingestion-analytics-historical-128 |
ingestion-analytics-async | ingestion-v2 | analytics | async | ingestion-analytics-async | ingestion-analytics-async-8 |
ingestion-analytics-turbo | ingestion-v2 | analytics | turbo | ingestion-analytics-turbo | ingestion-analytics-turbo-1024 |
ingestion-clientwarnings-main | ingestion-v2 | clientwarnings | — | ingestion-clientwarnings-main | ingestion-clientwarnings-main-32 |
ingestion-heatmaps-main | ingestion-v2 | heatmaps | — | ingestion-heatmaps-main | ingestion-heatmaps-main-128 |
ingestion-errortracking-main | ingestion-errortracking | errortracking | — | ingestion-errortracking-main | ingestion-errortracking-main-128 |
ingestion-errortracking-overflow | ingestion-errortracking | errortracking | — | ingestion-errortracking-overflow | ingestion-errortracking-overflow-32 |
logs-ingestion | ingestion-logs | — | — | logs-ingestion | ingestion-logs |
traces-ingestion | (traces) | — | — | traces-ingestion | ingestion-traces |
recordings-blob-ingestion-v2 | recordings-blob-ingestion-v2 | — | — | session-recordings-blob-v2 | ingestion-sessionreplay-main-256 |
recordings-blob-ingestion-v2-overflow | recordings-blob-ingestion-v2 | — | — | session-recordings-blob-v2-overflow | ingestion-sessionreplay-overflow-32 |
Notes:
- `ingestion-analytics-turbo` exists only in prod-us.
- Topic partition counts differ by env (e.g., main is 1024 in US, 512 in EU).
- Each deployment also has a DLQ topic (`ingestion-analytics-main-dlq`, etc.).
- KEDA autoscaling queries reference both old and new consumer group names during migration (e.g., `groupId=~"ingestion-events|ingestion-analytics-main"`).
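The primary per-partition lag signal can be scoped the same way the KEDA queries are. A sketch, with label names taken from the metric prefixes table below and the group regex treated as an example:

```promql
# Worst per-partition lag (ms) for the main analytics lane, matching both
# old and new consumer group names during the migration.
max by (partition) (
  ingestion_lag_ms{groupId=~"ingestion-events|ingestion-analytics-main"}
)
```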
### Metric prefixes
Every prefix here can be discovered live with list_prometheus_metric_names
using datasourceUid: "victoriametrics" and regex: "<prefix>.*".
| Prefix | Domain | Key scope labels |
|---|---|---|
ingestion_* | Core ingestion app metrics (~80 metrics) | app, ingestion_pipeline, ingestion_lane |
ingestion_lag_ms* | Per-partition lag (primary lag signal) | groupId, partition |
consumed_batch_* | Kafka consumer batch processing | topic, groupId |
consumer_batch_* / consumer_background_* | Consumer loop health | topic, groupId |
kafka_broker_* | librdkafka broker stats | broker_id, broker_name, consumer_group |
kafka_consumer_* | Consumer rebalance, assignment | groupId, type |
events_pipeline_* | Legacy pipeline step metrics | step_name |
person_* | Person processing (~30 metrics) | db_write_mode, operation, method |
group_* (non-AWS) | Group processing | operation |
personhog_* | PersonHog gRPC client + service | method, source, client |
overflow_redirect_* | Stateful overflow routing | type, result, decision, operation |
cookieless_* | Cookieless mode | — |
http_request_duration_seconds | HTTP health/readiness server | method, route, status_code |
recording_blob_ingestion_v2_* | Session replay ingestion | app |
logs_ingestion_* | Logs ingestion pipeline | app |
error_tracking_* / cymbal_* | Error tracking pipeline | app |
kminion_kafka_* | KMinion consumer group lag & topic offsets | group_id, topic_name, partition_id |
aws_msk_kafka_* | MSK broker-side JMX metrics | environment |
warpstream_agent_* | WarpStream agent metrics (~10 metrics) | virtual_cluster_id, agent_group, operation |
kube_* / container_* | K8s resources | namespace=~"ingestion-.*", container=~"ingestion-.*" |
pg_* / pgbouncer_* | Postgres exporter | varies |
ClickHouseMetrics_* / ClickHouseProfileEvents_* / ClickHouseAsyncMetrics_* | ClickHouse cluster health | type (=cluster role) |
kafka_connect_* | Kafka Connect bridge to ClickHouse | namespace, connector |
posthog_celery_clickhouse_* | CH health monitors from Django celery | scenario |
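To illustrate querying one of these prefixes: the metric `ingestion_pipeline_results` and its `result` label are referenced elsewhere in this doc, but confirm both with discovery before relying on this sketch.

```promql
# Pipeline outcome distribution (ingested / dropped / DLQ'd, etc.)
# for the main analytics lane.
sum by (result) (
  rate(ingestion_pipeline_results{ingestion_pipeline="analytics", ingestion_lane="main"}[5m])
)
```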
### Redis topology
Ingestion workers depend on up to five Redis instances. Redis health is inferred from ingestion-side metrics and CloudWatch ElastiCache metrics.
| Redis instance | ElastiCache cluster (prod-us) | Env var | Use |
|---|---|---|---|
| Ingestion Redis | ingestion-prod-redis | INGESTION_REDIS_HOST | Overflow state, pub/sub coordination |
| PostHog/Primary Redis | posthog-solo | POSTHOG_REDIS_HOST | Billing/quota, restrictions, general |
| Cookieless Redis | cookieless-prod-redis | COOKIELESS_REDIS_HOST | Cookieless server hash mode |
| CDP Redis | cdp-delivery-prod-redis | CDP_REDIS_HOST | CDP Hog function delivery |
| Dedup Redis | ingestion-duplicates-prod-redis | DEDUPLICATION_REDIS_HOST | Event deduplication |
Ingestion-side Redis metrics: overflow_redirect_redis_*, cookieless_redis_error.
Infrastructure-side: CloudWatch datasource P034F075C744B399F.
prod-eu uses the same logical cluster names but different endpoint suffixes.
The prod-eu primary Redis is posthog-prod-redis-encripted (sic — the typo is in the actual cluster name).
The event restrictions Redis (ingestion-prod-redis) is a separate writable cluster from
the primary — capture-analytics and ingestion workers both use it via EVENT_RESTRICTIONS_REDIS_URL.
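An ingestion-side Redis health sketch: the `overflow_redirect_redis_*` prefix is real, but the full metric name below is hypothetical, so enumerate the actual names first.

```promql
# HYPOTHETICAL metric name for illustration only. Discover the real ones
# with list_prometheus_metric_names, regex "overflow_redirect_redis_.*".
sum by (operation, result) (
  rate(overflow_redirect_redis_operations_total[5m])
)
```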
### Kafka topology
Ingestion workers interact with three Kafka systems, via four separate producer/consumer configs:
1. MSK ingestion cluster (consume side): `KAFKA_CONSUMER_METADATA_BROKER_LIST`. Capture produces here; ingestion workers consume. Carries all `ingestion-analytics-*`, `ingestion-errortracking-*`, `ingestion-heatmaps-*`, and `ingestion-clientwarnings-*` topics. Both envs have a dedicated cluster (prod-us: c21, 12 brokers; prod-eu: `posthog-prod-eu-ingestion-2026-05-04`). KMinion: `kminion-msk-ingestion`.
2. WarpStream ingestion VC (output side): `KAFKA_WARPSTREAM_PRODUCER_METADATA_BROKER_LIST`. Ingestion workers produce ALL ClickHouse-bound output here (`clickhouse_events_json`, `clickhouse_person`, `clickhouse_groups`, `clickhouse_heatmap_events`, `clickhouse_ai_events_json`, etc.). In-cluster WarpStream agents (`warpstream-ingestion-v2`) with multi-AZ pools; plaintext, no TLS. KMinion: `kminion-warpstream-ingestion`.
3. MSK ingestion cluster (feedback/DLQ producer): the same physical cluster as item 1, with a different producer config via `KAFKA_INGESTION_PRODUCER_METADATA_BROKER_LIST`. Used for overflow/DLQ/async topics that route events BACK to the ingestion system.
4. MSK (events/analytics): the original events cluster. Still carries some legacy topics and ClickHouse consumer groups during migration. prod-us: `posthog-prod-us-events-2026-03-08` (12 brokers, `kafka.m7g.8xlarge`); prod-eu: `posthog-prod-eu-events-2025-10-16` (15 brokers). KMinion: `kminion-msk-analytics`.
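Each system has its own KMinion instance, so lag queries should pin `app_kubernetes_io_instance` to the right one. For example, a sketch of consume-side lag on the MSK ingestion cluster (label usage follows the ClickHouse health-signal examples later in this doc):

```promql
# Consumer lag on the MSK ingestion cluster (consume side).
# Swap the instance label for kminion-warpstream-ingestion or
# kminion-msk-analytics to inspect the other systems.
sum by (group_id, topic_name) (
  kminion_kafka_consumer_group_topic_lag{app_kubernetes_io_instance="kminion-msk-ingestion"}
)
```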
WarpStream virtual clusters — each VC is a separate logical cluster backed by S3, with its own agent pool, KMinion instance, and topic namespace:
| VC name | Topics carried | KMinion instance |
|---|---|---|
warpstream-ingestion-v2 | clickhouse_events_json, clickhouse_person, clickhouse_person_distinct_id, clickhouse_groups, clickhouse_ai_events_json, clickhouse_heatmap_events, clickhouse_app_metrics2, clickhouse_tophog, clickhouse_ingestion_warnings, distinct_id_usage_events_json, log_entries, team_event_partitioned_events_json | kminion-warpstream-ingestion |
warpstream-replay-v2 | ingestion-sessionreplay-main-*, clickhouse_session_replay_events, clickhouse_session_replay_features | kminion-warpstream-replay |
warpstream-logs | ingestion-logs, clickhouse_logs | kminion-warpstream-logs |
warpstream-traces | ingestion-traces, clickhouse_traces | kminion-warpstream-traces |
warpstream-shared | clickhouse_document_embeddings, error tracking fingerprint/issue topics, document_embeddings_input | kminion-warpstream-shared |
warpstream-calculated-events | clickhouse_precalculated_person_properties, clickhouse_prefiltered_events, cohort_membership_changed (US only) | kminion-warpstream-calculated-events |
warpstream-cyclotron | CDP topics (cdp_cyclotron_hog*, cdp_internal_events, etc.) | kminion-warpstream-cyclotron |
warpstream-warehouse-pipelines | data_warehouse_source_webhooks, data_warehouse_sources_jobs (US only) | kminion-warpstream-warehouse-pipelines |
WarpStream agent metrics: warpstream_agent_* prefix (~10 metrics).
Key: control_plane_operation_counter (by operation), file_cache_client_fetch_local_or_remote_counter (cache hit/miss).
Labels: virtual_cluster_id, agent_group (default, general, multi-az).
Dashboards: warpstream (Agent Overview), dbfj5c31spa1ogf (MSK vs WarpStream — Active Produce Topics).
US-only personal dashboards (not synced to EU): 8e93b023-… (CH Consumer Lag),
ws-coarse-lag-explore (Coarse Lag exploration).
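A sketch for the agent metrics named above. The full metric name assumes the `warpstream_agent_` prefix plus the key listed; the `virtual_cluster_id` matcher is an assumption to verify with label-value discovery.

```promql
# Control plane operation rate per operation for the ingestion VC.
# Metric name and VC matcher are assumptions; confirm via discovery.
sum by (operation) (
  rate(warpstream_agent_control_plane_operation_counter{virtual_cluster_id=~".*ingestion.*"}[5m])
)
```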
### Postgres topology
| DB | Aurora cluster (prod-us) | Ingestion PgBouncer |
|---|---|---|
| Main app DB | posthog-cloud-prod-us-east-1 (2x db.r8g.16xlarge) | ingestion-default-pgbouncer.posthog.svc.cluster.local |
| Persons DB | posthog-cloud-persons-prod-us-east-1 (3x db.r8g.24xlarge) | ingestion-events-pgbouncer.posthog.svc.cluster.local |
Postgres metrics via prometheus-postgres-exporter and prometheus-postgres-persons-exporter.
prod-eu: posthog-cloud-prod-eu-central-1 and posthog-cloud-persons-prod-eu-central-1.
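Connection saturation can be sketched from standard postgres_exporter metrics. These names come from the stock exporter, not from this doc; verify they exist in this environment before alerting on them.

```promql
# Fraction of max_connections in use, per instance.
# pg_stat_activity_count and pg_settings_max_connections are standard
# postgres_exporter metric names (assumed, not confirmed here).
sum by (instance) (pg_stat_activity_count)
  / on (instance) pg_settings_max_connections
```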
### ClickHouse topology
ClickHouse is the ultimate downstream consumer of events the ingestion pipeline produces. Ingestion workers never talk to CH directly — they publish to Kafka topics which CH consumes via its built-in Kafka engine and Kafka Connect (DuckLake). CH health directly impacts perceived ingestion quality: if CH falls behind on consumption, users see stale data.
Cluster roles (discovered via type label on ClickHouseMetrics_*):
| `type` label | Role | Notes |
|---|---|---|
events | Main analytics events cluster | Consumes clickhouse_events_json |
online | Online/fast queries cluster | Replicated from events |
offline | Offline/batch queries cluster | Replicated from events |
medium | Medium-sized tables | Persons, groups |
small | Small/config tables | Infrequent writes |
sessions | Session replay data | Consumes session recording topics |
logs | Logs cluster | Consumes logs topics |
logs-new-schema | Logs new schema migration | Migration target |
ai-events | AI/LLM events | Consumes AI events topics |
endpoints | API endpoints cluster | Lightweight |
migrations | Migration-specific | Schema changes |
aux / ops | Auxiliary/operations | Maintenance |
batch-exports | Batch exports | prod-us has this; may not exist in prod-eu |
test | Testing cluster | May not exist in all envs |
Most type label values are identical across prod-us and prod-eu. Minor differences
like batch-exports or test may exist only in one env.
Two consumption paths from Kafka to ClickHouse:
ClickHouse now reads primarily from the WarpStream ingestion VC (where ingestion workers produce their output), not directly from MSK.
1. ClickHouse Kafka Engine: native CH feature. Metrics prefixed `ClickHouseProfileEvents_Kafka*` (e.g., `KafkaMessagesPolled`, `KafkaRowsRead`, `KafkaRowsRejected`, `KafkaCommitFailures`). Consumer groups: `clickhouse_events_json` (prod-us), `group1`/`group1_recent` (prod-eu). These groups exist on BOTH MSK analytics and WarpStream ingestion; use the correct KMinion instance to distinguish them: `kminion-warpstream-ingestion` for WarpStream, `kminion-msk-analytics` for MSK.
2. Kafka Connect: runs in the `kafka-connect` namespace and uses the DuckLake sink connector. Metrics prefixed `kafka_connect_*` and `kafka_connect_ducklake_sink_task_metrics_*`. Consumer groups: `connect-events-ducklake*`.
Key health signals for ingestion operators:
- `kminion_kafka_consumer_group_topic_lag{app_kubernetes_io_instance="kminion-warpstream-ingestion", group_id=~"clickhouse_events_json|group1|group1_recent", topic_name="clickhouse_events_json"}`: lag between ingestion output and CH consumption on the WarpStream path (the primary path)
- `kminion_kafka_consumer_group_topic_lag_seconds` with the same group filter: the same signal in seconds
- `ClickHouseProfileEvents_KafkaRowsRejected`: rows CH couldn't parse/insert
- `ClickHouseProfileEvents_FailedInsertQuery`: insert failures (schema issues, too many parts, etc.)
- `ClickHouseAsyncMetrics_MaxPartCountForPartition`: rising part count = merge pressure
- `ClickHouseMetrics_ReadonlyReplica`: replicas that fell behind and went read-only
- `ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay`: max replication delay
- `posthog_celery_clickhouse_table_parts_count` / `_table_row_count`: Django-side CH health monitors
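For instance, two of these signals as concrete queries (metric names as listed; the `type` scoping follows the cluster-role table, and the rate window is illustrative):

```promql
# Rows the events cluster failed to parse on the Kafka engine path.
sum by (type) (
  rate(ClickHouseProfileEvents_KafkaRowsRejected{type="events"}[10m])
)

# Replicas currently read-only anywhere in the fleet (should be 0).
sum by (type) (ClickHouseMetrics_ReadonlyReplica)
```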
### Pyroscope services
Both old (ingestion/{name}) and new ({namespace}/{name}) formats coexist in Pyroscope
during the golden-chart migration. Prefer the new {namespace}/{name} format.
| Service name (new format) | Deployment |
|---|---|
ingestion-analytics-main/ingestion-analytics-main | Main analytics |
ingestion-analytics-overflow/ingestion-analytics-overflow | Overflow lane |
ingestion-analytics-historical/ingestion-analytics-historical | Historical lane |
ingestion-analytics-async/ingestion-analytics-async | Async lane |
ingestion-analytics-turbo/ingestion-analytics-turbo | Turbo lane (prod-us only) |
ingestion-heatmaps-main/ingestion-heatmaps-main | Heatmaps |
ingestion-clientwarnings-main/ingestion-clientwarnings-main | Client warnings |
ingestion-errortracking-main/ingestion-errortracking-main | Error tracking |
ingestion-errortracking-overflow/ingestion-errortracking-overflow | Error tracking overflow |
logs-ingestion/logs-ingestion | Logs ingestion |
traces-ingestion/traces-ingestion | Traces ingestion |
recordings/recordings-blob-ingestion-v2 | Session replay |
recordings/recordings-blob-ingestion-v2-overflow | Session replay overflow |
Profile types: `process_cpu:cpu:nanoseconds:cpu:nanoseconds`,
`wall:wall:nanoseconds:wall:nanoseconds`,
`memory:inuse_space:bytes:inuse_space:bytes`,
`memory:inuse_objects:count:inuse_space:bytes`.
### Grafana dashboards
| UID | Title | Focus |
|---|---|---|
ingestion-general | Ingestion - General | Cross-service overview, E2E lag, topic flow |
ingestion-analytics | Ingestion - Analytics | Per-pipeline analytics breakdown |
ingestion-health | Ingestion - Health | Health overview across all ingestion services |
ingestion-pipelines | Ingestion - Pipelines | Per-lane pipeline step breakdown |
ingestion-pipeline-performance | Ingestion - Pipeline Performance | Step latency, batch utilization |
ingestion-reliability | Ingestion - Reliability | Error rates, DLQ, drop causes |
ingestion-autoscaling | Ingestion - Autoscaling | HPA/KEDA scaling |
ingestion-person-processing | Ingestion -- Person Processing | Person store, merge, cache |
ingestion-group-processing | Ingestion -- Group Processing | Group store |
ingestion-session-recordings | Session Replay -- Ingestion | Replay blob pipeline |
dffkdlee8ub5s0a | Ingestion - Capture Golden | Capture-specific ingestion metrics (golden chart) |
cesaxfujkyl8gf | Ingestion - Deduplication | Event deduplication pipeline |
pl-ingestion-slas | Ingestion — SLIs / SLOs / SLAs | Dynamic SLI/SLO view from ingestion_sli_* metrics |
warpstream | Warpstream Agent Overview | Agent health, control plane ops, file cache |
dbfj5c31spa1ogf | MSK vs Warpstream — Active Produce Topics | Side-by-side produce volume comparison |
8e93b023-a544-4a3b-8fac-123459d4eb84 | WarpStream: ClickHouse Consumer Lag | CH consumer lag on WarpStream topics (US only) |
ws-coarse-lag-explore | WarpStream Coarse Lag — Explore | Agent-reported lag (US only, personal dashboard) |
ceef2kuqw66tca | Ingestion copy for warpstream | Legacy WarpStream migration view |
personhog-service | Personhog service | PersonHog latency decomposition |
dbfgkwxs3gw8owd | KMinion Consumer Group Lag | Consumer lag by group (including CH groups) |
logs | Logs (product) | Logs ingestion |
vm-clickhouse-cluster-overview | ClickHouse (cluster overview) | QPS, memory, disk, replication, parts, merges |
8aa35a4a-091a-4645-ac8f-ae46901f0060 | ClickHouse Ingestion Layer - Resource Usage | K8s resources for chi-ingestion-* pods |
ddpxkllwxg268e | ClickHouse - Kafka consumption | CH Kafka engine consumption stats |
clickhouse-keeper | ClickHouse Keeper | ZooKeeper replacement health |
ef7h2todfg4xsd | New ClickHouse Cluster Merge Overview | Merge throughput |
cdzv7o1635n9ca | Kafka Connect | Kafka Connect tasks, lag, DuckLake sink |
deoz13wy08wsga | ClickHouse - Disk capacity (EU ONLY) | EU-specific disk dashboard |
## Discovery workflows
### Prometheus / VictoriaMetrics
- `list_prometheus_metric_names` with `datasourceUid: "victoriametrics"`, `regex: "ingestion_.*"` to enumerate app metrics
- Pick a metric, then `list_prometheus_label_names` scoped to it to see the available dimensions
- `list_prometheus_label_values` to discover actual values for a label (e.g. `labelName: "cause"` on `ingestion_event_dropped_total`)
- `query_prometheus` with PromQL; always scope by `app` or `ingestion_pipeline` and set a time range
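Putting the discovery steps above together, a hedged example query (the metric and its `cause` label come from the label-values example; re-verify both live):

```promql
# Why are events being dropped in the analytics pipeline?
sum by (cause) (
  rate(ingestion_event_dropped_total{ingestion_pipeline="analytics"}[5m])
)
```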
Repeat with other prefixes: consumed_batch_*, person_*, personhog_*,
overflow_redirect_*, ClickHouseMetrics_*, kafka_connect_*, etc.
### Loki (logs)
- `list_loki_label_names` with `datasourceUid: "P44D702D3E93867EC"`
- `list_loki_label_values` for `app` to find ingestion containers (values like `ingestion-analytics-main`, `logs-ingestion`, etc.)
- `query_loki_logs`: e.g. `{app=~"ingestion-.*"} |= "error"` or `{namespace="clickhouse"} |= "Exception"`
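The selectors above can be turned into a LogQL rate query, e.g. error-log volume per ingestion app (window is illustrative):

```logql
# Error-log throughput per ingestion deployment over 5m windows.
sum by (app) (rate({app=~"ingestion-.*"} |= "error" [5m]))
```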
### Pyroscope (profiling)
- `list_pyroscope_profile_types` with `data_source_uid: "pyroscope"`
- `fetch_pyroscope_profile` with `matchers: '{service_name="ingestion-analytics-main/ingestion-analytics-main"}'` and `profile_type: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"`
### Dashboards
- `search_dashboards`: query `"ingestion"`, `"clickhouse"`, or a specific deployment name
- `get_dashboard_by_uid`: use a known UID (e.g. `"ingestion-general"`) to get panel details
- `get_dashboard_panel_queries`: extract PromQL from existing panels
### Redis / ElastiCache
- Ingestion-side: discover `overflow_redirect_redis_*` and `cookieless_redis_*` metrics in VictoriaMetrics
- Infrastructure: CloudWatch datasource `P034F075C744B399F` for ElastiCache (CPU, memory, connections, latency). Cluster IDs: `ingestion-prod-redis`, `posthog-solo` (prod-us primary), `ingestion-duplicates-prod-redis`, `cookieless-prod-redis`
### Postgres / Aurora
- Ingestion-side: discover `postgres_error_total`, `person_*`, and `group_*` metrics
- Infrastructure: `prometheus-postgres-exporter` and `prometheus-postgres-persons-exporter` metrics
- CloudWatch RDS: datasource `P034F075C744B399F` with cluster IDs `posthog-cloud-prod-us-east-1` / `posthog-cloud-persons-prod-us-east-1`
### ClickHouse
- Cluster health: `list_prometheus_metric_names` with regex `"ClickHouseMetrics_.*"` or `"ClickHouseProfileEvents_.*"`; scope with the `type` label for cluster role (e.g. `type="events"`)
- Kafka engine health: regex `"ClickHouseProfileEvents_Kafka.*"`
- Kafka Connect: regex `"kafka_connect_.*"`; scope with `namespace="kafka-connect"`
- Consumer lag (the bridge): `kminion_kafka_consumer_group_topic_lag` with `group_id=~"clickhouse_events_json|group1|group1_recent|connect-events-ducklake.*"` and `topic_name="clickhouse_events_json"`. Important: scope with `app_kubernetes_io_instance="kminion-warpstream-ingestion"` for the WarpStream path (primary) or `"kminion-msk-analytics"` for the MSK path.
- Logs: `{namespace="clickhouse"} |= "Exception"` or `{namespace="kafka-connect"}`
- Dashboards: `vm-clickhouse-cluster-overview`, `8aa35a4a-091a-4645-ac8f-ae46901f0060`, `cdzv7o1635n9ca`, `8e93b023-a544-4a3b-8fac-123459d4eb84` (WarpStream CH consumer lag)
## Key metric domains
Categories of what to look for. Discover specific metrics live using the prefixes above.
Kafka consumer health — batch duration, messages consumed per batch, consumer group
assignment/rebalance events, consumer lag (via KMinion). Metrics: consumed_batch_*,
kafka_consumer_*, kminion_kafka_consumer_group_topic_lag* scoped by group_id.
Pipeline processing — step-level latency and error rates, pipeline result distribution
(ingested, filtered, dropped, DLQ'd). Metrics: events_pipeline_step_ms,
events_pipeline_step_error_total, ingestion_pipeline_results by result.
Person/group stores — person flush latency, cache hit rates, Postgres write latency,
merge failures, properties size. Metrics: person_*, group_*, personhog_*.
Outputs — Kafka production to ClickHouse-bound topics. Message size, latency, errors.
Metrics: ingestion_outputs_* by topic.
Overflow routing — stateful overflow decisions, Redis operations for overflow state.
Metrics: overflow_redirect_* by type, result, decision.
ClickHouse downstream health — CH cluster QPS, memory, disk, merge pressure,
replication lag, Kafka engine consumption (rows read/rejected/failed), Kafka Connect
task health and consumer lag. This tells you whether events are actually making it
to the query layer. Metrics: ClickHouseMetrics_*, ClickHouseProfileEvents_*,
ClickHouseAsyncMetrics_* scoped by type; kafka_connect_* scoped by namespace.
K8s resources — container_* and kube_* for CPU, memory, restarts, HPA state.
Scope: namespace=~"ingestion-.*", pod=~"ingestion-.*" (or namespace="clickhouse",
pod=~"chi-ingestion-.*" for CH ingestion pods).
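As a sketch using standard cAdvisor/kube-state-metrics names (generic K8s metrics, not ingestion-specific; verify the label shapes in this environment):

```promql
# CPU usage per ingestion namespace (cAdvisor metric).
sum by (namespace) (
  rate(container_cpu_usage_seconds_total{namespace=~"ingestion-.*", container!=""}[5m])
)

# Container restarts in the last hour per ingestion namespace.
sum by (namespace) (
  increase(kube_pod_container_status_restarts_total{namespace=~"ingestion-.*"}[1h])
)
```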
## Investigation playbooks
See references/investigation-playbooks.md for step-by-step workflows covering: health checks, event drops, latency, consumer lag, person processing, Kafka/MSK issues, Redis, Postgres, session replay, ClickHouse downstream health, and cross-environment comparison.