monitoring-ingestion-pipeline by PostHog


npx skills add https://github.com/posthog/posthog --skill monitoring-ingestion-pipeline

Monitoring the ingestion pipeline with Grafana MCP

The ingestion pipeline (nodejs/) is PostHog's Node.js event processing layer. It consumes events from Kafka (produced by the capture service), runs them through processing steps (person resolution, group assignment, property overrides, etc.), and produces enriched events to ClickHouse-bound Kafka topics.

A single codebase is deployed as many K8s Deployments via the posthog-app Helm chart (golden-chart migration). Each deployment sets PLUGIN_SERVER_MODE and is distinguished in metrics by two default Prometheus labels:

  • ingestion_pipeline — values: analytics, heatmaps, clientwarnings, errortracking
  • ingestion_lane — values: main, overflow, historical, async, turbo

The app label (set by K8s pod labels) matches the deployment name and is the most universal scope filter across all telemetry domains.

This skill teaches how to discover live metrics using the Grafana MCP tools rather than memorizing metric names that change as the code evolves.

Environment context

The Grafana MCP is connected to a single Grafana instance scoped to one environment. If the user hasn't specified, ask which environment they want to investigate:

  • prod-us — US production (us-east-1)
  • prod-eu — EU production (eu-central-1)

Most ingestion app metrics are environment-specific by virtue of which Grafana you're connected to — they don't carry an environment label. CloudWatch metrics are also scoped to the connected Grafana's AWS account.

Cross-environment comparison requires switching Grafana instances (not possible in one session).

All datasource UIDs, dashboard UIDs, ingestion_pipeline/ingestion_lane label values, and ClickHouse type label values are identical across prod-us and prod-eu.

Key differences:

  • ingestion-analytics-turbo deployment exists only in prod-us.
  • Both envs use a dedicated ingestion MSK cluster separate from the events cluster — consumers point at msk-ingestion not events (prod-us: c21; prod-eu: posthog-prod-eu-ingestion-2026-05-04).
  • CloudWatch cluster IDs differ by region suffix (see topology sections below).

Observability landscape

Six telemetry domains, all validated identical across prod-us and prod-eu:

| Domain | Datasource UID | Discovery tool | Scope filter |
|---|---|---|---|
| App metrics (VictoriaMetrics) | victoriametrics | list_prometheus_metric_names | See metric prefixes below |
| App metrics (realtime) | victoriametrics-realtime | same | same (lower retention, higher resolution) |
| Logs | P44D702D3E93867EC (Loki-logs) | list_loki_label_names | app=~"ingestion-.*" |
| Profiling | pyroscope | list_pyroscope_profile_types | See Pyroscope services below |
| CloudWatch (ElastiCache, MSK, RDS) | P034F075C744B399F | query_prometheus | env-specific cluster IDs |
| Dashboards | n/a | search_dashboards | query "ingestion" or deployment name |

Datasource notes:

  • Do NOT use primary Loki (P8E80F9AEF21F6940) — it returns 502 intermittently in both envs. Always use Loki-logs (P44D702D3E93867EC).
  • Do NOT use CloudWatch Root (PAAE47F430CFD1449) — it exists only in prod-us.

Stable waypoints

These facts change infrequently and are hard to discover dynamically.

Deployment roles

The table below covers all PLUGIN_SERVER_MODE=ingestion-v2 deployments (the "analytics ingestion" family) plus the specialized modes. Each golden-chart deployment runs in its own namespace matching the deployment name.

| Deployment name | Mode | Pipeline | Lane | Consumer group | Consume topic (EU example) |
|---|---|---|---|---|---|
| ingestion-analytics-main | ingestion-v2 | analytics | main | ingestion-analytics-main | ingestion-analytics-main-512 |
| ingestion-analytics-overflow | ingestion-v2 | analytics | overflow | ingestion-analytics-overflow | ingestion-analytics-overflow-128 |
| ingestion-analytics-historical | ingestion-v2 | analytics | historical | ingestion-analytics-historical | ingestion-analytics-historical-128 |
| ingestion-analytics-async | ingestion-v2 | analytics | async | ingestion-analytics-async | ingestion-analytics-async-8 |
| ingestion-analytics-turbo | ingestion-v2 | analytics | turbo | ingestion-analytics-turbo | ingestion-analytics-turbo-1024 |
| ingestion-clientwarnings-main | ingestion-v2 | clientwarnings | | ingestion-clientwarnings-main | ingestion-clientwarnings-main-32 |
| ingestion-heatmaps-main | ingestion-v2 | heatmaps | | ingestion-heatmaps-main | ingestion-heatmaps-main-128 |
| ingestion-errortracking-main | ingestion-errortracking | errortracking | | ingestion-errortracking-main | ingestion-errortracking-main-128 |
| ingestion-errortracking-overflow | ingestion-errortracking | errortracking | | ingestion-errortracking-overflow | ingestion-errortracking-overflow-32 |
| logs-ingestion | ingestion-logs | | | logs-ingestion | ingestion-logs |
| traces-ingestion | (traces) | | | traces-ingestion | ingestion-traces |
| recordings-blob-ingestion-v2 | recordings-blob-ingestion-v2 | | | session-recordings-blob-v2 | ingestion-sessionreplay-main-256 |
| recordings-blob-ingestion-v2-overflow | recordings-blob-ingestion-v2 | | | session-recordings-blob-v2-overflow | ingestion-sessionreplay-overflow-32 |

Notes:

  • ingestion-analytics-turbo exists only in prod-us.
  • Topic partition counts differ by env (e.g., main is 1024 in US, 512 in EU).
  • Each deployment also has a DLQ topic (ingestion-analytics-main-dlq, etc.).
  • KEDA autoscaling queries reference both old and new consumer group names during migration (e.g., groupId=~"ingestion-events|ingestion-analytics-main").
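
As a sketch of how the two scope labels combine in practice (the counter name below is illustrative, not a real metric — enumerate actual names with list_prometheus_metric_names first):

```promql
# Hypothetical example: per-lane throughput for the analytics pipeline.
# Replace ingestion_events_processed_total with a real counter discovered
# via regex "ingestion_.*".
sum by (ingestion_lane) (
  rate(ingestion_events_processed_total{ingestion_pipeline="analytics"}[5m])
)
```

The same two labels slice any app metric, so one query template covers every lane of a pipeline without listing deployments explicitly.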

Metric prefixes

Every prefix here can be discovered live with list_prometheus_metric_names using datasourceUid: "victoriametrics" and regex: "<prefix>.*".

| Prefix | Domain | Key scope labels |
|---|---|---|
| ingestion_* | Core ingestion app metrics (~80 metrics) | app, ingestion_pipeline, ingestion_lane |
| ingestion_lag_ms* | Per-partition lag (primary lag signal) | groupId, partition |
| consumed_batch_* | Kafka consumer batch processing | topic, groupId |
| consumer_batch_* / consumer_background_* | Consumer loop health | topic, groupId |
| kafka_broker_* | librdkafka broker stats | broker_id, broker_name, consumer_group |
| kafka_consumer_* | Consumer rebalance, assignment | groupId, type |
| events_pipeline_* | Legacy pipeline step metrics | step_name |
| person_* | Person processing (~30 metrics) | db_write_mode, operation, method |
| group_* (non-AWS) | Group processing | operation |
| personhog_* | PersonHog gRPC client + service | method, source, client |
| overflow_redirect_* | Stateful overflow routing | type, result, decision, operation |
| cookieless_* | Cookieless mode | |
| http_request_duration_seconds | HTTP health/readiness server | method, route, status_code |
| recording_blob_ingestion_v2_* | Session replay ingestion | app |
| logs_ingestion_* | Logs ingestion pipeline | app |
| error_tracking_* / cymbal_* | Error tracking pipeline | app |
| kminion_kafka_* | KMinion consumer group lag & topic offsets | group_id, topic_name, partition_id |
| aws_msk_kafka_* | MSK broker-side JMX metrics | environment |
| warpstream_agent_* | WarpStream agent metrics (~10 metrics) | virtual_cluster_id, agent_group, operation |
| kube_* / container_* | K8s resources | namespace=~"ingestion-.*", container=~"ingestion-.*" |
| pg_* / pgbouncer_* | Postgres exporter | varies |
| ClickHouseMetrics_* / ClickHouseProfileEvents_* / ClickHouseAsyncMetrics_* | ClickHouse cluster health | type (=cluster role) |
| kafka_connect_* | Kafka Connect bridge to ClickHouse | namespace, connector |
| posthog_celery_clickhouse_* | CH health monitors from Django celery | scenario |
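
The lag prefix above is the primary per-partition signal. A minimal query sketch (this assumes ingestion_lag_ms is itself a gauge series — verify the exact name with regex "ingestion_lag_ms.*" first):

```promql
# Worst-partition lag in ms for the main analytics consumer group.
max by (partition) (
  ingestion_lag_ms{groupId="ingestion-analytics-main"}
)
```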

Redis topology

Ingestion workers depend on up to five Redis instances. Redis health is inferred from ingestion-side metrics and CloudWatch ElastiCache metrics.

| Redis instance | ElastiCache cluster (prod-us) | Env var | Use |
|---|---|---|---|
| Ingestion Redis | ingestion-prod-redis | INGESTION_REDIS_HOST | Overflow state, pub/sub coordination |
| PostHog/Primary Redis | posthog-solo | POSTHOG_REDIS_HOST | Billing/quota, restrictions, general |
| Cookieless Redis | cookieless-prod-redis | COOKIELESS_REDIS_HOST | Cookieless server hash mode |
| CDP Redis | cdp-delivery-prod-redis | CDP_REDIS_HOST | CDP Hog function delivery |
| Dedup Redis | ingestion-duplicates-prod-redis | DEDUPLICATION_REDIS_HOST | Event deduplication |

Ingestion-side Redis metrics: overflow_redirect_redis_*, cookieless_redis_error. Infrastructure-side: CloudWatch datasource P034F075C744B399F.

prod-eu uses the same logical cluster names but different endpoint suffixes. The prod-eu primary Redis is posthog-prod-redis-encripted (sic — the typo is in the actual cluster name). The event restrictions Redis (ingestion-prod-redis) is a separate writable cluster from the primary — capture-analytics and ingestion workers both use it via EVENT_RESTRICTIONS_REDIS_URL.

Kafka topology

Ingestion workers interact with three Kafka clusters through four separate producer/consumer configurations:

  1. MSK ingestion cluster (consume side) — KAFKA_CONSUMER_METADATA_BROKER_LIST. Capture produces here; ingestion workers consume. Carries all ingestion-analytics-*, ingestion-errortracking-*, ingestion-heatmaps-*, ingestion-clientwarnings-* topics. Both envs have a dedicated cluster (prod-us: c21, 12 brokers; prod-eu: posthog-prod-eu-ingestion-2026-05-04). KMinion: kminion-msk-ingestion.

  2. WarpStream ingestion VC (output side) — KAFKA_WARPSTREAM_PRODUCER_METADATA_BROKER_LIST. Ingestion workers produce ALL ClickHouse-bound output here (clickhouse_events_json, clickhouse_person, clickhouse_groups, clickhouse_heatmap_events, clickhouse_ai_events_json, etc.). In-cluster WarpStream agents (warpstream-ingestion-v2) with multi-AZ pools; plaintext, no TLS. KMinion: kminion-warpstream-ingestion.

  3. MSK ingestion cluster (feedback/DLQ producer) — same physical cluster as item 1, different producer config via KAFKA_INGESTION_PRODUCER_METADATA_BROKER_LIST. Used for overflow/DLQ/async topics that route events BACK to the ingestion system.

  4. MSK (events/analytics) — the original events cluster. Still carries some legacy topics and ClickHouse consumer groups during migration. prod-us: posthog-prod-us-events-2026-03-08 (12 brokers, kafka.m7g.8xlarge); prod-eu: posthog-prod-eu-events-2025-10-16 (15 brokers). KMinion: kminion-msk-analytics.

WarpStream virtual clusters — each VC is a separate logical cluster backed by S3, with its own agent pool, KMinion instance, and topic namespace:

| VC name | Topics carried | KMinion instance |
|---|---|---|
| warpstream-ingestion-v2 | clickhouse_events_json, clickhouse_person, clickhouse_person_distinct_id, clickhouse_groups, clickhouse_ai_events_json, clickhouse_heatmap_events, clickhouse_app_metrics2, clickhouse_tophog, clickhouse_ingestion_warnings, distinct_id_usage_events_json, log_entries, team_event_partitioned_events_json | kminion-warpstream-ingestion |
| warpstream-replay-v2 | ingestion-sessionreplay-main-*, clickhouse_session_replay_events, clickhouse_session_replay_features | kminion-warpstream-replay |
| warpstream-logs | ingestion-logs, clickhouse_logs | kminion-warpstream-logs |
| warpstream-traces | ingestion-traces, clickhouse_traces | kminion-warpstream-traces |
| warpstream-shared | clickhouse_document_embeddings, error tracking fingerprint/issue topics, document_embeddings_input | kminion-warpstream-shared |
| warpstream-calculated-events | clickhouse_precalculated_person_properties, clickhouse_prefiltered_events, cohort_membership_changed (US only) | kminion-warpstream-calculated-events |
| warpstream-cyclotron | CDP topics (cdp_cyclotron_hog*, cdp_internal_events, etc.) | kminion-warpstream-cyclotron |
| warpstream-warehouse-pipelines | data_warehouse_source_webhooks, data_warehouse_sources_jobs (US only) | kminion-warpstream-warehouse-pipelines |

WarpStream agent metrics: warpstream_agent_* prefix (~10 metrics). Key: control_plane_operation_counter (by operation), file_cache_client_fetch_local_or_remote_counter (cache hit/miss). Labels: virtual_cluster_id, agent_group (default, general, multi-az). Dashboards: warpstream (Agent Overview), dbfj5c31spa1ogf (MSK vs WarpStream — Active Produce Topics). US-only personal dashboards (not synced to EU): 8e93b023-… (CH Consumer Lag), ws-coarse-lag-explore (Coarse Lag exploration).
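
The control-plane counter called out above can be turned into a per-VC rate. This assumes the full metric name is the warpstream_agent_ prefix plus the key listed — confirm with discovery before relying on it:

```promql
# Control-plane operation rate per WarpStream virtual cluster.
sum by (virtual_cluster_id, operation) (
  rate(warpstream_agent_control_plane_operation_counter[5m])
)
```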

Postgres topology

| DB | Aurora cluster (prod-us) | Ingestion PgBouncer |
|---|---|---|
| Main app DB | posthog-cloud-prod-us-east-1 (2x db.r8g.16xlarge) | ingestion-default-pgbouncer.posthog.svc.cluster.local |
| Persons DB | posthog-cloud-persons-prod-us-east-1 (3x db.r8g.24xlarge) | ingestion-events-pgbouncer.posthog.svc.cluster.local |

Postgres metrics via prometheus-postgres-exporter and prometheus-postgres-persons-exporter.

prod-eu: posthog-cloud-prod-eu-central-1 and posthog-cloud-persons-prod-eu-central-1.

ClickHouse topology

ClickHouse is the ultimate downstream consumer of events the ingestion pipeline produces. Ingestion workers never talk to CH directly — they publish to Kafka topics which CH consumes via its built-in Kafka engine and Kafka Connect (DuckLake). CH health directly impacts perceived ingestion quality: if CH falls behind on consumption, users see stale data.

Cluster roles (discovered via type label on ClickHouseMetrics_*):

| type label | Role | Notes |
|---|---|---|
| events | Main analytics events cluster | Consumes clickhouse_events_json |
| online | Online/fast queries cluster | Replicated from events |
| offline | Offline/batch queries cluster | Replicated from events |
| medium | Medium-sized tables | Persons, groups |
| small | Small/config tables | Infrequent writes |
| sessions | Session replay data | Consumes session recording topics |
| logs | Logs cluster | Consumes logs topics |
| logs-new-schema | Logs new schema migration | Migration target |
| ai-events | AI/LLM events | Consumes AI events topics |
| endpoints | API endpoints cluster | Lightweight |
| migrations | Migration-specific | Schema changes |
| aux / ops | Auxiliary/operations | Maintenance |
| batch-exports | Batch exports | prod-us has this; may not exist in prod-eu |
| test | Testing cluster | May not exist in all envs |

Most type label values are identical across prod-us and prod-eu. Minor differences like batch-exports or test may exist only in one env.

Two consumption paths from Kafka to ClickHouse:

ClickHouse now reads primarily from the WarpStream ingestion VC (where ingestion workers produce their output), not directly from MSK.

  1. ClickHouse Kafka Engine — native CH feature. Metrics prefixed ClickHouseProfileEvents_Kafka* (e.g., KafkaMessagesPolled, KafkaRowsRead, KafkaRowsRejected, KafkaCommitFailures). Consumer groups: clickhouse_events_json (prod-us), group1 / group1_recent (prod-eu). These groups exist on BOTH MSK analytics and WarpStream ingestion — use the correct KMinion instance to distinguish: kminion-warpstream-ingestion for WarpStream, kminion-msk-analytics for MSK.
  2. Kafka Connect — runs in kafka-connect namespace, uses DuckLake sink connector. Metrics prefixed kafka_connect_* and kafka_connect_ducklake_sink_task_metrics_*. Consumer groups: connect-events-ducklake*.

Key health signals for ingestion operators:

  • kminion_kafka_consumer_group_topic_lag{app_kubernetes_io_instance="kminion-warpstream-ingestion", group_id=~"clickhouse_events_json|group1|group1_recent", topic_name="clickhouse_events_json"} — lag between ingestion output and CH consumption on the WarpStream path (the primary path)
  • kminion_kafka_consumer_group_topic_lag_seconds with same group filter — same in seconds
  • ClickHouseProfileEvents_KafkaRowsRejected — rows CH couldn't parse/insert
  • ClickHouseProfileEvents_FailedInsertQuery — insert failures (schema issues, too many parts, etc.)
  • ClickHouseAsyncMetrics_MaxPartCountForPartition — rising part count = merge pressure
  • ClickHouseMetrics_ReadonlyReplica — replicas that fell behind and went read-only
  • ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay — max replication delay
  • posthog_celery_clickhouse_table_parts_count / _table_row_count — Django-side CH health monitors
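
The first two signals combine into a single freshness check on the primary path (selector taken verbatim from the list above):

```promql
# Seconds of lag between ingestion output and ClickHouse consumption.
max by (group_id) (
  kminion_kafka_consumer_group_topic_lag_seconds{
    app_kubernetes_io_instance="kminion-warpstream-ingestion",
    group_id=~"clickhouse_events_json|group1|group1_recent",
    topic_name="clickhouse_events_json"
  }
)
```

A sustained rise here means users see stale data even when the ingestion workers themselves are healthy.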

Pyroscope services

Both old (ingestion/{name}) and new ({namespace}/{name}) formats coexist in Pyroscope during the golden-chart migration. Prefer the new {namespace}/{name} format.

| Service name (new format) | Deployment |
|---|---|
| ingestion-analytics-main/ingestion-analytics-main | Main analytics |
| ingestion-analytics-overflow/ingestion-analytics-overflow | Overflow lane |
| ingestion-analytics-historical/ingestion-analytics-historical | Historical lane |
| ingestion-analytics-async/ingestion-analytics-async | Async lane |
| ingestion-analytics-turbo/ingestion-analytics-turbo | Turbo lane (prod-us only) |
| ingestion-heatmaps-main/ingestion-heatmaps-main | Heatmaps |
| ingestion-clientwarnings-main/ingestion-clientwarnings-main | Client warnings |
| ingestion-errortracking-main/ingestion-errortracking-main | Error tracking |
| ingestion-errortracking-overflow/ingestion-errortracking-overflow | Error tracking overflow |
| logs-ingestion/logs-ingestion | Logs ingestion |
| traces-ingestion/traces-ingestion | Traces ingestion |
| recordings/recordings-blob-ingestion-v2 | Session replay |
| recordings/recordings-blob-ingestion-v2-overflow | Session replay overflow |

Profile types: process_cpu:cpu:nanoseconds:cpu:nanoseconds, wall:wall:nanoseconds:wall:nanoseconds, memory:inuse_space:bytes:inuse_space:bytes, memory:inuse_objects:count:inuse_space:bytes.

Grafana dashboards

| UID | Title | Focus |
|---|---|---|
| ingestion-general | Ingestion - General | Cross-service overview, E2E lag, topic flow |
| ingestion-analytics | Ingestion - Analytics | Per-pipeline analytics breakdown |
| ingestion-health | Ingestion - Health | Health overview across all ingestion services |
| ingestion-pipelines | Ingestion - Pipelines | Per-lane pipeline step breakdown |
| ingestion-pipeline-performance | Ingestion - Pipeline Performance | Step latency, batch utilization |
| ingestion-reliability | Ingestion - Reliability | Error rates, DLQ, drop causes |
| ingestion-autoscaling | Ingestion - Autoscaling | HPA/KEDA scaling |
| ingestion-person-processing | Ingestion -- Person Processing | Person store, merge, cache |
| ingestion-group-processing | Ingestion -- Group Processing | Group store |
| ingestion-session-recordings | Session Replay -- Ingestion | Replay blob pipeline |
| dffkdlee8ub5s0a | Ingestion - Capture Golden | Capture-specific ingestion metrics (golden chart) |
| cesaxfujkyl8gf | Ingestion - Deduplication | Event deduplication pipeline |
| pl-ingestion-slas | Ingestion — SLIs / SLOs / SLAs | Dynamic SLI/SLO view from ingestion_sli_* metrics |
| warpstream | Warpstream Agent Overview | Agent health, control plane ops, file cache |
| dbfj5c31spa1ogf | MSK vs Warpstream — Active Produce Topics | Side-by-side produce volume comparison |
| 8e93b023-a544-4a3b-8fac-123459d4eb84 | WarpStream: ClickHouse Consumer Lag | CH consumer lag on WarpStream topics (US only) |
| ws-coarse-lag-explore | WarpStream Coarse Lag — Explore | Agent-reported lag (US only, personal dashboard) |
| ceef2kuqw66tca | Ingestion copy for warpstream | Legacy WarpStream migration view |
| personhog-service | Personhog service | PersonHog latency decomposition |
| dbfgkwxs3gw8owd | KMinion Consumer Group Lag | Consumer lag by group (including CH groups) |
| logs | Logs (product) | Logs ingestion |
| vm-clickhouse-cluster-overview | ClickHouse (cluster overview) | QPS, memory, disk, replication, parts, merges |
| 8aa35a4a-091a-4645-ac8f-ae46901f0060 | ClickHouse Ingestion Layer - Resource Usage | K8s resources for chi-ingestion-* pods |
| ddpxkllwxg268e | ClickHouse - Kafka consumption | CH Kafka engine consumption stats |
| clickhouse-keeper | ClickHouse Keeper | ZooKeeper replacement health |
| ef7h2todfg4xsd | New ClickHouse Cluster Merge Overview | Merge throughput |
| cdzv7o1635n9ca | Kafka Connect | Kafka Connect tasks, lag, DuckLake sink |
| deoz13wy08wsga | ClickHouse - Disk capacity (EU ONLY) | EU-specific disk dashboard |

Discovery workflows

Prometheus / VictoriaMetrics

  1. list_prometheus_metric_names with datasourceUid: "victoriametrics", regex: "ingestion_.*" to enumerate app metrics
  2. Pick a metric, then list_prometheus_label_names scoped to it — see available dimensions
  3. list_prometheus_label_values — discover actual values for a label (e.g. labelName: "cause" on ingestion_event_dropped_total)
  4. query_prometheus with PromQL — always scope by app or ingestion_pipeline and set a time range

Repeat with other prefixes: consumed_batch_*, person_*, personhog_*, overflow_redirect_*, ClickHouseMetrics_*, kafka_connect_*, etc.
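
Putting step 4 together with the example from step 3 (metric and label names as given there):

```promql
# Drop rate by cause for one deployment over the last 15 minutes.
sum by (cause) (
  rate(ingestion_event_dropped_total{app="ingestion-analytics-main"}[15m])
)
```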

Loki (logs)

  1. list_loki_label_names with datasourceUid: "P44D702D3E93867EC"
  2. list_loki_label_values for app — find ingestion containers (values like ingestion-analytics-main, logs-ingestion, etc.)
  3. query_loki_logs — e.g. {app=~"ingestion-.*"} |= "error" or {namespace="clickhouse"} |= "Exception"
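
A common follow-up to the raw query in step 3 is an aggregated error rate, e.g.:

```logql
# Error-log volume per ingestion app in 5-minute windows.
sum by (app) (
  count_over_time({app=~"ingestion-.*"} |= "error" [5m])
)
```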

Pyroscope (profiling)

  1. list_pyroscope_profile_types with data_source_uid: "pyroscope"
  2. fetch_pyroscope_profile with matchers: '{service_name="ingestion-analytics-main/ingestion-analytics-main"}', profile_type: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"

Dashboards

  1. search_dashboards — query "ingestion" or "clickhouse" or a specific deployment name
  2. get_dashboard_by_uid — use a known UID (e.g. "ingestion-general") to get panel details
  3. get_dashboard_panel_queries — extract PromQL from existing panels

Redis / ElastiCache

  • Ingestion-side: discover overflow_redirect_redis_*, cookieless_redis_* metrics in VictoriaMetrics
  • Infrastructure: CloudWatch datasource P034F075C744B399F for ElastiCache (CPU, memory, connections, latency). Cluster IDs: ingestion-prod-redis, posthog-solo (prod-us primary), ingestion-duplicates-prod-redis, cookieless-prod-redis
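
A sketch of an infrastructure-side query. The metric name below is a guess based on common CloudWatch-exporter naming conventions, not confirmed for this setup — enumerate real names against the CloudWatch datasource first:

```promql
# Hypothetical: ElastiCache engine CPU for the ingestion Redis cluster.
aws_elasticache_engine_cpuutilization_average{cache_cluster_id=~"ingestion-prod-redis.*"}
```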

Postgres / Aurora

  • Ingestion-side: discover postgres_error_total, person_*, group_* metrics
  • Infrastructure: prometheus-postgres-exporter and prometheus-postgres-persons-exporter metrics
  • CloudWatch RDS: datasource P034F075C744B399F with cluster IDs posthog-cloud-prod-us-east-1 / posthog-cloud-persons-prod-us-east-1

ClickHouse

  • Cluster health: list_prometheus_metric_names with regex "ClickHouseMetrics_.*" or "ClickHouseProfileEvents_.*". Scope with the type label for cluster role (e.g. type="events")
  • Kafka engine health: regex "ClickHouseProfileEvents_Kafka.*"
  • Kafka Connect: regex "kafka_connect_.*" — scope with namespace="kafka-connect"
  • Consumer lag (the bridge): kminion_kafka_consumer_group_topic_lag with group_id=~"clickhouse_events_json|group1|group1_recent|connect-events-ducklake.*" and topic_name="clickhouse_events_json". Important: scope with app_kubernetes_io_instance="kminion-warpstream-ingestion" for the WarpStream path (primary) or "kminion-msk-analytics" for the MSK path.
  • Logs: {namespace="clickhouse"} |= "Exception" or {namespace="kafka-connect"}
  • Dashboards: vm-clickhouse-cluster-overview, 8aa35a4a-091a-4645-ac8f-ae46901f0060, cdzv7o1635n9ca, 8e93b023-a544-4a3b-8fac-123459d4eb84 (WarpStream CH consumer lag)

Key metric domains

These are the categories to watch; discover the specific metrics live using the prefixes above.

Kafka consumer health — batch duration, messages consumed per batch, consumer group assignment/rebalance events, consumer lag (via KMinion). Metrics: consumed_batch_*, kafka_consumer_*, kminion_kafka_consumer_group_topic_lag* scoped by group_id.

Pipeline processing — step-level latency and error rates, pipeline result distribution (ingested, filtered, dropped, DLQ'd). Metrics: events_pipeline_step_ms, events_pipeline_step_error_total, ingestion_pipeline_results by result.
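
The result distribution mentioned above can be watched directly (this assumes ingestion_pipeline_results is a counter — check the exact name and any suffix live):

```promql
# Share of each pipeline outcome (ingested, filtered, dropped, DLQ'd).
sum by (result) (rate(ingestion_pipeline_results[5m]))
```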

Person/group stores — person flush latency, cache hit rates, Postgres write latency, merge failures, properties size. Metrics: person_*, group_*, personhog_*.

Outputs — Kafka production to ClickHouse-bound topics. Message size, latency, errors. Metrics: ingestion_outputs_* by topic.

Overflow routing — stateful overflow decisions, Redis operations for overflow state. Metrics: overflow_redirect_* by type, result, decision.

ClickHouse downstream health — CH cluster QPS, memory, disk, merge pressure, replication lag, Kafka engine consumption (rows read/rejected/failed), Kafka Connect task health and consumer lag. This tells you whether events are actually making it to the query layer. Metrics: ClickHouseMetrics_*, ClickHouseProfileEvents_*, ClickHouseAsyncMetrics_* scoped by type; kafka_connect_* scoped by namespace.

K8s resources — container_* and kube_* for CPU, memory, restarts, HPA state. Scope: namespace=~"ingestion-.*", pod=~"ingestion-.*" (or namespace="clickhouse", pod=~"chi-ingestion-.*" for CH ingestion pods).
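
For example, CPU usage per ingestion namespace via the standard cAdvisor counter:

```promql
# Container CPU (cores) summed per ingestion deployment namespace.
sum by (namespace) (
  rate(container_cpu_usage_seconds_total{namespace=~"ingestion-.*", container=~"ingestion-.*"}[5m])
)
```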

Investigation playbooks

See references/investigation-playbooks.md for step-by-step workflows covering: health checks, event drops, latency, consumer lag, person processing, Kafka/MSK issues, Redis, Postgres, session replay, ClickHouse downstream health, and cross-environment comparison.
