diagnosing-failed-warehouse-syncs (author: posthog)

npx skills add https://github.com/posthog/ai-plugin --skill diagnosing-failed-warehouse-syncs

Diagnosing failed data warehouse syncs

Work top-down when a data warehouse source or table is failing, stuck, or producing bad data: source → schema → recovery action. Do not jump straight to "resync from scratch" — that discards synced data and restarts from zero, which is rarely the right first step.

When to use this skill

  • The user reports a specific sync is failing (e.g. "my Stripe source is red")
  • A table has been in Running state far longer than expected
  • Data in a warehouse table is stale, missing rows, or looks corrupt
  • Latest rows aren't appearing despite the schema being marked Completed
  • The user is choosing between cancel / reload / resync / delete-data and isn't sure which
  • Another skill — typically auditing-warehouse-data-health — has surfaced a failing source or schema and the user wants to dig into it

Both entry points (user-reported and audit-handoff) use the same workflow; the audit just means you already know which item to diagnose and can skip Step 1's discovery search.

Available tools

| Tool | Purpose |
| --- | --- |
| external-data-sources-list | List all sources with connection status and latest error |
| external-data-sources-retrieve | Full details for one source including all its schemas |
| external-data-schemas-list | All table schemas across all sources, with per-table status + latest_error |
| external-data-schemas-retrieve | Full details for one schema including sync_type_config |
| external-data-schemas-cancel | Cancel a sync currently in Running state |
| external-data-schemas-reload | Trigger a sync using the configured sync method (respects incremental) |
| external-data-schemas-resync | Full resync — wipes synced data and restarts. Destructive |
| external-data-schemas-delete-data | Delete the synced table but keep the schema entry |
| external-data-schemas-partial-update | Change sync_type / incremental_field / cdc_table_mode |
| external-data-sources-partial-update | Update a source's credentials (job_inputs) after rotation |
| external-data-sources-reload | Retrigger syncs for every enabled schema on a source |
| external-data-sources-refresh-schemas | Re-fetch the source's table list to pick up new tables |
| external-data-sources-check-cdc-prerequisites-create | Verify Postgres CDC setup for a source |
| external-data-schemas-incremental-fields-create | Refresh candidate incremental fields when the source schema has changed |
| external-data-sources-webhook-info-retrieve | Check webhook registration state and external service status |
| external-data-sources-create-webhook-create | Re-register a webhook that was lost or never registered |
| external-data-sources-update-webhook-inputs-create | Update the signing secret after rotation on the source side |
| external-data-sources-delete-webhook-create | Remove a broken webhook before re-registering |

Workflow

Step 1 — Locate the failing item

If the user named a source, go straight to external-data-sources-retrieve. Otherwise start with external-data-sources-list and external-data-schemas-list to find what's red.

Two kinds of failure:

  • Source-level (ExternalDataSource.status = "Error"): the connection itself is broken — credentials expired, host unreachable, account disabled. Affects every table.
  • Schema-level — the source connects fine but one or more tables are failing. In the serialized API response from external-data-schemas-list, look for status values "Failed", "Billing limits", or "Billing limits too low". (The underlying model enum values are BillingLimitReached and BillingLimitTooLow, but the serializer rewrites them — match on both the human-readable and enum forms to be safe.)

A source can look Completed at the top level while one of its schemas is Failed — always check both.
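The two-level check above can be sketched as a small triage helper. This is illustrative only: `find_failing_items` is a hypothetical function, and the inputs are assumed to be the list payloads returned by external-data-sources-list and external-data-schemas-list, with the status strings described in this section.

```python
def find_failing_items(sources, schemas):
    """Return (broken_sources, failing_schemas) from the two list endpoints.

    Checks both levels: a source can report Completed while one of its
    schemas is Failed, so neither list is sufficient on its own.
    """
    # Source-level failure: the connection itself is broken.
    broken_sources = [s for s in sources if s.get("status") == "Error"]

    # Schema-level failure: match both serializer and enum status forms.
    failing_statuses = {
        "Failed",
        "Billing limits", "BillingLimitReached",
        "Billing limits too low", "BillingLimitTooLow",
    }
    failing_schemas = [s for s in schemas if s.get("status") in failing_statuses]
    return broken_sources, failing_schemas
```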

Step 2 — Classify the schema status

From external-data-schemas-list, each schema has a status:

| Status | Meaning | Usually means |
| --- | --- | --- |
| Running | Sync currently executing | Normal, unless stuck for hours |
| Completed | Last sync finished successfully | Healthy |
| Failed | Last sync errored — see latest_error | Needs diagnosis |
| Paused | User disabled sync (should_sync = false) | Intentional |
| Billing limits (serializer) / BillingLimitReached (enum) | Team hit its warehouse row quota | Billing issue, not a technical failure |
| Billing limits too low (serializer) / BillingLimitTooLow (enum) | Team has insufficient credit | Billing issue |

Always check last_synced_at alongside status. A schema in Running with last_synced_at from 12 hours ago is almost certainly stuck, even though the status isn't Failed.
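As a sketch, the status-plus-timestamp check looks like this. The `classify_schema` helper and the three-hour stuck threshold are assumptions for illustration; tune the threshold to your sync cadence.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: a Running schema whose last sync finished this long
# ago is treated as stuck. Adjust per team.
STUCK_AFTER = timedelta(hours=3)

def classify_schema(status, last_synced_at, now=None):
    """Map a schema's status and last_synced_at to a triage label."""
    now = now or datetime.now(timezone.utc)
    if status == "Running":
        # Running is only healthy if the sync is recent.
        if last_synced_at and now - last_synced_at > STUCK_AFTER:
            return "stuck"            # cancel, then reload (case F below)
        return "running"
    if status == "Completed":
        return "healthy"
    if status == "Failed":
        return "needs-diagnosis"      # read latest_error (Step 3)
    if status == "Paused":
        return "intentional"
    return "billing"                  # both billing-limit variants
```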

Step 3 — Interpret latest_error

Map the latest_error string to a root cause. Common patterns:

| Error substring | Root cause | Fix |
| --- | --- | --- |
| authentication failed, 401, 403, invalid credentials | Credentials expired or rotated | User rotates creds, then external-data-sources-partial-update with new job_inputs |
| Could not establish session to SSH gateway | SSH tunnel misconfigured or remote host down | User checks SSH host/key/bastion |
| Primary key required for incremental syncs | Table has no PK and sync_type is incremental/cdc | Either add PK in source, or switch schema to full_refresh |
| primary keys for this table are not unique | Declared PK columns aren't actually unique | Pick different PK columns via partial-update |
| Integration matching query does not exist | Source's saved integration was deleted | Recreate the source |
| column "X" does not exist, does not have a column named | Schema drift — incremental field or tracked column removed | Use incremental-fields-create to re-detect, then partial-update |
| relation "..." does not exist | Source table was dropped/renamed | Remove schema or rename source-side |
| SSL, connection refused, timeout, unreachable | Network / firewall / host reachability | User side — check host/port/allowlist |
| replication slot, publication, wal_level | CDC prerequisites broken | Run check-cdc-prerequisites-create; may need slot recreate |
| Schema exceeds row limit, billing | Billing limit | Upgrade plan or disable the schema |

If latest_error is null but the schema is Failed, retrieve the schema directly — the error may only be populated on the detail view.
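A minimal sketch of the substring-to-root-cause mapping above. The patterns come from the table; the matcher itself (`root_cause`, first match wins, case-insensitive) is an illustration, so order more specific patterns before generic ones.

```python
# (substrings, root cause) pairs, checked in order; first match wins.
ERROR_PATTERNS = [
    (("authentication failed", "401", "403", "invalid credentials"), "credentials"),
    (("could not establish session to ssh gateway",), "ssh-tunnel"),
    (("primary key required for incremental",), "missing-pk"),
    (("primary keys for this table are not unique",), "non-unique-pk"),
    (("integration matching query does not exist",), "deleted-integration"),
    (('column "', "does not have a column named"), "schema-drift"),
    (('relation "',), "dropped-table"),
    (("replication slot", "publication", "wal_level"), "cdc-broken"),
    (("ssl", "connection refused", "timeout", "unreachable"), "network"),
    (("row limit", "billing"), "billing"),
]

def root_cause(latest_error):
    """Map a latest_error string to a root-cause label, or 'unknown'."""
    if not latest_error:
        return "unknown"  # retrieve the schema's detail view first
    err = latest_error.lower()
    for substrings, cause in ERROR_PATTERNS:
        if any(s in err for s in substrings):
            return cause
    return "unknown"
```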

Step 4 — Pick the recovery action

The recovery action depends on root cause, not just status. Match the user's situation to one of these:

A. Transient failure (network blip, temporary API outage)

  • Data synced so far is still valid.
  • Action: external-data-schemas-reload to retry using the configured sync method.
  • Incremental/append syncs pick up where they left off.

B. Credentials expired or rotated

  • Every schema under the source is failing with an auth error.
  • Action: user rotates creds → external-data-sources-partial-update with the new job_inputs. Syncs resume automatically once the source status recovers; if not, trigger them manually with external-data-sources-reload.

C. Schema drift — column renamed, dropped, or type changed

  • Error mentions a specific column that no longer matches the source.
  • Action: external-data-schemas-incremental-fields-create to get the current fields, then external-data-schemas-partial-update with the corrected incremental_field / incremental_field_type / primary_key_columns. Usually no need to wipe data.

C2. Added / renamed tables in the source database

  • User mentions "I added a new table to Postgres but it isn't appearing", or a source table was renamed.
  • Action: external-data-sources-refresh-schemas to pick up the new table list, then configure sync on any new schemas.

D. Incremental state is wrong (duplicates, missing rows, data looks corrupt)

  • Schema status may be Completed — this isn't a "failure" per se, it's bad data.
  • Action: external-data-schemas-resync to wipe synced data and re-import from source. Destructive but often the right call for data-quality issues.

E. CDC pipeline broken on Postgres

  • Error mentions replication slot, publication, WAL.
  • Action: external-data-sources-check-cdc-prerequisites-create to enumerate what's broken, fix on the Postgres side, then external-data-schemas-reload. If the WAL position was lost, a resync is sometimes unavoidable.

F. Sync is stuck in Running for hours

  • Check last_synced_at. If it's hours old and still Running, the job is orphaned.
  • Action: external-data-schemas-cancel to stop it, then external-data-schemas-reload.

G. Table data is corrupt but you want to keep the schema config

  • Action: external-data-schemas-delete-data to drop the synced table but preserve the schema entry. Next reload re-imports from scratch without losing the configured sync_type/incremental_field.

H. Billing limit

  • Action isn't technical. Explain the limit, recommend upgrading the plan or disabling lower-priority schemas so the important ones fit under quota.

I. Webhook-backed schema isn't receiving events

  • Symptoms: schema has sync_type: "webhook", initial bulk sync finished, but no new rows arrive despite activity on the source side. Status may still read Completed because the bulk sync (the safety-net cadence) is succeeding — the problem is the push path.
  • Action:
    1. external-data-sources-webhook-info-retrieve({source_id}).
    2. If exists: false → the webhook was never registered, or was deleted. Call create-webhook-create to register it.
    3. If exists: true but external_status.error is set → typically "API key doesn't have permission to read webhooks" or similar. The webhook may have been deleted on the source's dashboard. Re-create it.
    4. If external_status.status isn't "enabled" → the source disabled the webhook (usually after repeated delivery failures). Re-enable or re-register.
    5. If payloads are arriving but failing signature verification → the signing secret was rotated. Get the new one from the source's dashboard and call update-webhook-inputs-create({source_id}, {inputs: {signing_secret: "..."}}).
  • After any fix, check the source's webhook logs (on their side) to confirm PostHog is now responding 2xx.
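The webhook triage steps above can be sketched as a decision function. The response shape (`exists`, `external_status.status`, `external_status.error`) is assumed from the description in this section, and `webhook_action` plus the `signature_failures` flag are illustrative names, not real API surface.

```python
def webhook_action(info, signature_failures=False):
    """Map a webhook-info-retrieve response to the next recovery step."""
    if not info.get("exists"):
        return "create-webhook"            # step 2: never registered or deleted
    ext = info.get("external_status") or {}
    if ext.get("error"):
        return "recreate-webhook"          # step 3: e.g. deleted on the source dashboard
    if ext.get("status") != "enabled":
        return "re-enable-or-reregister"   # step 4: source disabled it
    if signature_failures:
        return "update-signing-secret"     # step 5: secret was rotated
    return "check-source-delivery-logs"    # nothing obviously wrong on our side
```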

Step 5 — Confirm before destructive actions

Three recovery actions discard data and cannot be undone:

  • external-data-schemas-resync — wipes synced rows, re-imports from scratch
  • external-data-schemas-delete-data — drops the synced table
  • external-data-sources-destroy — deletes the source and all its schemas

Always present the fix you're proposing and wait for explicit approval before calling any of these. "Just try resync" is rarely the right default.
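The approval gate can be sketched as a simple guard. The three tool names are the real ones listed above; the `guard` function itself is illustrative.

```python
# Tools that discard data and cannot be undone.
DESTRUCTIVE = {
    "external-data-schemas-resync",
    "external-data-schemas-delete-data",
    "external-data-sources-destroy",
}

def guard(tool_name, user_approved):
    """Refuse destructive tool calls that lack explicit user approval."""
    if tool_name in DESTRUCTIVE and not user_approved:
        raise PermissionError(f"{tool_name} discards data; ask the user first")
    return True
```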

Example interaction

User: "Our Stripe sync is broken, can you check?"

Agent:
- external-data-sources-list → find Stripe source, status = Error
- external-data-sources-retrieve({id}) → latest_error: "authentication failed: 401 Unauthorized"
- Report: "Your Stripe source's API key is no longer authenticating.
   All 8 tables under it are failing with 401s. This usually means the key was rotated on the Stripe side.

   To fix:
   1. Grab a fresh restricted API key from the Stripe dashboard.
   2. I'll update the source with the new key.
   3. Syncs will resume automatically — no data loss.

   Paste the new key here when ready."

User: "sk_live_..."

Agent:
- external-data-sources-partial-update({id}, {job_inputs: {stripe_secret_key: "sk_live_..."}})
- external-data-sources-reload({id}) to trigger retry
- Report: "Updated and re-triggered. Check back in a few minutes — latest_error should clear."

Important notes

  • Source status overrides schema status for diagnosis. If the source is Error, nothing under it will work; fixing the source usually fixes all its schemas at once.
  • Running isn't always healthy. Cross-check last_synced_at. A sync stuck in Running needs cancel then reload, not resync.
  • Resync is destructive. It discards synced data. Only recommend it when the data itself is bad (duplicates, missing rows, corrupt) or when recovery genuinely requires a clean slate (lost WAL position on CDC). Never use it as a first-try for transient errors.
  • Delete-data preserves config. When a user says "I just want to start this table over from scratch", prefer delete-data + reload over resync + new schema entry — it keeps the configured sync_type / incremental_field / PK setup.
  • Billing limits aren't technical failures. Don't try to retry or reconfigure your way out. Route to billing.
  • Webhook failures can hide behind a green status. A webhook-type schema whose bulk fallback sync succeeded looks Completed even when the push channel is broken. When users say "my data is hours behind" on a webhook schema, call webhook-info-retrieve before looking at schema status. Webhook issues don't surface on external-data-schemas-list.
