triaging-visual-review-runs
Visual Review is PostHog's screenshot-regression product: CI captures Storybook and Playwright screenshots, diffs them against committed baseline hashes, and blocks the PR until a human approves the visible changes. A PR with visual changes carries a visual-review GitHub status check that stays red until each diffed snapshot in the VR UI is approved or…
npx skills add https://github.com/posthog/ai-plugin --skill triaging-visual-review-runs
Triaging visual review runs
Visual Review is PostHog's screenshot-regression product: CI captures storybook + playwright screenshots,
diffs them against committed baseline hashes, and gates the PR until a human approves the visible changes.
A PR with visual changes carries a visual-review GitHub status check that stays red until each diffed
snapshot is approved or tolerated in the VR UI.
This skill teaches an agent how to answer the questions a human reviewer would actually ask, by chaining
the VR MCP tools — instead of reaching for gh pr view and tab-hopping to the VR web UI. The read tools
cover status / scope / history / triage. Two are reversible DB-only triage marks (approve-create,
tolerate-create); one ships the change (finalize-create) — it commits the baseline and greens the gate,
and only that one needs explicit per-run human confirmation.
When this skill applies
Trigger this skill on any of:
- A PR number, branch name, or commit SHA paired with words like visual review, VR, snapshot, screenshot, storybook diff, playwright snapshot, baseline, approve, tolerated, quarantine.
- Questions about why a PR is blocked, what visually changed, or whether a diff is real.
- "Is my run done?" / "What's left to review?" / "Has this story flaked recently?"
- A failing `visual-review` GitHub check or a PR comment from the `posthog-bot` mentioning visual review.
When the user asks for the rendered diff image itself, the VR web UI is faster — direct them there. This skill is for everything around the diff: status, scope, history, triage.
Tools
Read tools (safe to call freely):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-list` | List runs, filter by `pr_number` / `commit_sha` / `branch` / `review_state`. Start here. |
| `posthog:visual-review-runs-retrieve` | Full detail for a single run (status, summary counts, supersession). |
| `posthog:visual-review-runs-snapshots-list` | Per-snapshot results inside a run: identifier, result, diff %, classification, baseline + current artifact URLs. Quarantined snapshots are excluded by default (see `quarantined_count`); pass `include_quarantined=true` to see them. |
| `posthog:visual-review-runs-snapshot-history-list` | A single story's last N runs across master/PRs (the flake check). |
| `posthog:visual-review-runs-counts-retrieve` | Aggregate counts for queue triage (how many runs in `needs_review`, etc.). |
| `posthog:visual-review-runs-tolerated-hashes-list` | Hashes the team has explicitly accepted as "known flake / acceptable variation". |
| `posthog:visual-review-repos-list` | Repos (one per GitHub repo). Usually only one matters; useful for filtering. |
| `posthog:visual-review-repos-retrieve` | Repo metadata: baseline file paths, PR-comment configuration. |
Triage tools (reversible, DB-only — they record a review decision but do NOT change the baseline or the gate):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-approve-create` | Mark `changed` / `new` snapshots reviewed (approved) in the DB. Does NOT commit or green the gate; ship via finalize. |
| `posthog:visual-review-runs-tolerate-create` | Mark a single `changed` snapshot as a known tolerated alternate. Does NOT change the baseline; use for benign variants. |
Ship tool (irreversible, outward-facing — requires explicit per-run human confirmation; see the gate):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-finalize-create` | Commit the approved baseline to the PR branch and green the GitHub `visual-review` check. This ships the change. |
Mark-reviewed call shape (approve-create):
- `id` (required): the run UUID. It's the route parameter, so the call fails without it.
- `snapshots: [{identifier, new_hash}]`: `new_hash` is the `content_hash` of each snapshot's `current_artifact`. This only records the review in the DB; nothing is committed and the gate stays red until you finalize.
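The approve-create call shape above can be sketched as a small argument builder. This is a minimal illustration, not the tool's documented schema: the helper name `build_approve_args` is hypothetical, and it assumes each snapshot from the snapshots-list result is a dict exposing `identifier`, `result`, and `current_artifact.content_hash`.

```python
from __future__ import annotations

from typing import Any


def build_approve_args(run_id: str, snapshots: list[dict[str, Any]]) -> dict[str, Any]:
    """Build approve-create arguments for the changed/new snapshots of one run.

    Assumed snapshot shape (hypothetical, mirroring the fields named above):
    {"identifier": str, "result": str, "current_artifact": {"content_hash": str}}
    """
    if not run_id:
        # The run UUID is the route parameter, so the call cannot work without it.
        raise ValueError("run id is required")
    to_mark = [
        {
            "identifier": snap["identifier"],
            # new_hash is the content_hash of the snapshot's current artifact.
            "new_hash": snap["current_artifact"]["content_hash"],
        }
        for snap in snapshots
        if snap.get("result") in ("changed", "new")
    ]
    return {"id": run_id, "snapshots": to_mark}
```

The result is only the argument payload; sending it still records a DB-only review mark and leaves the gate red.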
Toleration call shape — both fields are required:
- `id` (required): the run UUID. It's the route parameter, so the call fails without it.
- `snapshot_id` (required): the UUID of the individual snapshot to tolerate (from `visual-review-runs-snapshots-list`). This identifies which snapshot inside the run; it does not replace the run `id`.
Finalize call shape (finalize-create) — the all-or-nothing ship action:
- `id` (required): the run UUID.
- `approve_all: true`: approve every still-pending `changed`/`new` snapshot before finalizing (tolerated ones are left alone). Use when you've verified every remaining diff is intended.
- Omit `approve_all` (default false) to finalize a run you've already reviewed snapshot-by-snapshot. Finalize is all-or-nothing: it fails with `409 not_fully_resolved` (and lists what's left) unless every changed/new snapshot is approved, tolerated, or quarantined.
- It commits exactly the snapshots approved in the DB; tolerated snapshots keep their baseline and are never overwritten. When a baseline commit is pushed, its SHA comes back on the run's `metadata.baseline_commit_sha`. It's absent when nothing needed committing (everything resolved by toleration/quarantine; the gate still greens) or when the commit was skipped: no PR, or a `409 sha_mismatch` because the PR has newer commits (that one leaves the gate red; re-run CI on the latest commit and finalize again).
If finalize fails with `409 stale_run`, the run has been superseded: call `visual-review-runs-list { pr_number }` to find the newest run and finalize that one instead. A successful finalize often kicks off a fresh CI run, which is normal.
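The finalize outcomes described above can be summarised as a lookup from result to follow-up. The sketch below is illustrative only: the function name is hypothetical, and it assumes the tool surfaces an HTTP-style status code plus one of the error codes named in this section; how those are actually exposed in the tool output is not specified here.

```python
def next_step_after_finalize(status_code: int, error_code: str = "") -> str:
    """Map a finalize-create outcome to the follow-up described in this section."""
    if status_code < 400:
        # Success: the gate greens, and a fresh CI run may start (normal).
        return "done"
    if status_code == 409 and error_code == "not_fully_resolved":
        # Some changed/new snapshots are still pending.
        return "resolve the listed snapshots, then finalize again"
    if status_code == 409 and error_code == "sha_mismatch":
        # The PR has newer commits; the gate stays red.
        return "re-run CI on the latest commit, then finalize again"
    if status_code == 409 and error_code == "stale_run":
        # A newer run superseded this one.
        return "list runs for the PR and finalize the newest one"
    return "unexpected failure: report it to the user"
```

Every "finalize again" branch is still subject to explicit human confirmation for the run being finalized.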
The finalize gate
Finalize is the one irreversible, outward-facing action in this skill: it rewrites the baseline committed to the PR and greens the merge gate. Treat it like pushing to someone's branch — never automatic.
Before any finalize-create call, all of these must hold:
- You verified the diffs. You pulled the current (and, for `changed`, baseline) PNGs and looked at them, ran the flake check on anything suspect, and reached a per-snapshot verdict. Metadata alone is never enough.
- You presented the verdict and waited. Show the user, per snapshot, what changed and your recommendation, then stop.
- The user explicitly approved this run. A broad "get the gate green" / "fix the PR" is permission to investigate and recommend — NOT to finalize. When the task implies finalizing but the human hasn't said it for this specific run, ask.
approve-create and tolerate-create are reversible triage and don't need this gate — but they don't ship anything
either. The moment you're about to finalize-create and can't point to a specific human "yes" for this run, stop and ask.
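One way to keep the three preconditions honest is to track them explicitly per run. The sketch below is an assumption-laden illustration (the `FinalizeGate` class and its field names are hypothetical, not part of the VR tools): finalize is allowed only when the diffs were verified, the verdict was presented, and the user's "yes" names this exact run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinalizeGate:
    """Per-run checklist mirroring the three conditions above (hypothetical helper)."""

    run_id: str
    diffs_verified: bool = False           # PNGs pulled and inspected, flake check run
    verdict_presented: bool = False        # per-snapshot verdict shown to the user
    approved_run_id: Optional[str] = None  # the run the user explicitly said "yes" to

    def may_finalize(self) -> bool:
        # Approval for a different (e.g. superseded) run does not carry over.
        return (
            self.diffs_verified
            and self.verdict_presented
            and self.approved_run_id == self.run_id
        )
```

A broad "get the gate green" request would leave `approved_run_id` unset, so `may_finalize()` stays false until the user confirms the specific run.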
Vocabulary cheat sheet
These appear in tool output and matter for interpretation:
- Run `review_state`: `needs_review` (open, awaiting human), `clean` (zero diffs), `processing` (CI still uploading), `stale` (a newer run on the same PR has superseded this one; check `superseded_by_id`).
- Run `run_type`: `storybook` (component snapshots) or `playwright` (full-page e2e snapshots).
- Snapshot `result`: `unchanged`, `changed` (real diff), `new` (no baseline yet), `removed`.
- Snapshot `classification_reason`: `tolerated_hash` (matches a known-tolerated hash, no action needed), `below_threshold` (under the noise floor), `exact` (byte-identical), `""` (real diff requiring review).
- Snapshot `review_state`: `pending` or `approved`.
- Run `summary`: `total / changed / new / removed / unchanged / unresolved / tolerated_matched`; `unresolved` is what's actually blocking review.
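Read together, these fields suggest a simple filter for "which snapshots still need a human". The sketch below is one interpretation of the vocabulary, not the server's actual rule: the function name is hypothetical, and it assumes a snapshot dict carries `result`, `classification_reason`, and `review_state` exactly as listed.

```python
def needs_human_review(snapshot: dict) -> bool:
    """Return True when a snapshot looks like a real, still-pending diff.

    Interpretation (an assumption): only changed/new results count, any
    non-empty auto-classification means no action is needed, and an
    approved review_state means the work is already done.
    """
    if snapshot.get("result") not in ("changed", "new"):
        return False
    if snapshot.get("classification_reason") in ("tolerated_hash", "below_threshold", "exact"):
        return False
    return snapshot.get("review_state") == "pending"
```

The authoritative count remains the run's `summary.unresolved`; a filter like this is only useful for deciding which snapshots to inspect first.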
Workflows
"What's the VR status of this PR?"
The single most common job. Map a PR number to its run state in two calls.
- `posthog:visual-review-runs-list { pr_number: <n>, limit: 5 }`: sort by `created_at` desc, take the latest non-stale one.
- If the run has `summary.changed > 0` or `summary.unresolved > 0`, drill in: `posthog:visual-review-runs-snapshots-list { id: <run_id> }` and report the `changed` snapshots.
Report back: PR number, run UUID, review_state, summary counts, and the _posthogUrl deep link so the
user can click straight to the diff viewer.
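The run-selection step above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, runs are plain dicts with `review_state`, `created_at`, and `summary`, and `created_at` values are ISO-8601 strings in one consistent format so that string order matches time order.

```python
from __future__ import annotations

from typing import Any, Optional


def latest_active_run(runs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Pick the newest run that has not been superseded (review_state != 'stale')."""
    active = [run for run in runs if run.get("review_state") != "stale"]
    # Assumes created_at strings sort chronologically (same ISO-8601 format).
    return max(active, key=lambda run: run["created_at"], default=None)


def should_drill_in(run: dict[str, Any]) -> bool:
    """True when the run has changed or unresolved snapshots worth listing."""
    summary = run.get("summary", {})
    return summary.get("changed", 0) > 0 or summary.get("unresolved", 0) > 0
```

If `latest_active_run` returns `None`, every listed run is stale or the PR has no runs yet, which is itself worth reporting.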
"Is the diff real or unrelated?"
The most useful judgment a code-aware agent can add. Combine three signals: scope match, flake history, and the actual rendered images. The agent should look at the screenshots — not just describe metadata.
- Scope check: `git diff master...HEAD --stat` (or against the PR's base branch) → list of touched paths. Cross-reference with `posthog:visual-review-runs-snapshots-list { id }` filtered to `result: changed` → story identifiers. Stories are namespaced like `<area>-<scene>--<story>--<theme>`; e.g. `scenes-app-settings-user--settings-user-profile--dark` maps to `frontend/src/scenes/settings/user/...`. Use this to translate story id → likely source path.
- Visual inspection: for each `changed` snapshot, the tool result contains `current_artifact.download_url` and `baseline_artifact.download_url`. These are pre-signed S3 URLs to PNG files; pull them and look:

      curl -s -o /tmp/vr-baseline.png "<baseline_artifact.download_url>"
      curl -s -o /tmp/vr-current.png "<current_artifact.download_url>"

  Then `Read` both files (the Read tool renders images visually) and compare. Things to call out:
  - The actual visible delta (text changed, button moved, layout shift, color drift, missing element).
  - Whether the change is consistent with the `diff_pixel_count` and `diff_percentage` in the metadata (e.g. 54% diff but the images look near-identical → screenshot framing changed, not the UI).
  - Whether the baseline and current have different dimensions (`width`/`height` fields). Mismatched dimensions usually mean the story rendered to a different viewport or didn't fully render before the screenshot, which is a flake signal, not a regression.
- Flake history: run the flake check below for any story that looks suspect.
- Verdict: combine all three:
  - Scope plausible + visible regression matches the code change → real diff, recommend approval.
  - Scope mismatch + dimensions mismatch + frequent prior changes → flake, recommend tolerating the hash.
  - Scope plausible + visible regression looks unintended → push a fix; do not approve.
Always include a one-line description of what you saw in the images — the user uses this to decide whether to trust your verdict without opening the VR UI themselves.
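The verdict rules above can be written down as a small decision function. This is a sketch under assumptions: the function name and its boolean inputs are hypothetical, each input is a judgment the agent has already made from the scope check, the images, and the flake history, and the final "inconclusive" branch is an addition for combinations the rules above do not cover.

```python
def triage_verdict(
    scope_plausible: bool,
    change_looks_intended: bool,
    dimensions_match: bool,
    frequent_prior_changes: bool,
) -> str:
    """Combine scope, visual inspection, and flake history into a recommendation."""
    if scope_plausible and change_looks_intended:
        return "real diff: recommend approval"
    if scope_plausible and not change_looks_intended:
        return "unintended regression: push a fix, do not approve"
    # From here on the scope does not match the code change.
    if not dimensions_match and frequent_prior_changes:
        return "flake: recommend tolerating the hash"
    # Not covered by the rules above; hand the judgment back to the user.
    return "inconclusive: describe what you saw and ask the user"
```

The returned string is a recommendation only; the one-line description of what the images actually show still has to accompany it.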
Flake check: "Has this story been changing?"
Once you have a suspect snapshot identifier:
posthog:visual-review-runs-snapshot-history-list { id: <snapshot_id> } → returns prior outcomes for the same story.
Verdicts:
- Mostly `unchanged` and this run's diff is the outlier → likely a real regression caused by this PR.
- Frequent `changed` across unrelated branches/master → flaky story; recommend tolerating the hash via the UI.
- Recent `removed` or large-jump dimension change → baseline likely stale; recommend re-baselining on master.
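The flake-check verdicts above can be approximated from a list of prior `result` values. The sketch below is illustrative only: the function name, the 30% threshold for "frequent", and the three-run window for "recent" are all assumptions chosen for the example, and it ignores the dimension-change signal, which would need the snapshots' `width`/`height` history.

```python
from collections import Counter


def flake_verdict(history: list[str], flaky_ratio: float = 0.3, recent: int = 3) -> str:
    """Classify a story from prior results (newest first, current run excluded).

    flaky_ratio and recent are hypothetical tuning knobs, not VR settings.
    """
    if not history:
        return "no history: cannot judge"
    if "removed" in history[:recent]:
        return "baseline likely stale: recommend re-baselining on master"
    counts = Counter(history)
    if counts["changed"] / len(history) >= flaky_ratio:
        return "flaky story: recommend tolerating the hash"
    return "likely a real regression caused by this PR"
```

For a story with ten straight `unchanged` results the function points at the PR; whether the `changed` entries come from unrelated branches still needs a look at the history rows themselves.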
Triaging the queue
When the user is doing housekeeping rather than asking about a specific PR:
- `posthog:visual-review-runs-counts-retrieve` → total queue size.
- `posthog:visual-review-runs-list { review_state: needs_review, limit: 50 }` (paginate if needed).
- Group by `branch` author or `run_type` to surface clusters (e.g., "12 PRs blocked on the same shared component change" usually means a single underlying root cause to address).
- Prefer surfacing runs whose `summary.changed > 0` over runs that are only `new`: `new` means no baseline yet, which is usually trivial to approve; `changed` is the real review work.
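The grouping and ordering steps above can be sketched as below. It is a minimal illustration with hypothetical names: runs are assumed to be dicts with a grouping field such as `branch` or `run_type` and a `summary.changed` count, and the ordering (clusters with real `changed` work first, then larger clusters) is one reasonable reading of the guidance, not a prescribed algorithm.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any


def cluster_runs(runs: list[dict[str, Any]], key: str = "branch") -> list[tuple[str, list[dict[str, Any]]]]:
    """Group needs_review runs by `key` and order clusters for triage."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for run in runs:
        groups[run.get(key, "unknown")].append(run)

    def has_changed(run: dict[str, Any]) -> bool:
        return run.get("summary", {}).get("changed", 0) > 0

    # Clusters containing `changed` snapshots first, then bigger clusters first.
    return sorted(
        groups.items(),
        key=lambda item: (not any(has_changed(r) for r in item[1]), -len(item[1])),
    )
```

Rendering the returned clusters as a short table (group, run count, changed count) fits the aggregate-question format this skill asks for.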
Output expectations
For PR-status questions, lead with the verdict in one line, then 2-4 bullets of supporting context. Always
include the `_posthogUrl` deep link to the run: humans need to see the rendered images to make the call;
the agent can only describe what it saw.
For triage / aggregate questions, a short table beats prose. Group by what the user is going to act on.
What NOT to do
- Do not approve or tolerate without explicit user confirmation. The verdict is yours to recommend; the decision to ship belongs to the user. Once they say "approve those" / "tolerate that", call the tool.
- Do not assume the failing GitHub check on a PR is unrelated to VR. If a `visual-review` check is red on a PR you're working on, that's the trigger to run this skill.
- Do not declare a verdict from metadata alone when `result: changed`. Pull the baseline and current PNGs and look at them; metadata can only say "something changed", not whether the change is intended.