triaging-visual-review-runs
npx skills add https://github.com/posthog/ai-plugin --skill triaging-visual-review-runs
Triaging visual review runs
Visual Review is PostHog's screenshot-regression product: CI captures storybook + playwright screenshots,
diffs them against committed baseline hashes, and gates the PR until a human approves the visible changes.
A PR with visual changes carries a visual-review GitHub status check that stays red until each diffed
snapshot is approved or tolerated in the VR UI.
This skill teaches an agent how to answer the questions a human reviewer would actually ask, by chaining
the VR MCP tools — instead of reaching for gh pr view and tab-hopping to the VR web UI. The read tools
cover status / scope / history / triage. Two are reversible DB-only triage marks (approve-create,
tolerate-create); one ships the change (finalize-create) — it commits the baseline and greens the gate,
and only that one needs explicit per-run human confirmation.
When this skill applies
Trigger this skill on any of:
- A PR number, branch name, or commit SHA paired with words like visual review, VR, snapshot, screenshot, storybook diff, playwright snapshot, baseline, approve, tolerated, quarantine.
- Questions about why a PR is blocked, what visually changed, or whether a diff is real.
- "Is my run done?" / "What's left to review?" / "Has this story flaked recently?"
- A failing `visual-review` GitHub check or a PR comment from the `posthog-bot` mentioning visual review.
When the user asks for the rendered diff image itself, the VR web UI is faster — direct them there. This skill is for everything around the diff: status, scope, history, triage.
Tools
Read tools (safe to call freely):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-list` | List runs, filter by `pr_number` / `commit_sha` / `branch` / `review_state`. Start here. |
| `posthog:visual-review-runs-retrieve` | Full detail for a single run (status, summary counts, supersession). |
| `posthog:visual-review-runs-snapshots-list` | Per-snapshot results inside a run: identifier, result, diff %, classification, baseline + current artifact URLs. Quarantined snapshots are excluded by default (see `quarantined_count`); pass `include_quarantined=true` to see them. |
| `posthog:visual-review-runs-snapshot-history-list` | A single story's last N runs across master/PRs: the flake check. |
| `posthog:visual-review-runs-counts-retrieve` | Aggregate counts for queue triage (how many runs in `needs_review`, etc.). |
| `posthog:visual-review-runs-tolerated-hashes-list` | Hashes the team has explicitly accepted as "known flake / acceptable variation". |
| `posthog:visual-review-repos-list` | Repos (one per GitHub repo). Usually only one matters; useful for filtering. |
| `posthog:visual-review-repos-retrieve` | Repo metadata: baseline file paths, PR-comment configuration. |
Triage tools (reversible, DB-only — they record a review decision but do NOT change the baseline or the gate):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-approve-create` | Mark changed / new snapshots reviewed (approved) in the DB. Does NOT commit or green the gate; ship via finalize. |
| `posthog:visual-review-runs-tolerate-create` | Mark a single changed snapshot as a known tolerated alternate. Does NOT change the baseline; use for benign variants. |
Ship tool (irreversible, outward-facing — requires explicit per-run human confirmation; see the gate):
| Tool | Purpose |
|---|---|
| `posthog:visual-review-runs-finalize-create` | Commit the approved baseline to the PR branch and green the GitHub `visual-review` check. This ships the change. |
Mark-reviewed call shape (approve-create):
- `id` (required): the run UUID. It's the route parameter, so the call fails without it.
- `snapshots: [{identifier, new_hash}]`: `new_hash` is the `content_hash` of each snapshot's `current_artifact`. This only records the review in the DB; nothing is committed and the gate stays red until you finalize.
Toleration call shape — both fields are required:
- `id` (required): the run UUID. It's the route parameter, so the call fails without it.
- `snapshot_id` (required): the UUID of the individual snapshot to tolerate (from `visual-review-runs-snapshots-list`). This identifies which snapshot inside the run; it does not replace the run `id`.
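The two triage call shapes above can be sketched as plain argument builders. This is a minimal illustration, not the MCP client itself: the helper names are hypothetical, and the snapshot row shape (`identifier`, `result`, `current_artifact.content_hash`) is an assumption inferred from the field names this document mentions.

```python
from typing import Any


def build_approve_args(run_id: str, snapshots: list[dict[str, Any]]) -> dict[str, Any]:
    """Arguments for approve-create (hypothetical helper).

    Keeps only changed/new snapshots and maps each one to
    {identifier, new_hash}, where new_hash is the content_hash of the
    snapshot's current artifact.
    """
    to_approve = [
        {
            "identifier": snap["identifier"],
            "new_hash": snap["current_artifact"]["content_hash"],
        }
        for snap in snapshots
        if snap["result"] in ("changed", "new")
    ]
    return {"id": run_id, "snapshots": to_approve}


def build_tolerate_args(run_id: str, snapshot_id: str) -> dict[str, str]:
    """Arguments for tolerate-create (hypothetical helper).

    Both the run id and the snapshot id are required; the snapshot id
    does not replace the run id.
    """
    if not run_id or not snapshot_id:
        raise ValueError("tolerate-create needs both id (run) and snapshot_id")
    return {"id": run_id, "snapshot_id": snapshot_id}
```

Neither payload ships anything: both only record a review decision, so the gate stays red until a finalize call succeeds.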
Finalize call shape (finalize-create) — the all-or-nothing ship action:
- `id` (required): the run UUID.
- `approve_all: true`: approve every still-pending `changed`/`new` snapshot before finalizing (tolerated ones are left alone). Use when you've verified every remaining diff is intended.
- Omit `approve_all` (default false) to finalize a run you've already reviewed snapshot-by-snapshot. Finalize is all-or-nothing: it fails with `409 not_fully_resolved` (and lists what's left) unless every changed/new snapshot is approved, tolerated, or quarantined.
- It commits exactly the snapshots approved in the DB; tolerated snapshots keep their baseline and are never overwritten. When a baseline commit is pushed, its SHA comes back on the run's `metadata.baseline_commit_sha`. It's absent when nothing needed committing (everything resolved by toleration/quarantine; the gate still greens) or when the commit was skipped: no PR, or a `409 sha_mismatch` because the PR has newer commits (that one leaves the gate red; re-run CI on the latest commit and finalize again).
If finalize fails with `409 stale_run`, the run has been superseded: call `visual-review-runs-list { pr_number }` and finalize the newest run. A successful finalize often kicks off a fresh CI run, which is normal.
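The three 409 outcomes described above each imply a different next step. A minimal lookup sketch follows. The error codes come from this document, but how the code is surfaced in the tool response is an assumption, and `next_step_after_finalize_error` is a hypothetical helper name.

```python
# Recovery actions for the 409 codes named in this document.
FINALIZE_RECOVERY = {
    "not_fully_resolved": (
        "Approve, tolerate, or quarantine the snapshots listed in the error, "
        "then finalize again."
    ),
    "sha_mismatch": (
        "The PR has newer commits and the gate stays red: re-run CI on the "
        "latest commit, then finalize again."
    ),
    "stale_run": (
        "The run was superseded: list runs for the PR and finalize the newest one."
    ),
}


def next_step_after_finalize_error(error_code: str) -> str:
    """Map a finalize 409 code to the recovery step (hypothetical helper)."""
    return FINALIZE_RECOVERY.get(
        error_code,
        "Unrecognized error: stop and report it to the user.",
    )
```

Falling back to "stop and report" for unknown codes keeps the agent from retrying an irreversible action blindly.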
The finalize gate
Finalize is the one irreversible, outward-facing action in this skill: it rewrites the baseline committed to the PR and greens the merge gate. Treat it like pushing to someone's branch — never automatic.
Before any finalize-create call, all of these must hold:
- You verified the diffs. You pulled the current (and, for `changed`, baseline) PNGs and looked at them, ran the flake check on anything suspect, and reached a per-snapshot verdict. Metadata alone is never enough.
- You presented the verdict and waited. Show the user, per snapshot, what changed and your recommendation, then stop.
- The user explicitly approved this run. A broad "get the gate green" / "fix the PR" is permission to investigate and recommend — NOT to finalize. When the task implies finalizing but the human hasn't said it for this specific run, ask.
approve-create and tolerate-create are reversible triage and don't need this gate — but they don't ship anything
either. The moment you're about to finalize-create and can't point to a specific human "yes" for this run, stop and ask.
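One way to make the three preconditions hard to skip is to track them explicitly and check them right before the call. This is an illustrative sketch only; `FinalizeGate` and its field names are hypothetical, not part of the VR tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinalizeGate:
    """Tracks the three finalize preconditions for one run (hypothetical)."""

    run_id: str
    diffs_verified: bool = False           # PNGs pulled and inspected, flake check done
    verdict_presented: bool = False        # per-snapshot verdict shown to the user
    approved_run_id: Optional[str] = None  # the run the user explicitly said "yes" to

    def may_finalize(self) -> bool:
        # Approval must name this exact run; a "yes" for another run
        # (for example a superseded one) does not carry over.
        return (
            self.diffs_verified
            and self.verdict_presented
            and self.approved_run_id == self.run_id
        )
```

Tying approval to a run id rather than a boolean reflects the per-run rule: when a newer run supersedes the one the user approved, the gate closes again.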
Vocabulary cheat sheet
These appear in tool output and matter for interpretation:
- Run `review_state`: `needs_review` (open, awaiting human), `clean` (zero diffs), `processing` (CI still uploading), `stale` (a newer run on the same PR has superseded this one; check `superseded_by_id`).
- Run `run_type`: `storybook` (component snapshots) or `playwright` (full-page e2e snapshots).
- Snapshot `result`: `unchanged`, `changed` (real diff), `new` (no baseline yet), `removed`.
- Snapshot `classification_reason`: `tolerated_hash` (matches a known-tolerated hash, no action needed), `below_threshold` (under the noise floor), `exact` (byte-identical), `""` (real diff requiring review).
- Snapshot `review_state`: `pending` or `approved`.
- Run `summary`: `total / changed / new / removed / unchanged / unresolved / tolerated_matched`. `unresolved` is what's actually blocking review.
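As a rough reading aid, the vocabulary above can be combined into a per-snapshot predicate. This is an approximation under assumptions: the field names match the list above, but the server's `summary.unresolved` count remains the authoritative number, and `blocks_review` is a hypothetical helper.

```python
def blocks_review(snapshot: dict) -> bool:
    """Approximate check for a snapshot that still needs a human decision.

    A snapshot counts when it is changed or new, was not auto-classified
    (classification_reason is empty), and is still pending review.
    """
    return (
        snapshot.get("result") in ("changed", "new")
        and snapshot.get("classification_reason", "") == ""
        and snapshot.get("review_state") == "pending"
    )
```

A `changed` snapshot whose `classification_reason` is `tolerated_hash` or `below_threshold` is filtered out here, in line with the "no action needed" reading above.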
Workflows
"What's the VR status of this PR?"
The single most common job. Map a PR number to its run state in two calls.
- `posthog:visual-review-runs-list { pr_number: <n>, limit: 5 }`: sort by `created_at` desc, take the latest non-stale one.
- If the run has `summary.changed > 0` or `summary.unresolved > 0`, drill in: `posthog:visual-review-runs-snapshots-list { id: <run_id> }` and report the `changed` snapshots.
Report back: PR number, run UUID, review_state, summary counts, and the _posthogUrl deep link so the
user can click straight to the diff viewer.
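The selection logic in this workflow is small enough to sketch. The example assumes each run is a dict with `review_state`, `created_at` (an ISO 8601 string), and `summary`; those shapes and the helper names are assumptions for illustration.

```python
from datetime import datetime
from typing import Any, Optional


def _created(run: dict[str, Any]) -> datetime:
    # Assumes created_at is ISO 8601. A trailing "Z" is normalized because
    # datetime.fromisoformat rejects it on older Python versions.
    return datetime.fromisoformat(run["created_at"].replace("Z", "+00:00"))


def latest_live_run(runs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Newest run that has not been superseded, or None if there is none."""
    live = [run for run in runs if run["review_state"] != "stale"]
    return max(live, key=_created) if live else None


def needs_drill_in(run: dict[str, Any]) -> bool:
    """True when the snapshot list is worth fetching for this run."""
    summary = run.get("summary", {})
    return summary.get("changed", 0) > 0 or summary.get("unresolved", 0) > 0
```

When `latest_live_run` returns `None`, every listed run is stale or the PR has no runs yet; report that rather than guessing at a status.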
"Is the diff real or unrelated?"
The most useful judgment a code-aware agent can add. Combine three signals: scope match, flake history, and the actual rendered images. The agent should look at the screenshots — not just describe metadata.
- Scope check: `git diff master...HEAD --stat` (or against the PR's base branch) → list of touched paths. Cross-reference with `posthog:visual-review-runs-snapshots-list { id }` filtered to `result: changed` → story identifiers. Stories are namespaced like `<area>-<scene>--<story>--<theme>`; e.g. `scenes-app-settings-user--settings-user-profile--dark` maps to `frontend/src/scenes/settings/user/...`. Use this to translate story id → likely source path.
- Visual inspection: for each `changed` snapshot, the tool result contains `current_artifact.download_url` and `baseline_artifact.download_url`. These are pre-signed S3 URLs to PNG files; pull them and look:

      curl -s -o /tmp/vr-baseline.png "<baseline_artifact.download_url>"
      curl -s -o /tmp/vr-current.png "<current_artifact.download_url>"

  Then `Read` both files (the Read tool renders images visually) and compare. Things to call out:
  - The actual visible delta (text changed, button moved, layout shift, color drift, missing element).
  - Whether the change is consistent with the `diff_pixel_count` and `diff_percentage` in the metadata (e.g. 54% diff but the images look near-identical → screenshot framing changed, not the UI).
  - Whether the baseline and current have different dimensions (`width`/`height` fields). Mismatched dimensions usually mean the story rendered to a different viewport or didn't fully render before screenshot, which is a flake signal, not a regression.
- Flake history: run the flake check below for any story that looks suspect.
- Verdict (combine all three):
  - Scope plausible + visible regression matches the code change → real diff, recommend approval.
  - Scope mismatch + dimensions mismatch + frequent prior changes → flake, recommend tolerating the hash.
  - Scope plausible + visible regression looks unintended → push a fix; do not approve.
Always include a one-line description of what you saw in the images — the user uses this to decide whether to trust your verdict without opening the VR UI themselves.
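The signals in this workflow can be combined mechanically once each one has been judged. The sketch below is illustrative only. It assumes Storybook identifiers always have exactly three `--`-separated parts, as in the example above, and that artifacts expose `width` and `height`. All function names are hypothetical, and the "looks intended" input still comes from actually viewing the images.

```python
from typing import NamedTuple


class StoryId(NamedTuple):
    component: str  # the "<area>-<scene>" part
    story: str
    theme: str


def parse_story_id(identifier: str) -> StoryId:
    """Split "<area>-<scene>--<story>--<theme>" into its three parts."""
    parts = identifier.split("--")
    if len(parts) != 3:
        raise ValueError(f"unexpected story identifier: {identifier!r}")
    return StoryId(*parts)


def dimensions_differ(baseline: dict, current: dict) -> bool:
    """True when baseline and current artifacts have different sizes."""
    return (baseline["width"], baseline["height"]) != (current["width"], current["height"])


def verdict(scope_plausible: bool, looks_intended: bool,
            dims_mismatch: bool, flaky_history: bool) -> str:
    """Combine the three signals into a recommendation string."""
    if scope_plausible and looks_intended:
        return "real diff: recommend approval"
    if scope_plausible:
        return "unintended regression: push a fix, do not approve"
    if dims_mismatch and flaky_history:
        return "flake: recommend tolerating the hash"
    return "inconclusive: report the signals and ask the user"
```

The explicit "inconclusive" branch covers mixed signals the three rules above do not name, so the agent reports what it saw instead of forcing a verdict.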
Flake check: "Has this story been changing?"
Once you have a suspect snapshot identifier:
posthog:visual-review-runs-snapshot-history-list { id: <snapshot_id> } → returns prior outcomes for the same story.
Verdicts:
- Mostly `unchanged` and this run's diff is the outlier → likely a real regression caused by this PR.
- Frequent `changed` across unrelated branches/master → flaky story; recommend tolerating the hash via the UI.
- Recent `removed` or large-jump dimension change → baseline likely stale; recommend re-baselining on master.
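The verdict rules above can be approximated with a simple ratio over prior outcomes. This sketch makes several assumptions that the document does not specify: the history is a list of `result` strings ordered newest first and excluding the current run, "frequent" means at least 30% `changed`, and "recent" means the three newest entries. It does not cover the dimension-jump signal. Treat the thresholds as placeholders to tune.

```python
from collections import Counter


def flake_verdict(history: list[str], flaky_ratio: float = 0.3, recent: int = 3) -> str:
    """Classify a story from its prior results (assumed newest first)."""
    if not history:
        return "no-history"
    if "removed" in history[:recent]:
        # A recent removal suggests the baseline itself is out of date.
        return "stale-baseline"
    changed_share = Counter(history)["changed"] / len(history)
    if changed_share >= flaky_ratio:
        return "flaky"
    # Mostly unchanged before, so this run's diff is the outlier.
    return "likely-regression"
```

For example, a story with ten prior `unchanged` results yields `likely-regression`, which points at the PR rather than the story.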
Triaging the queue
When the user is doing housekeeping rather than asking about a specific PR:
- `posthog:visual-review-runs-counts-retrieve` → total queue size.
- `posthog:visual-review-runs-list { review_state: needs_review, limit: 50 }` (paginate if needed).
- Group by `branch` author or `run_type` to surface clusters (e.g., "12 PRs blocked on the same shared component change" usually means a single underlying root cause to address).
- Prefer surfacing runs whose `summary.changed > 0` over runs that are only `new`. `new` means no baseline yet, which is usually trivial to approve; `changed` is the real review work.
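The grouping and prioritization steps above might look like the following. The run shape (`run_type`, `summary.changed`) follows the vocabulary in this document, while the helper names and the tie-breaking rule (more changed snapshots first) are assumptions.

```python
from collections import defaultdict
from typing import Any


def cluster_runs(runs: list[dict[str, Any]], key: str = "run_type") -> dict[str, list[dict[str, Any]]]:
    """Group needs_review runs by a field to surface shared root causes."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for run in runs:
        groups[str(run.get(key, "unknown"))].append(run)
    return dict(groups)


def prioritize(runs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Put runs with changed snapshots ahead of runs that are only new."""

    def sort_key(run: dict[str, Any]) -> tuple[bool, int]:
        changed = run.get("summary", {}).get("changed", 0)
        # False sorts before True, so runs with changes come first;
        # within those, larger change counts come first.
        return (changed == 0, -changed)

    return sorted(runs, key=sort_key)
```

Passing `key="branch"` to `cluster_runs` gives the per-branch view; a cluster with many runs is the cue to look for one shared component change.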
Output expectations
For PR-status questions, lead with the verdict in one line, then 2-4 bullets of supporting context. Always
include the _posthogUrl deep link to the run: humans need to see the rendered images to make the call, and
the agent's written description is no substitute for them.
For triage / aggregate questions, a short table beats prose. Group by what the user is going to act on.
What NOT to do
- Do not approve or tolerate without explicit user confirmation. The verdict is yours to recommend; the decision to ship belongs to the user. Once they say "approve those" / "tolerate that", call the tool.
- Do not assume the failing GitHub check on a PR is unrelated to VR. If a `visual-review` check is red on a PR you're working on, that's the trigger to run this skill.
- Do not declare a verdict from metadata alone when `result: changed`. Pull the baseline and current PNGs and look at them; metadata can only say "something changed", not whether the change is intended.