distributed-triagepar pytorch
Sub-triages issues in the oncall:distributed queue by assigning distributed module labels, routing to sub-oncalls, and marking triaged. Use when an issue has…
npx skills add https://github.com/pytorch/pytorch --skill distributed-triageDistributed Issue Triage Sub-Skill
This sub-skill picks up where the PT-level triage bot leaves off. It processes issues that already have the oncall: distributed label and performs second-level triage: routing to a distributed sub-oncall, classifying by module, and marking triaged.
Contents
- MCP Tools Available
- Reference Files
- Distributed Triage Steps
- Step 0: Already Triaged by Human
- Step 1: Is This Actually a Distributed Issue?
- Step 2: Route to Distributed Sub-Oncall
- Step 3: Classify Module
- Step 4: Type Labels
- Step 5: High Priority — REQUIRES HUMAN REVIEW
- Step 6: Missing Reproduction
- Constraints
Distributed labels reference: See distributed-labels.json for the labels this skill is allowed to apply. ONLY apply labels from this file.
Distributed triage rubric: See distributed-rubric.md for detailed routing guidance, module classification signals, and confidence calibration.
Response templates: See templates.json for distributed-specific comment templates.
MCP Tools Available
Use these GitHub MCP tools for triage:
| Tool | Purpose |
|---|---|
mcp__github__issue_read | Get issue details, comments, and existing labels |
mcp__github__issue_write | Apply labels or close issues |
mcp__github__add_issue_comment | Add comment (only for reproduction requests or mislabel flags) |
mcp__github__search_issues | Find similar issues for context |
Distributed Triage Steps
0) Already Triaged by Human?
A human has fully classified the issue only when it has BOTH:
- Any
module:label listed in distributed-labels.json, AND - One of the sub-oncall labels:
oncall: distributed parallelisms,oncall: distributed infra, oroncall: distributed checkpointing.
If both are present:
- Add
ptd-bot-triagedlabel - STOP — a human already classified this issue.
If only one is present (a module label without a sub-oncall, or a sub-oncall without a module label), triage is incomplete — proceed to Step 1. The PT-level triage bot can apply distributed module labels alongside oncall: distributed, but it does not pick the sub-oncall; that is your job.
This step alone should clear a large portion of the backlog.
1) Is This Actually a Distributed Issue?
Read the issue title, description, and comments. Determine whether the issue is actually related to distributed training.
Signs it is NOT a distributed issue:
- Single-GPU issue with no distributed code (e.g.,
torch.nnon one GPU, CUDA OOM on one device) - Build/packaging issue (e.g.,
undefined symbol: ncclAlltoAllatimport torchwith no distributed code) - Pure
torch.compileissue with no distributed component - Issue about a domain library (vision, text, audio) that happens to mention "distributed"
If NOT a distributed issue:
- Add
triage review+ptd-bot-triagedlabels - Post a comment using the
not_distributedtemplate from templates.json - Do NOT remove
oncall: distributed— let the human oncall re-route - STOP
2) Route to Distributed Sub-Oncall
Each issue carries exactly ONE sub-oncall label. If the issue already has one of the three sub-oncall labels (oncall: distributed parallelisms, oncall: distributed infra, or oncall: distributed checkpointing), keep it as-is — do NOT add a second sub-oncall, even if your own classification would have picked a different one. Use the existing sub-oncall to decide the next step (continue to Step 3 if it's oncall: distributed parallelisms; otherwise add ptd-bot-triaged and STOP per the rules below).
If no sub-oncall is present, apply exactly one based on the routing rules in distributed-rubric.md:
| Sub-Oncall Label | When to Apply |
|---|---|
oncall: distributed parallelisms | FSDP, DDP, DTensor, tensor parallel, context parallel, pipeline parallel. This is the default when unsure. |
oncall: distributed infra | c10d, process groups, collectives, NCCL/Gloo/MPI backends, elastic/torchrun, RPC, stores, distributed tools, DeviceMesh, symmetric memory |
oncall: distributed checkpointing | Distributed checkpoint save/load, DCP, state_dict utilities, async checkpointing |
Use the routing decision tree and edge cases in distributed-rubric.md Section 1 to determine the correct sub-oncall.
After routing to oncall: distributed infra or oncall: distributed checkpointing:
- Add
ptd-bot-triaged - STOP — the sub-oncall team owns further triage
After routing to oncall: distributed parallelisms:
- Continue to Step 3 for module classification
3) Classify Module
From the issue description, comments, code snippets, and stack traces, classify into one or more distributed modules. Consult the module classification signals in distributed-rubric.md.
Confidence-based actions:
| Confidence | Criteria | Action |
|---|---|---|
| HIGH or MEDIUM | Explicit module mention, obvious API usage, or probable module based on context | Add module: label(s) + ptd-bot-triaged |
| LOW | Cannot determine module — vague description, no code, no stack trace | Add triage review + ptd-bot-triaged |
Rules:
- You can apply multiple module labels when the issue spans modules (e.g.,
module: fsdp+module: dtensorfor FSDP2 issues that hit DTensor bugs). - When an issue has
oncall: pt2already applied, do NOT remove it. Add distributed module labels alongside it. - When the module is unclear, add
triage review+ptd-bot-triaged— do NOT guess a module label.
4) Type Labels
If the issue is not a bug report, add the appropriate type label:
feature— wholly new functionality that does not exist today in any formenhancement— improvement to something that already works (e.g., performance optimization, better error messages, adding a native backend for an op that already runs via fallback)
Most distributed issues are bug reports — do not add a type label for bugs. If the issue says the operation "currently works" or "falls back to" a slower path, that is enhancement, not feature. If the enhancement is about performance, also add module: performance.
5) High Priority — REQUIRES HUMAN REVIEW
CRITICAL: If you believe an issue is high priority, you MUST:
- Add
triage reviewlabel and do NOT addptd-bot-triaged
Do NOT directly add high priority without human confirmation.
High priority criteria for distributed issues:
- Crash / segfault / illegal memory access in distributed code
- Silent correctness issue (wrong results from collectives, incorrect gradient sync)
- Regression from a prior version (e.g., FSDP worked in 2.x, broken in 2.y)
- Hang affecting multi-node training (NCCL timeout, deadlock in collectives)
- Data corruption during distributed checkpointing
- Internal assert failure in c10d or process group code
- Many users affected or core distributed component impacted
6) Missing Reproduction
If the issue lacks a minimal reproduction script:
- Add
needs reproduction+ptd-bot-triagedlabels - Post a comment using the
needs_distributed_reproductiontemplate from templates.json
Do NOT request reproduction when:
- The issue already has a code snippet, script, or steps that someone could follow to reproduce
- The issue is a feature request (no repro needed)
- A multi-node script is provided (that counts as reproduction even if you can't run it locally)
Constraints
DO NOT:
- Close issues (only the PT-level bot or humans close issues)
- Remove existing labels — only add labels
- Remove
oncall: distributed— it stays even if the issue is mislabeled - Remove
oncall: pt2— if already present, keep it - Remove
bot-triaged— it is applied by the parent skill and must stay - Add labels not in distributed-labels.json
- Add comments to issues except when using the templates in Step 1 (mislabel) or Step 6 (reproduction)
- Assign issues to users
- Add
high prioritydirectly — usetriage reviewand let humans decide
DO:
- Be conservative — when in doubt, add
triage reviewfor human attention - Add
ptd-bot-triagedwhenever the bot has processed the issue, regardless of confidence. Pair withtriage reviewfor LOW-confidence or uncertain cases so the cron sweep won't re-pick it. (Exception: §5 high-priority flow intentionally omitsptd-bot-triaged.) - Always add a sub-oncall label (Step 2) before module labels (Step 3)
- Read the full issue including comments before classifying
- Check the rubric's "Common Mislabel Traps" section before finalizing
Note: bot-triaged is automatically applied by the parent skill's post-hook after any issue mutation. You do not need to add it manually.