distributed-triage

Author: pytorch

Sub-triages issues in the oncall: distributed queue: assigns distributed module labels, routes to a sub-oncall, and marks issues as triaged. Use when an issue…

npx skills add https://github.com/pytorch/pytorch --skill distributed-triage

Distributed Issue Triage Sub-Skill

This sub-skill picks up where the PT-level triage bot leaves off. It processes issues that already have the oncall: distributed label and performs second-level triage: routing to a distributed sub-oncall, classifying by module, and marking triaged.

Contents

Distributed labels reference: See distributed-labels.json for the labels this skill is allowed to apply. ONLY apply labels from this file.

Distributed triage rubric: See distributed-rubric.md for detailed routing guidance, module classification signals, and confidence calibration.

Response templates: See templates.json for distributed-specific comment templates.


MCP Tools Available

Use these GitHub MCP tools for triage:

| Tool | Purpose |
| --- | --- |
| mcp__github__issue_read | Get issue details, comments, and existing labels |
| mcp__github__issue_write | Apply labels or close issues |
| mcp__github__add_issue_comment | Add comment (only for reproduction requests or mislabel flags) |
| mcp__github__search_issues | Find similar issues for context |

Comment Deduplication

Before adding any issue comment:

  1. Read the existing comments with mcp__github__issue_read.
  2. Check whether the triage bot has already posted the same template or a substantially equivalent request/explanation.
  3. If a duplicate exists, do not add another comment. Continue with any non-comment actions that are still needed, such as labels.

Treat a comment as duplicate even if the wording differs slightly or an older template version was used. For distributed triage, this includes an existing distributed reproduction request or an existing "not distributed" notice.
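
The duplicate check above could be approximated with a fuzzy text comparison. The following is only a sketch under stated assumptions: the bot login `pytorch-triage-bot`, the 0.8 similarity cutoff, and the `author`/`body` comment shape are all hypothetical and do not come from the skill files. A real check would also need to recognize older template versions, which plain string similarity may miss.

```python
from difflib import SequenceMatcher

# Hypothetical values: neither the bot login nor the cutoff is defined by the skill.
BOT_LOGIN = "pytorch-triage-bot"
SIMILARITY_THRESHOLD = 0.8


def _normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not matter."""
    return " ".join(text.lower().split())


def is_duplicate_comment(existing_comments, new_body, bot_login=BOT_LOGIN):
    """Return True if the bot already posted a substantially similar comment.

    existing_comments is assumed to be a list of dicts with "author" and
    "body" keys; adapt this to whatever the issue-read tool actually returns.
    """
    target = _normalize(new_body)
    for comment in existing_comments:
        if comment["author"] != bot_login:
            continue  # only the bot's own earlier messages count as duplicates
        ratio = SequenceMatcher(None, _normalize(comment["body"]), target).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            return True
    return False
```

If this returns True, skip the comment and continue with any label actions that are still needed.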


Distributed Triage Steps

0) Already Triaged by Human?

A human has fully classified the issue only when it has BOTH:

  1. Any module: label listed in distributed-labels.json, AND
  2. One of the sub-oncall labels: oncall: distributed parallelisms, oncall: distributed infra, or oncall: distributed checkpointing.

If both are present:

  • Add bot-triaged + triaged labels (the human classification is complete and confident)
  • STOP — a human already classified this issue.

If only one is present (a module label without a sub-oncall, or a sub-oncall without a module label), triage is incomplete — proceed to Step 1. The PT-level triage bot can apply distributed module labels alongside oncall: distributed, but it does not pick the sub-oncall; that is your job.

This step alone should clear a large portion of the backlog.
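
As a rough illustration of the Step 0 check, the sketch below treats an issue as human-classified only when both kinds of label are present. The function name and the `allowed_module_labels` parameter are hypothetical; in practice the module labels would come from distributed-labels.json, whose schema is not shown here.

```python
SUB_ONCALL_LABELS = {
    "oncall: distributed parallelisms",
    "oncall: distributed infra",
    "oncall: distributed checkpointing",
}


def human_triage_labels(issue_labels, allowed_module_labels):
    """Return the labels to add if a human fully classified the issue, else None.

    allowed_module_labels stands in for the module labels listed in
    distributed-labels.json (loading that file is left out of this sketch).
    """
    labels = set(issue_labels)
    has_module = bool(labels & set(allowed_module_labels))
    has_sub_oncall = bool(labels & SUB_ONCALL_LABELS)
    if has_module and has_sub_oncall:
        return ["bot-triaged", "triaged"]  # classification complete: add and STOP
    return None  # only one (or neither) present: proceed to Step 1
```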

1) Is This Actually a Distributed Issue?

Read the issue title, description, and comments. Determine whether the issue is actually related to distributed training.

Signs it is NOT a distributed issue:

  • Single-GPU issue with no distributed code (e.g., torch.nn on one GPU, CUDA OOM on one device)
  • Build/packaging issue (e.g., undefined symbol: ncclAlltoAll at import torch with no distributed code)
  • Pure torch.compile issue with no distributed component
  • Issue about a domain library (vision, text, audio) that happens to mention "distributed"

If NOT a distributed issue:

  1. Add triage review + bot-triaged labels
  2. Post a comment using the not_distributed template from templates.json, unless an equivalent "not distributed" comment already exists
  3. Do NOT remove oncall: distributed — let the human oncall re-route
  4. STOP

2) Route to Distributed Sub-Oncall

Each issue carries exactly ONE sub-oncall label. If the issue already has one of the three sub-oncall labels (oncall: distributed parallelisms, oncall: distributed infra, or oncall: distributed checkpointing), keep it as-is — do NOT add a second sub-oncall, even if your own classification would have picked a different one. Use the existing sub-oncall to decide the next step (continue to Step 3 if it's oncall: distributed parallelisms; otherwise add bot-triaged and STOP per the rules below).

If no sub-oncall is present, apply exactly one based on the routing rules in distributed-rubric.md:

| Sub-Oncall Label | When to Apply |
| --- | --- |
| oncall: distributed parallelisms | FSDP, DDP, DTensor, tensor parallel, context parallel, pipeline parallel. This is the default when unsure. |
| oncall: distributed infra | c10d, process groups, collectives, NCCL/Gloo/MPI backends, elastic/torchrun, RPC, stores, distributed tools, DeviceMesh, symmetric memory |
| oncall: distributed checkpointing | Distributed checkpoint save/load, DCP, state_dict utilities, async checkpointing |

Use the routing decision tree and edge cases in distributed-rubric.md Section 1 to determine the correct sub-oncall.

After routing to oncall: distributed infra or oncall: distributed checkpointing:

  • Add bot-triaged (the sub-oncall routing is a confident, complete outcome)
  • STOP — the sub-oncall team owns further triage

After routing to oncall: distributed parallelisms:

  • Continue to Step 3 for module classification
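
The Step 2 rules can be sketched as a small function. It assumes the sub-oncall classification itself (the `classified` argument) was already produced from the rubric; the function name, the argument names, and the `"step 3"`/`"stop"` return values are hypothetical conventions for this example only.

```python
PARALLELISMS = "oncall: distributed parallelisms"
INFRA = "oncall: distributed infra"
CHECKPOINTING = "oncall: distributed checkpointing"
SUB_ONCALLS = (PARALLELISMS, INFRA, CHECKPOINTING)


def route_sub_oncall(issue_labels, classified=None):
    """Return (labels_to_add, next_step) following the Step 2 rules.

    classified is the sub-oncall picked from the rubric, or None when unsure.
    """
    existing = [label for label in SUB_ONCALLS if label in issue_labels]
    if existing:
        # An existing sub-oncall always wins; never add a second one.
        sub_oncall, to_add = existing[0], []
    else:
        # Parallelisms is the documented default when unsure.
        sub_oncall = classified if classified in SUB_ONCALLS else PARALLELISMS
        to_add = [sub_oncall]
    if sub_oncall == PARALLELISMS:
        return to_add, "step 3"  # module classification still needed
    return to_add + ["bot-triaged"], "stop"  # sub-oncall team owns further triage
```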

3) Classify Module

From the issue description, comments, code snippets, and stack traces, classify into one or more distributed modules. Consult the module classification signals in distributed-rubric.md.

Confidence-based actions:

| Confidence | Criteria | Action |
| --- | --- | --- |
| HIGH or MEDIUM | Explicit module mention, obvious API usage, or probable module based on context | Add module: label(s) + bot-triaged + triaged |
| LOW | Cannot determine module: vague description, no code, no stack trace | Add triage review + bot-triaged (no triaged; punting to a human) |

Rules:

  • You can apply multiple module labels when the issue spans modules (e.g., module: fsdp + module: dtensor for FSDP2 issues that hit DTensor bugs).
  • When an issue has oncall: pt2 already applied, do NOT remove it. Add distributed module labels alongside it.
  • When the module is unclear, add triage review + bot-triaged — do NOT guess a module label.
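
A minimal sketch of the confidence-based actions, assuming the confidence level and candidate module labels have already been determined from the rubric. The function name and string constants are illustrative only.

```python
def module_classification_labels(confidence, module_labels):
    """Return the labels to add for Step 3.

    confidence is "HIGH", "MEDIUM", or "LOW"; module_labels is the list of
    module: labels the classifier settled on (may be empty).
    """
    if confidence in ("HIGH", "MEDIUM") and module_labels:
        # Multiple module labels are allowed when the issue spans modules.
        return sorted(set(module_labels)) + ["bot-triaged", "triaged"]
    # LOW confidence, or no module identified: punt to a human, never guess.
    return ["triage review", "bot-triaged"]
```

For example, an FSDP2 issue that hits a DTensor bug would pass both `module: fsdp` and `module: dtensor` with HIGH or MEDIUM confidence.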

4) Type Labels

If the issue is not a bug report, add the appropriate type label:

  • feature — wholly new functionality that does not exist today in any form
  • enhancement — improvement to something that already works (e.g., performance optimization, better error messages, adding a native backend for an op that already runs via fallback)

Most distributed issues are bug reports — do not add a type label for bugs. If the issue says the operation "currently works" or "falls back to" a slower path, that is enhancement, not feature. If the enhancement is about performance, also add module: performance.
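
The type-label rule could look like the sketch below. It assumes the issue kind has already been judged as one of `"bug"`, `"feature"`, or `"enhancement"`; those string values and the function name are hypothetical.

```python
def type_labels(kind, about_performance=False):
    """Return the type labels to add for Step 4."""
    if kind == "bug":
        return []  # bugs get no type label
    if kind == "feature":
        return ["feature"]  # wholly new functionality
    if kind == "enhancement":
        labels = ["enhancement"]  # improves something that already works
        if about_performance:
            labels.append("module: performance")
        return labels
    raise ValueError(f"unknown issue kind: {kind!r}")
```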

5) High Priority — REQUIRES HUMAN REVIEW

CRITICAL: If you believe an issue is high priority, you MUST:

  1. Add triage review label and do NOT add bot-triaged

Do NOT directly add high priority without human confirmation.

High priority criteria for distributed issues:

  • Crash / segfault / illegal memory access in distributed code
  • Silent correctness issue (wrong results from collectives, incorrect gradient sync)
  • Regression from a prior version (e.g., FSDP worked in 2.x, broken in 2.y)
  • Hang affecting multi-node training (NCCL timeout, deadlock in collectives)
  • Data corruption during distributed checkpointing
  • Internal assert failure in c10d or process group code
  • Many users affected or core distributed component impacted

6) Missing Reproduction

If the issue lacks a minimal reproduction script:

  1. Add needs reproduction + bot-triaged labels
  2. Post a comment using the needs_distributed_reproduction template from templates.json, unless an equivalent distributed reproduction request already exists

Do NOT request reproduction when:

  • The issue already has a code snippet, script, or steps that someone could follow to reproduce
  • The issue is a feature request (no repro needed)
  • A multi-node script is provided (that counts as reproduction even if you can't run it locally)
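
The reproduction rules, combined with the comment deduplication requirement, can be sketched as follows. The three boolean inputs are assumed to come from reading the issue and its comments; their names are hypothetical.

```python
def missing_repro_actions(has_repro_material, is_feature_request, request_already_posted):
    """Return (labels_to_add, should_comment) for Step 6.

    has_repro_material covers any snippet, script, or followable steps,
    including multi-node scripts that cannot be run locally.
    """
    if has_repro_material or is_feature_request:
        return [], False  # no reproduction request needed
    labels = ["needs reproduction", "bot-triaged"]
    # Labels are still applied when an equivalent request already exists.
    return labels, not request_already_posted
```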

Constraints

DO NOT:

  • Close issues (only the PT-level bot or humans close issues)
  • Remove existing labels — only add labels
  • Remove oncall: distributed — it stays even if the issue is mislabeled
  • Remove oncall: pt2 — if already present, keep it
  • Remove bot-triaged or triaged — they are applied by the parent skill and must stay
  • Add triaged when you are NOT confident in the classification, i.e., any time the action also applies triage review or needs reproduction, or in the Step 5 high-priority flow
  • Add labels not in distributed-labels.json
  • Add comments to issues except when using the templates in Step 1 (mislabel) or Step 6 (reproduction)
  • Add a comment when the bot has already posted the same template or a substantially equivalent message on the issue
  • Assign issues to users
  • Add high priority directly — use triage review and let humans decide

DO:

  • Be conservative — when in doubt, add triage review for human attention
  • Add bot-triaged whenever the bot has processed the issue, regardless of confidence. Pair with triage review for LOW-confidence or uncertain cases so the cron sweep won't re-pick it. (Exception: the Step 5 high-priority flow intentionally omits bot-triaged.)
  • Add triaged ONLY when you reach a confident, complete classification: a human already classified it (Step 0), a confident sub-oncall routing (Step 2), or a HIGH/MEDIUM-confidence module classification (Step 3).
  • Always add a sub-oncall label (Step 2) before module labels (Step 3)
  • Read the full issue including comments before classifying
  • Read existing comments before every comment action and skip duplicate bot messages
  • Check the rubric's "Common Mislabel Traps" section before finalizing
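
Several of these constraints can be checked mechanically before any labels are written. The sketch below is one possible guard, not part of the skill itself: the function name is hypothetical, and `allowed` stands in for the label set from distributed-labels.json.

```python
def check_label_additions(proposed, allowed):
    """Return a list of constraint violations for a proposed set of label additions."""
    proposed = set(proposed)
    problems = []
    unknown = sorted(proposed - set(allowed))
    if unknown:
        problems.append(f"labels not in the allowed set: {unknown}")
    if "high priority" in proposed:
        problems.append("high priority must not be added directly; use triage review")
    if "triaged" in proposed and proposed & {"triage review", "needs reproduction"}:
        problems.append("triaged must not accompany triage review or needs reproduction")
    return problems
```

An empty result means the additions pass these particular checks; it does not cover the comment or assignment constraints.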
