skill-progressive-disclosure-design

Author: samber

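Decides how to split skill content between SKILL.md and reference files, to improve context efficiency and keep triggering reliable. Use whenever creating a new Claude skill, refactoring an existing one, or when SKILL.md exceeds 300-400 lines. Also trigger when the user mentions "progressive disclosure", "reference files", "splitting a skill", "skill bundling", "skill context window", "SKILL.md too long", "what goes in references/", "skill structure", or expresses uncertainty about any of these.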

npx skills add https://github.com/samber/cc-skills --skill skill-progressive-disclosure-design

Skill Progressive Disclosure Design

Each section that recommends a direction includes explicit pros and cons. The decisions in this skill are trade-offs, not rules. The model using this skill should reason from the trade-offs to the user's specific situation rather than apply rules blindly.

Triggering vs. disclosure: separate these first

Two problems get conflated and need separating before any splitting decision.

Triggering is whether Claude invokes the skill at all. It is driven entirely by the YAML description; file splitting does not affect it. If the question is "my skill doesn't trigger reliably", do not split files; fix the description instead (use run_loop.py from the skill-creator skill).

Progressive disclosure is what loads after the skill activates. The SKILL.md body always loads. Files in references/ load only when SKILL.md tells the model to read a specific file. Files in scripts/ execute without being loaded into context at all. This is where context protection happens.

If the user is asking about splitting because of triggering issues, surface the confusion first and redirect.

Default: do not split

A monolithic SKILL.md beats a split one until proven otherwise.

Split only when at least one is true:

  • SKILL.md exceeds ~400 lines and content has natural branches.
  • Empirical evidence (eval transcripts) shows the model wasting context on irrelevant sections.
  • Specific content is large and only needed in narrow conditions.
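
The three conditions above read as a simple "any of" test. A minimal sketch, assuming the caller supplies each judgment as a boolean; the function and parameter names are illustrative, and the 400-line default is the approximate figure from the list:

```python
def split_worth_considering(
    skill_md_lines: int,
    has_natural_branches: bool,
    evals_show_wasted_context: bool,
    has_large_narrow_content: bool,
    line_threshold: int = 400,  # approximate figure from the list above
) -> bool:
    """Return True when at least one split condition holds (illustrative helper)."""
    # Length alone is not enough: the content must also branch naturally.
    too_long_with_branches = (
        skill_md_lines > line_threshold and has_natural_branches
    )
    return (
        too_long_with_branches
        or evals_show_wasted_context
        or has_large_narrow_content
    )
```

A long file with no natural branches returns False here, which matches the default of staying monolithic.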

Pros of staying monolithic:

  • Single context load, no router prose to maintain.
  • No tool-call overhead from reading references.
  • No risk of the model loading the wrong reference or skipping a needed one.
  • Easier to maintain: one file, one source of truth.
  • Better for highly interconnected content where context is global.
  • Easier for human reviewers to read end-to-end.

Cons of staying monolithic:

  • Every invocation pays the full token cost, even when only 10% of the content is relevant.
  • Does not scale past ~500 lines without degrading the model's ability to find what matters.
  • No mechanism to gate rare or niche content.
  • All content must justify its always-loaded status.
  • Maintenance gets harder as the file grows.

Three split axes that work

1. Variant branch

User intent selects exactly one path. SKILL.md holds the decision logic and shared workflow. Each references/<variant>.md holds path-specific detail.

my-skill/
├── SKILL.md               # decision tree + shared steps
└── references/
    ├── variant-a.md
    ├── variant-b.md
    └── variant-c.md

Examples of clean variants: cloud provider, database engine, framework choice, output format, language.

Pros:

  • Each invocation loads only the matching variant; large savings when variants are big.
  • Variants evolve independently, simplifying maintenance.
  • Adding a new variant does not bloat existing content.
  • Mental model is easy: select one path based on input.
  • Maps cleanly to user intent that already mentions the variant.

Cons:

  • Requires routing logic in SKILL.md, eating back some of the line savings.
  • Cross-cutting changes touch every variant file, multiplying effort.
  • Risk of treatments diverging across variants over time.
  • If user intent is ambiguous, the model may load multiple variants and lose the savings.
  • If variants share more than ~60% of their content, the abstraction breaks down.

2. Workflow vs. reference data

SKILL.md holds the procedure (verbs, sequence, decisions). references/ holds lookup material queried by key.

Good reference content: schemas, error code tables, API surface listings, example galleries, configuration option matrices, design tokens.

Pros:

  • Highest leverage of all splits: lookups are narrow, the model reads one entry.
  • Natural conceptual boundary (procedure vs. data).
  • Reference can grow large without affecting per-invocation cost.
  • Adding new reference entries does not touch the workflow.
  • Reference data can often be machine-generated and regenerated.

Cons:

  • The model must know what to look up before reading. Pointer must encode lookup keys explicitly.
  • Fails when the workflow needs to weave reference data inline rather than at discrete points.
  • Splits content that is conceptually unified, which makes it harder for human readers to follow.
  • The model may miss broader context that lives only in the reference.
  • Lookup data that is small (under ~50 lines total) is rarely worth splitting.

3. Depth tier (common path vs. edge cases)

SKILL.md covers the 80% case. references/edge-cases.md covers the rest.

The pointer must read like:

If you see X, Y, or Z, stop and read references/edge-cases.md before continuing.

Pros:

  • Common path stays minimal, fast, cheap.
  • Edge cases can be exhaustive without polluting every invocation.
  • Easy to extend edge-case coverage without touching the common path.
  • Mirrors how experts work: defaults first, exceptions on demand.

Cons:

  • The load condition must be sharp and observable from user input. Most edge cases do not satisfy this.
  • Vague conditions cause either always-loading (waste) or never-loading (dead weight).
  • Edge cases get less testing because evals naturally cluster on common queries.
  • The model may follow the common path past a point where it should have escalated.
  • The 80/20 estimate is often wrong; what looked like an edge case turns out to be common.

Splits that do not work

For each anti-pattern, "why it appears attractive" shows what makes designers reach for it; "why it fails" shows what goes wrong in practice.

Topic-based splits where invocations do not cluster by topic

A testing skill split into unit.md, integration.md, mocks.md is a typical example.

Why it appears attractive:

  • Conceptually clean, mirrors how a human would organize documentation.
  • Easy to navigate as a maintainer.
  • Plausibly reduces context per invocation.

Why it fails:

  • Real tasks span 2-3 topics, forcing multiple loads per invocation.
  • Cross-topic concerns get duplicated or fragmented.
  • The savings are theoretical, not empirical.

Splitting to hit a line target without a real branching condition

Why it appears attractive:

  • A heuristic ("keep SKILL.md under 400 lines") feels like a clean rule to satisfy.
  • Splitting feels like progress.

Why it fails:

  • Without a branching condition, the references are either all loaded together or loaded on every run, providing no savings.
  • Adds router prose to SKILL.md, often making the total content longer.

Rare-but-critical content in references/

Why it appears attractive:

  • The content is large or specialized.
  • Moving it out of SKILL.md feels like good hygiene.

Why it fails:

  • References are optional by design; the model may skip them.
  • If the content is critical, it must be loaded reliably, which means SKILL.md.
  • "Rare" and "critical" together is usually a sign the skill is doing two jobs and should be two skills.

Cosmetic splits (Examples, Notes, Tips files)

Why it appears attractive:

  • Reduces visual clutter in SKILL.md.
  • Feels like good organization.

Why it fails:

  • No load condition: either always loaded (wasted tool call) or never loaded (dead content).
  • Implies an importance hierarchy that does not exist at runtime.
  • Frequently hides content from the model that needs it.

Pointer hygiene

When SKILL.md points at a reference, the pointer is the entire load contract. Rules:

  • Name the user-visible signal that triggers the load. "If the user mentions snapshot tests" not "for testing concerns".
  • One sentence per pointer. Do not summarize the reference content in SKILL.md.
  • Encode the load condition in the filename. go126-simd.md not advanced.md.
  • Top-of-file table of contents for any reference over 300 lines.
  • If two references are co-loaded in most runs, merge them.
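
Two of the rules above (filenames that encode the load condition, and a table of contents for long references) can be partly checked mechanically. A rough sketch follows; the set of "generic" names is an assumption chosen for illustration, and the helper only flags long files for review, since it cannot tell whether a table of contents is actually present:

```python
# Assumed list of names that say nothing about when to load the file.
GENERIC_STEMS = {"advanced", "misc", "notes", "tips", "examples", "other", "extras"}


def lint_references(line_counts: dict[str, int], toc_threshold: int = 300) -> list[str]:
    """Return warnings for reference files, given a filename -> line count map."""
    warnings = []
    for filename, lines in sorted(line_counts.items()):
        stem = filename.rsplit(".", 1)[0].lower()
        if stem in GENERIC_STEMS:
            warnings.append(f"{filename}: name does not encode a load condition")
        if lines > toc_threshold:
            warnings.append(
                f"{filename}: over {toc_threshold} lines, "
                "check for a top-of-file table of contents"
            )
    return warnings
```

For example, `lint_references({"advanced.md": 120, "go126-simd.md": 340})` flags the first file for its name and the second for its length.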

Pros of strict pointer hygiene:

  • Wrong-load rate drops sharply.
  • Filename encodes load condition, self-documenting for future maintainers.
  • Forces upfront clarity about when each reference is needed.
  • Makes architecture evals easier to interpret.

Cons of strict pointer hygiene:

  • Some content has no crisp trigger; rules force awkward formulations.
  • Filenames become long and awkward.
  • Requires discipline; easy to drift over time.
  • Can over-constrain useful loads when the trigger condition is genuinely fuzzy.

Use scripts/ before references/

For anything deterministic (formatting, validation, schema generation, file transforms, regex-heavy parsing), a script in scripts/ beats prose in references/.

Pros of scripts over reference prose:

  • Zero context cost for execution.
  • Deterministic, repeatable output.
  • Reusable across invocations without re-reading.
  • Can be unit tested independently.
  • Often faster than prose-driven generation by the model.

Cons of scripts:

  • Requires the runtime to support script execution; not all environments do.
  • Less flexible than letting the model reason over prose.
  • Harder to handle unanticipated edge cases without code changes.
  • Adds a maintenance burden: code in the skill needs to keep working.
  • Users cannot easily customize behavior without editing the script.
  • Failure modes are sharper: script errors stop the workflow.

Decision checklist

Before splitting any content out of SKILL.md, answer:

  1. Does this content have a sharp, observable load condition the model can detect from user input?
  2. Will splitting actually reduce context, accounting for the router prose added to SKILL.md?
  3. Is this reference data (lookup) or procedural (sequence)? Procedural content usually stays.
  4. Could a script handle this deterministically instead?
  5. Across realistic invocations, what fraction of runs would load this file? Below 20%, inline or delete — rarely-loaded references rarely justify the routing overhead. 20–80% is the split sweet spot. Above 80%, promote into SKILL.md — the routing cost exceeds the load savings.

If the answer to question 1 is unclear, do not split.

Evaluating skill architecture

Architecture evaluation is different from output evaluation. Output evals ask "did the skill produce the right thing?". Architecture evals ask "did the skill load the right files for the right reasons, at acceptable cost?". Same harness, different metrics. Run both. Output quality is the floor; architecture is optimization above that floor.

Pros of running architecture evals:

  • Catches dead references, dead SKILL.md sections, and mis-routed content.
  • Quantifies whether a split actually saved tokens or just looked clean.
  • Reveals real load patterns that intuition misses.
  • Forces the eval set to cover all declared paths, surfacing dead paths.
  • Compounds with output evals to catch regressions across both axes.

Cons of running architecture evals:

  • Requires harness setup beyond standard output evals.
  • Eval-set design for path coverage takes work.
  • Metrics need calibration per-skill (thresholds vary with cost profile).
  • Output evals are still required; this adds to total iteration cost.
  • Easy to over-optimize for token cost at the expense of output quality.

Eval set design for architecture

Output evals optimize for output quality across realistic queries. Architecture evals optimize for path coverage. The eval set must exercise every code path the skill claims to have, otherwise the metrics are noise.

Construct, at minimum:

  • One query per declared variant (if the skill uses variant-branch splits).
  • One query per edge-case branch (if depth-tier splits exist).
  • One query per major lookup category (if reference-data splits exist).
  • One query that should hit the common path only and load zero references.
  • 2-3 off-topic queries that should not trigger the skill at all (also tests the description).

If no realistic query triggers a given reference file, that file is dead. Inline it or delete it before running anything.
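
The coverage requirement can be checked before any run. A minimal sketch, assuming each eval query is annotated with the reference files it is expected to load; the `EvalQuery` type and its field names are hypothetical, not part of any existing harness:

```python
from dataclasses import dataclass, field


@dataclass
class EvalQuery:
    """One eval prompt with its expected routing (hypothetical shape)."""
    prompt: str
    expected_references: set[str] = field(default_factory=set)
    should_trigger: bool = True  # False for off-topic queries


def uncovered_references(declared: set[str], queries: list[EvalQuery]) -> set[str]:
    """Return declared reference files that no triggering query expects to load."""
    covered: set[str] = set()
    for query in queries:
        if query.should_trigger:
            covered |= query.expected_references
    return declared - covered


def has_common_path_query(queries: list[EvalQuery]) -> bool:
    """True if some triggering query is expected to load zero references."""
    return any(q.should_trigger and not q.expected_references for q in queries)
```

Any file returned by `uncovered_references` is a candidate for inlining or deletion under the rule above.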

Instrumentation

Each eval run is executed by a subagent with the skill loaded. Capture per run:

  1. Full transcript including every tool call.
  2. Which references/* files were read (parse view calls on paths inside the skill directory).
  3. Whether scripts/* were invoked.
  4. Total tokens and wall time.
  5. The output (for the parallel output-quality eval).

Persist as transcript.json and loads.json per run, alongside the standard output. The harness from skill-creator already records tokens and time in timing.json; extend its grading step to extract reference loads from transcripts.
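
Extracting reference loads from a transcript might look like the sketch below. The event shape (a list of dicts with `type`, `tool`, and `path` keys) is an assumption made for illustration; the real transcript format of the skill-creator harness may differ and the parsing would need adapting:

```python
from pathlib import PurePosixPath


def extract_reference_loads(events: list[dict], skill_dir: str) -> list[str]:
    """Return references/* paths read via `view`, in order, repeats kept.

    Assumes each event is a dict such as
    {"type": "tool_call", "tool": "view", "path": "my-skill/references/x.md"}.
    """
    root = PurePosixPath(skill_dir)
    loads = []
    for event in events:
        if event.get("type") != "tool_call" or event.get("tool") != "view":
            continue
        path = PurePosixPath(event.get("path", ""))
        try:
            relative = path.relative_to(root)
        except ValueError:
            continue  # path is outside the skill directory
        if relative.parts and relative.parts[0] == "references":
            loads.append(str(relative))
    return loads
```

Keeping repeats in the returned list lets a later step detect re-reads within one run; the list is what would be written to loads.json.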

Metrics per reference file

Across all eval runs, for each references/*.md:

  • Load rate: fraction of runs that read it.
  • Co-occurrence: for each other reference, fraction of runs that loaded both.
  • Use rate when loaded: of the runs that loaded it, did the content visibly inform the output (cited content, applied procedure, used schema)? Inspect transcripts.
  • Re-read rate: fraction of runs that loaded the same file twice.
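
Three of these metrics follow directly from the per-run load lists. A minimal sketch, assuming each run is stored as an ordered list of reference filenames with repeats kept (an assumed shape for loads.json). Use rate is left out because it requires reading transcripts:

```python
from collections import Counter
from itertools import combinations


def reference_metrics(runs: list[list[str]]) -> dict[str, dict]:
    """Compute per-reference load, re-read, and co-occurrence rates.

    Every rate uses the total number of runs as its denominator.
    """
    total = len(runs)
    if total == 0:
        return {"load_rate": {}, "re_read_rate": {}, "co_occurrence": {}}

    loaded = Counter()   # runs that read the file at least once
    re_read = Counter()  # runs that read the file two or more times
    pairs = Counter()    # runs that read both files of a pair
    for loads in runs:
        per_file = Counter(loads)
        for name, reads in per_file.items():
            loaded[name] += 1
            if reads >= 2:
                re_read[name] += 1
        for pair in combinations(sorted(per_file), 2):
            pairs[pair] += 1

    return {
        "load_rate": {n: c / total for n, c in loaded.items()},
        "re_read_rate": {n: re_read[n] / total for n in loaded},
        "co_occurrence": {p: c / total for p, c in pairs.items()},
    }
```

Pairs are keyed by a sorted tuple, so `("a.md", "b.md")` and `("b.md", "a.md")` count as the same pair.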

Metrics for the skill overall

  • Median and p95 tokens per invocation, with and without references.
  • SKILL.md utilization: read transcripts and identify sections of SKILL.md the model never references in any run. Strong candidates for deletion.
  • Path coverage: did every declared path get hit by at least one query?
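
The token figures can be summarized with the standard library. This sketch uses the nearest-rank definition of a percentile, which is one common choice and an assumption here; the source does not specify a method:

```python
import math
from statistics import median


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def token_summary(tokens_per_run: list[float]) -> dict[str, float]:
    """Median and p95 token counts across eval runs."""
    return {
        "median": median(tokens_per_run),
        "p95": percentile_nearest_rank(tokens_per_run, 95),
    }
```

One way to get the "with and without references" comparison is to call `token_summary` separately on the runs that loaded at least one reference and on the runs that loaded none.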

Decision rules

Observation | Action
Reference loaded in <20% of runs | Inline into SKILL.md or delete; routing overhead not justified
Reference loaded in 20–80% of runs | Leave split; the sweet spot, routing pays off
Reference loaded in >80% of runs | Promote into SKILL.md; always-load cost beats routing cost
Two references co-load in >70% of runs | Merge into one file
Reference loaded but not used in output | Fix or remove the pointer in SKILL.md
Reference re-read inside the same run | SKILL.md routing is unclear; clarify
No query triggers a reference | Delete the reference
SKILL.md section never referenced in any run | Delete that section

These thresholds are starting points. Tune them based on the cost profile: small references with cheap loads tolerate lower load rates than large ones.
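
The per-reference decision rules above can be expressed as a small function. This is a sketch, not a definitive policy: the thresholds are keyword arguments because they are starting points, and the function name, parameters, and action strings are illustrative:

```python
def recommend_actions(
    load_rate: float,
    max_co_load_rate: float = 0.0,
    used_when_loaded: bool = True,
    re_read: bool = False,
    *,
    low: float = 0.20,
    high: float = 0.80,
    merge_above: float = 0.70,
) -> list[str]:
    """Map observed metrics for one reference file to suggested actions."""
    actions = []
    if load_rate == 0:
        actions.append("delete the reference")
    elif load_rate < low:
        actions.append("inline into SKILL.md or delete")
    elif load_rate > high:
        actions.append("promote into SKILL.md")
    else:
        actions.append("leave split")

    if max_co_load_rate > merge_above:
        actions.append("merge with the co-loaded reference")
    if load_rate > 0 and not used_when_loaded:
        actions.append("fix or remove the pointer in SKILL.md")
    if re_read:
        actions.append("clarify SKILL.md routing")
    return actions
```

For a small reference that is cheap to load, passing a lower `low` value reflects the tuning advice above.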

Comparing two architectures

When choosing between architectures (monolithic vs. split, or split A vs. split B):

  1. Run the identical eval set against both versions.
  2. Run output-quality evals on both. Confirm no regression. If quality drops, the architecture change is a loss regardless of token savings.
  3. Compare median tokens, p95 tokens, and median time per run.
  4. Compare path coverage: does each version reliably reach the same outputs through the expected paths?

A split that saves 15% tokens but adds variance in output quality is worse than the monolith. Reliability beats efficiency.
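
The ordering of these checks (quality first, tokens second) can be sketched as follows. Summarizing output quality as a single pass rate is an assumption; it does not capture variance across runs, which still needs a look at the individual results. All names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class ArchitectureResult:
    """Summary of one architecture over the shared eval set (hypothetical shape)."""
    name: str
    output_pass_rate: float  # from the output-quality eval
    median_tokens: float
    p95_tokens: float


def prefer(
    baseline: ArchitectureResult,
    candidate: ArchitectureResult,
    quality_tolerance: float = 0.0,
) -> str:
    """Return the name of the preferred architecture."""
    # A quality regression loses regardless of token savings.
    if candidate.output_pass_rate < baseline.output_pass_rate - quality_tolerance:
        return baseline.name
    # Only then do token costs decide; ties stay with the baseline.
    cheaper = (
        candidate.median_tokens < baseline.median_tokens
        and candidate.p95_tokens <= baseline.p95_tokens
    )
    return candidate.name if cheaper else baseline.name
```

Defaulting to the baseline on ties keeps the simpler existing layout unless the candidate shows a clear gain.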

What the metrics will not tell you

  • Whether the SKILL.md prose is clear. Read transcripts for confused tool calls and dead-end attempts.
  • Whether the description triggers correctly. That is a separate eval (use run_loop.py from the skill-creator skill).
  • Whether content placement matches user mental models. Subjective; review with a human.

The split that looked clean at design time rarely matches real load patterns. Trust the transcripts over your intuitions.

Output when advising

When asked to advise on a specific skill's organization:

  1. Diagnose first. Is this a triggering question or a disclosure question?
  2. Quote relevant content from the existing SKILL.md (or the user's description of it) before recommending.
  3. Propose the minimum viable split. Resist splitting into more files than necessary.
  4. For each proposed reference file, write the exact pointer sentence that would go in SKILL.md.
  5. Surface the trade-offs explicitly. Use the pros/cons in this skill as the model for how to present a recommendation.
  6. If unsure whether a split helps, recommend instrumentation (eval the skill, read transcripts) before committing.
