doctree-mcp

BM25 search + tree navigation over markdown docs for AI agents. No embeddings, no LLM calls at index time.

Agentic document retrieval over markdown, CSV, and JSONL. BM25 + tree navigation via MCP — no vector DB, no embeddings, no LLM calls at index time.

The pitch: MCP provides the structural primitives (a navigable tree, BM25, glossary, row lookup). The bundled skills provide the procedural knowledge (how to walk that tree). Together the agent behaves like a trained research librarian — not a one-shot searcher. See The Skill + MCP Pattern.


Quick Start

Have docs already? Point a client at them:

# In your AI tool's MCP config — see docs/CLIENTS.md for per-tool snippets
{ "mcpServers": { "doctree": {
    "command": "bunx", "args": ["doctree-mcp"],
    "env": { "DOCS_ROOT": "./docs", "WIKI_WRITE": "1" }
} } }

Restart the tool → ask "search the docs for X" or invoke the doc-read prompt.

Starting fresh? Scaffold a Karpathy-style LLM wiki:

bunx doctree-mcp init          # configure current tool
bunx doctree-mcp init --all    # configure every supported client
bunx doctree-mcp init --dry-run

Creates docs/wiki/ (LLM-maintained) and docs/raw-sources/ (your inputs), writes the MCP config, installs a post-write lint hook, and appends wiki conventions to CLAUDE.md / AGENTS.md / .cursor/rules/.


Operation Modes

| Mode | Use when | Guide |
|---|---|---|
| stdio (default) | Local dev, agent on your machine | Client setup |
| HTTP (Streamable HTTP) | Teams, CI, hosted agents | Deployment (Railway · Fly · Render · Cloudflare Containers · Docker) |
| CLI | init, lint, debug-index | Operation modes |

Full decision tree: Operation Modes.


How It Works — Retrieve · Curate · Add

Agent: "How does token refresh work?"

→ search_documents("token refresh")
  #1  auth/middleware.md § Token Refresh Flow       score: 12.4
  #2  auth/oauth.md       § Refresh Token Lifecycle  score: 8.7

→ get_tree("docs:auth:middleware")
  [n1] # Auth Middleware
    [n4] ## Token Refresh Flow
      [n5] ### Automatic Refresh

→ navigate_tree("docs:auth:middleware", "n4")   ← n4 + descendants
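
The last step of the trace returns a node together with everything beneath it. A minimal sketch of that behaviour, assuming a hypothetical in-memory `DocNode` shape (the names `DocNode`, `findNode`, and `subtree` are illustrative and are not doctree-mcp's actual data model or API):

```typescript
// Hypothetical node shape for illustration; doctree-mcp's real types may differ.
interface DocNode {
  id: string;       // e.g. "n4"
  heading: string;  // e.g. "## Token Refresh Flow"
  children: DocNode[];
}

// Depth-first search for a node by ID.
function findNode(root: DocNode, id: string): DocNode | undefined {
  if (root.id === id) return root;
  for (const child of root.children) {
    const hit = findNode(child, id);
    if (hit) return hit;
  }
  return undefined;
}

// Collect a node and all of its descendants in document order.
function subtree(node: DocNode): DocNode[] {
  const out: DocNode[] = [node];
  for (const child of node.children) {
    out.push(...subtree(child));
  }
  return out;
}

// The outline from the trace above.
const doc: DocNode = {
  id: "n1",
  heading: "# Auth Middleware",
  children: [
    {
      id: "n4",
      heading: "## Token Refresh Flow",
      children: [{ id: "n5", heading: "### Automatic Refresh", children: [] }],
    },
  ],
};

const start = findNode(doc, "n4");
const ids = start ? subtree(start).map((n) => n.id) : [];
console.log(ids); // n4 followed by its descendant n5
```

The point of the shape is that the agent asks for one node ID and receives a bounded slice of the document, rather than the whole file.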

Core read tools (always on):

| Tool | Purpose |
|---|---|
| search_documents | BM25 keyword search + facet filters + glossary expansion (markdown · CSV · JSONL) |
| get_tree | Table of contents: headings, word counts, summaries |
| get_node_content | Full text of a specific section by node ID |
| navigate_tree | A section plus all descendants in one call |
| lookup_row | O(1) exact-key lookup for structured data rows (e.g. PROJ-44) |
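
For readers unfamiliar with BM25, the ranking idea behind `search_documents` can be sketched with the textbook Okapi BM25 formula. This is a generic illustration, not doctree-mcp's code: the tokenizer, the `K1`/`B` values (common defaults), and the function names are all assumptions, and the real index may add field boosts or other tuning.

```typescript
// Textbook Okapi BM25 over simple alphanumeric tokens.
// K1 and B are common defaults, not values taken from doctree-mcp.
const K1 = 1.2;
const B = 0.75;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

function bm25Scores(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(docs.length, 1);
  const terms = tokenize(query);

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      // Term frequency in this document.
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Document frequency: how many documents contain the term at all.
      const df = tokenized.filter((d) => d.indexOf(term) >= 0).length;
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      // Length normalisation: long documents need more hits to score the same.
      const norm = tf + K1 * (1 - B + B * (doc.length / avgLen));
      score += idf * ((tf * (K1 + 1)) / norm);
    }
    return score;
  });
}

const scores = bm25Scores("token refresh", [
  "Token refresh flow: the middleware renews the access token automatically",
  "Refresh token lifecycle and rotation",
  "Deploying the service with Docker",
]);
console.log(scores);
```

Because the score is a sum of per-term contributions, a result can be explained term by term, which is part of what makes the retrieval trail auditable.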

Wiki write tools (opt-in with WIKI_WRITE=1):

| Tool | Purpose |
|---|---|
| find_similar | Duplicate detection with overlap ratios |
| draft_wiki_entry | Scaffold: suggested path, inferred frontmatter, glossary hits |
| write_wiki_entry | Validated write: path containment, schema, duplicate guards, dry-run |
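
One plausible reading of "overlap ratio" is the share of a draft's distinct terms that already appear in an existing document. The sketch below illustrates that reading only; how `find_similar` actually computes its ratios is not stated here, and `tokenSet` and `overlapRatio` are hypothetical names.

```typescript
// Illustrative only: token-set overlap as one possible "overlap ratio".
function tokenSet(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
  );
}

// Fraction of the draft's distinct tokens that also occur in the existing doc.
function overlapRatio(draft: string, existing: string): number {
  const draftTokens = tokenSet(draft);
  const existingTokens = tokenSet(existing);
  if (draftTokens.size === 0) return 0;
  let shared = 0;
  draftTokens.forEach((t) => {
    if (existingTokens.has(t)) shared++;
  });
  return shared / draftTokens.size;
}

const ratio = overlapRatio(
  "rotate refresh tokens daily",
  "refresh tokens rotate every hour",
);
console.log(ratio); // 3 of the draft's 4 distinct tokens are shared
```

A write guard could then refuse or warn when the ratio against any existing entry exceeds a chosen threshold.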

Safety: path containment · frontmatter validation · duplicate detection · dry-run · overwrite protection.
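
Path containment means a write may never resolve to a location outside the docs root. A minimal lexical sketch of such a check follows; `staysInsideRoot` is a hypothetical helper, it assumes POSIX-style separators, and a production check would also need to consider symlinks and Windows paths.

```typescript
// Returns true when `relativePath`, resolved against a root directory,
// stays inside that root. Purely lexical: it never touches the filesystem.
function staysInsideRoot(relativePath: string): boolean {
  // Absolute paths ignore the root entirely, so reject them outright.
  if (relativePath.charAt(0) === "/") return false;

  let depth = 0; // how many directories below the root we currently are
  for (const segment of relativePath.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      depth--;
      if (depth < 0) return false; // climbed above the root
    } else {
      depth++;
    }
  }
  return true;
}

console.log(staysInsideRoot("auth/token-refresh.md"));  // stays inside
console.log(staysInsideRoot("auth/../../etc/passwd"));  // escapes the root
```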

Deprecated aliases (list_documents, find_files, find_symbol) are superseded by search_documents — still functional, no longer recommended.


The Skill + MCP Pattern

Most retrieval tools hand the agent a search box and hope for the best. doctree-mcp hands it a tree, and the bundled skills teach it how to walk one.

  • MCP = structural primitives. search_documents, get_tree, navigate_tree, get_node_content, lookup_row return tree positions the agent reasons over — not finished answers.
  • Skills = procedural knowledge. /doc-read, /doc-write, /doc-lint encode breadcrumb drill-down: search → outline → navigate → retrieve. The agent learns the policy, not just the API.

That pairing doesn't exist cleanly elsewhere:

| Approach | Primitive | Skill teaches | Gap |
|---|---|---|---|
| Managed hybrid RAG (Cloudflare AI Search, Nia) | Flat chunks + similarity | | Black-box score, no audit trail |
| Tool-returns-answer (Context7) | 2 tools returning answers | Query shape | Agent can't reason about skipped content |
| Skill-over-CLI (QMD) | CLI over flat search | Query expansion | No tree to navigate |
| doctree-mcp + /doc-read | Navigable tree | Breadcrumbs, multi-instance routing, wiki compilation | |

Why iterative retrieval wins:

  • Context rot. Stuffing a 1M-token window with chunks degrades output. Breadcrumb navigation keeps working memory small.
  • Auditability. search_documents → get_tree → navigate_tree → get_node_content is a replayable trail. A cosine score is not. Regulated domains can ship the former.
  • Progressive disclosure. Fewer navigable primitives beat tool sprawl (cf. Cloudflare Code Mode).

Multi-instance = client-side federation. Register several doctree servers under different names; the /doc-read skill encodes the routing policy. Add or remove instances without touching the skill. See Client setup → Multi-instance routing.
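
In doctree-mcp the routing policy lives in the skill's instructions rather than in code, but the idea can be made concrete as a small routing table. Everything in this sketch is hypothetical: the instance names, the topic keywords, and the `route` function are invented for illustration.

```typescript
// Hypothetical routing table: names and topics are made up for illustration.
interface Instance {
  name: string;      // the key the server is registered under in the client config
  topics: string[];  // keywords this collection is known to cover
}

// Pick the instances whose topics appear in the question.
// If nothing matches, fall back to querying every instance.
function route(question: string, instances: Instance[]): string[] {
  const q = question.toLowerCase();
  const matched = instances.filter((inst) =>
    inst.topics.some((topic) => q.indexOf(topic.toLowerCase()) >= 0),
  );
  return (matched.length > 0 ? matched : instances).map((inst) => inst.name);
}

const instances: Instance[] = [
  { name: "doctree-wiki", topics: ["runbook", "auth", "deploy"] },
  { name: "doctree-rfcs", topics: ["rfc", "proposal", "design"] },
];

console.log(route("Is there a proposal for caching?", instances));
console.log(route("How do I rotate logs?", instances));
```

Because the table is data, adding or removing an instance changes only the list, which mirrors the claim that instances can change without touching the skill.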


The LLM Wiki Pattern

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Raw Sources    │     │  The Wiki        │     │  The Schema     │
│  (immutable)    │ ──→ │  (LLM-maintained)│ ←── │  (you define)   │
│  notes · logs   │     │  runbooks · refs │     │  CLAUDE.md rules │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Inspired by Karpathy's LLM Wiki. Full walkthrough: docs/LLM-WIKI-GUIDE.md.


Configuration (summary)

Frontmatter: YAML at the top of each markdown doc.

---
title: "Descriptive Title"
description: "One-line summary — boosts ranking"
tags: [relevant, terms]
type: runbook          # runbook | guide | reference | tutorial | architecture | adr
category: auth
---

All non-reserved frontmatter fields become filter facets:

search_documents("auth", filters: { type: "runbook", tags: ["production"] })
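
A facet filter like the one above can be thought of as a predicate over each document's frontmatter. The sketch below shows one way such matching could work; the types and `matchesFilters` are hypothetical, and treating an array filter as "all listed values must be present" is an assumption (the real tool may use "any" semantics).

```typescript
type FacetValue = string | string[];
type Frontmatter = Record<string, FacetValue>;

// Normalise a scalar or list facet into a list; a missing facet becomes [].
function toList(value: FacetValue | undefined): string[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// A doc matches when every filter key is satisfied.
// Assumption: for array filters, every listed value must be present.
function matchesFilters(doc: Frontmatter, filters: Frontmatter): boolean {
  return Object.keys(filters).every((key) => {
    const have = toList(doc[key]);
    return toList(filters[key]).every((wanted) => have.indexOf(wanted) >= 0);
  });
}

const runbook: Frontmatter = {
  type: "runbook",
  category: "auth",
  tags: ["production", "auth"],
};

console.log(matchesFilters(runbook, { type: "runbook", tags: ["production"] }));
console.log(matchesFilters(runbook, { type: "guide" }));
```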

Common env vars:

| Variable | Default | Description |
|---|---|---|
| DOCS_ROOT | ./docs | Docs folder |
| DOCS_GLOB | **/*.md | Comma-separated globs (**/*.md,**/*.csv,**/*.jsonl) |
| DOCS_ROOTS | | Weighted multi-collection (./wiki:1.0,./rfcs:0.5) |
| PORT | 3100 | HTTP mode port |
| WIKI_WRITE | (unset) | 1 enables write tools |
| GLOSSARY_PATH | $DOCS_ROOT/glossary.json | Query-expansion glossary |
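
The `DOCS_ROOTS` format shown above (`path:weight` pairs separated by commas) can be parsed along these lines. This is a sketch of the format only: `parseDocsRoots` is a hypothetical helper, the fallback weight of 1.0 for entries without a weight is an assumption, and paths containing a drive-letter colon are not handled.

```typescript
interface WeightedRoot {
  path: string;
  weight: number;
}

// Parse "path:weight,path:weight".
// Assumption: a missing or non-numeric weight falls back to 1.0.
function parseDocsRoots(value: string): WeightedRoot[] {
  const roots: WeightedRoot[] = [];
  for (const raw of value.split(",")) {
    const entry = raw.trim();
    if (entry.length === 0) continue;
    const sep = entry.lastIndexOf(":");
    const weightText = sep > 0 ? entry.slice(sep + 1) : "";
    const weight = Number(weightText);
    if (weightText !== "" && isFinite(weight)) {
      roots.push({ path: entry.slice(0, sep), weight });
    } else {
      roots.push({ path: entry, weight: 1.0 });
    }
  }
  return roots;
}

const roots = parseDocsRoots("./wiki:1.0,./rfcs:0.5");
console.log(roots);
```

A weight below 1.0 would presumably down-rank hits from that collection relative to the others, so that RFC drafts do not crowd out curated wiki pages.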

Full reference: docs/CONFIGURATION.md.

Glossary — place glossary.json in docs root for bidirectional query expansion:

{ "CLI": ["command line interface"], "K8s": ["kubernetes"] }

Acronym definitions like "TLS (Transport Layer Security)" are also auto-extracted.
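
Bidirectional expansion means a query for the short form also searches the long form, and the reverse. The sketch below shows both that lookup and a simple pattern for pulling "ACRONYM (Long Form)" pairs out of prose. All function names are hypothetical, matching is done on whitespace-separated words only, and the regular expression is a guess at what auto-extraction could look like rather than the project's actual rule.

```typescript
type Glossary = Record<string, string[]>;

// Build a two-way table: short form -> long forms and long form -> short form.
function buildExpansions(glossary: Glossary): Map<string, string[]> {
  const table = new Map<string, string[]>();
  const link = (from: string, to: string): void => {
    const key = from.toLowerCase();
    const list = table.get(key) ?? [];
    if (list.indexOf(to.toLowerCase()) < 0) list.push(to.toLowerCase());
    table.set(key, list);
  };
  for (const short of Object.keys(glossary)) {
    for (const long of glossary[short] ?? []) {
      link(short, long);
      link(long, short);
    }
  }
  return table;
}

// Return the original query followed by any expansions of terms it contains.
function expandQuery(query: string, table: Map<string, string[]>): string[] {
  const padded = " " + query.toLowerCase() + " ";
  const extra: string[] = [];
  table.forEach((targets, source) => {
    if (padded.indexOf(" " + source + " ") >= 0) extra.push(...targets);
  });
  return [query, ...extra];
}

// Extract pairs such as "TLS (Transport Layer Security)" from running text.
function extractAcronyms(text: string): Glossary {
  const found: Glossary = {};
  const pattern = /\b([A-Z][A-Z0-9]{1,9})\s+\(([^)]+)\)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    const short = m[1];
    const long = m[2];
    if (short && long) found[short] = [long.toLowerCase()];
  }
  return found;
}

const table = buildExpansions({
  CLI: ["command line interface"],
  K8s: ["kubernetes"],
});
console.log(expandQuery("k8s deploy", table));
console.log(extractAcronyms("Use TLS (Transport Layer Security) everywhere."));
```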

Structured data — CSV/JSONL files become documents where each row is a tree node. Column roles (id, title, description, facets, URL) are auto-detected from headers. See docs/STRUCTURED-DATA.md.
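
To make the row-as-node idea concrete, the sketch below guesses a role for each column from its header and builds a keyed index so a row can be fetched by exact ID in constant time. The header heuristics, the `Role` names, and both functions are hypothetical; see docs/STRUCTURED-DATA.md for how detection actually behaves.

```typescript
type Role = "id" | "title" | "description" | "url" | "facet";
type Row = Record<string, string>;

// Hypothetical header heuristics, invented for this example.
function detectRole(header: string): Role {
  const h = header.trim().toLowerCase();
  if (/^(id|key|.*_id)$/.test(h)) return "id";
  if (/^(title|name|summary)$/.test(h)) return "title";
  if (/^(description|body|details)$/.test(h)) return "description";
  if (/(url|link|href)$/.test(h)) return "url";
  return "facet"; // anything else is treated as a filterable facet
}

// Index rows by the first column detected as an ID, for exact-key lookup.
function indexRows(headers: string[], rows: Row[]): Map<string, Row> {
  const index = new Map<string, Row>();
  const idColumn = headers.filter((h) => detectRole(h) === "id")[0];
  if (idColumn === undefined) return index;
  for (const row of rows) {
    const key = row[idColumn];
    if (key !== undefined) index.set(key, row);
  }
  return index;
}

const headers = ["id", "title", "status", "url"];
const index = indexRows(headers, [
  { id: "PROJ-43", title: "Add rate limiting", status: "open", url: "https://example.com/43" },
  { id: "PROJ-44", title: "Fix token refresh", status: "done", url: "https://example.com/44" },
]);

console.log(index.get("PROJ-44"));
```

A hash-map index is what makes an exact-key lookup independent of the number of rows, in contrast to a ranked search over the same data.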


Running from Source

git clone https://github.com/joesaby/doctree-mcp.git
cd doctree-mcp && bun install

DOCS_ROOT=./docs bun run serve          # stdio
DOCS_ROOT=./docs bun run serve:http     # HTTP (port 3100)
DOCS_ROOT=./docs bun run index          # CLI: inspect indexed output
bun test

Performance

| Operation | Time | Token cost |
|---|---|---|
| Full index (900 docs) | 2–5s | 0 |
| Incremental re-index | ~50ms | 0 |
| Search | 5–30ms | ~300–1K tokens |
| Tree outline | <1ms | ~200–800 tokens |

Docs

Setup & operation

  • Operation Modes — stdio · HTTP · CLI
  • Client Setup — Claude Code · Cursor · Windsurf · Codex · OpenCode · Claude Desktop
  • Deployment — Railway · Fly.io · Render · Cloudflare Containers · Docker
  • Configuration — env vars, frontmatter, ranking tuning

Patterns & concepts

Source


Standing on Shoulders

License

MIT
