doctree-mcp
BM25 search + tree navigation over markdown docs for AI agents. No embeddings, no LLM calls at index time.
doctree-mcp
Agentic document retrieval over markdown, CSV, and JSONL. BM25 + tree navigation via MCP — no vector DB, no embeddings, no LLM calls at index time.
The pitch: MCP provides the structural primitives (a navigable tree, BM25, glossary, row lookup). The bundled skills provide the procedural knowledge (how to walk that tree). Together the agent behaves like a trained research librarian — not a one-shot searcher. See The Skill + MCP Pattern.
Quick Start
Have docs already? Point a client at them:
# In your AI tool's MCP config — see docs/CLIENTS.md for per-tool snippets
{ "mcpServers": { "doctree": {
"command": "bunx", "args": ["doctree-mcp"],
"env": { "DOCS_ROOT": "./docs", "WIKI_WRITE": "1" }
} } }
Restart the tool → ask "search the docs for X" or invoke the doc-read prompt.
Starting fresh? Scaffold a Karpathy-style LLM wiki:
bunx doctree-mcp init # configure current tool
bunx doctree-mcp init --all # configure every supported client
bunx doctree-mcp init --dry-run
Creates docs/wiki/ (LLM-maintained) + docs/raw-sources/ (your inputs), writes the MCP config, installs a post-write lint hook, appends wiki conventions to CLAUDE.md / AGENTS.md / .cursor/rules/.
Operation Modes
| Mode | Use when | Guide |
|---|---|---|
| stdio (default) | Local dev, agent on your machine | Client setup |
| HTTP (Streamable HTTP) | Teams, CI, hosted agents | Deployment — Railway · Fly · Render · Cloudflare Containers · Docker |
| CLI | init, lint, debug-index | Operation modes |
Full decision tree: Operation Modes.
How It Works — Retrieve · Curate · Add
Agent: "How does token refresh work?"
→ search_documents("token refresh")
#1 auth/middleware.md § Token Refresh Flow score: 12.4
#2 auth/oauth.md § Refresh Token Lifecycle score: 8.7
→ get_tree("docs:auth:middleware")
[n1] # Auth Middleware
[n4] ## Token Refresh Flow
[n5] ### Automatic Refresh
→ navigate_tree("docs:auth:middleware", "n4") ← n4 + descendants
Core read tools (always on):
| Tool | Purpose |
|---|---|
search_documents | BM25 keyword search + facet filters + glossary expansion (markdown · CSV · JSONL) |
get_tree | Table of contents — headings, word counts, summaries |
get_node_content | Full text of a specific section by node ID |
navigate_tree | A section plus all descendants in one call |
lookup_row | O(1) exact-key lookup for structured data rows (e.g. PROJ-44) |
Wiki write tools (opt-in with WIKI_WRITE=1):
| Tool | Purpose |
|---|---|
find_similar | Duplicate detection with overlap ratios |
draft_wiki_entry | Scaffold: suggested path, inferred frontmatter, glossary hits |
write_wiki_entry | Validated write: path containment, schema, duplicate guards, dry-run |
Safety: path containment · frontmatter validation · duplicate detection · dry-run · overwrite protection.
Deprecated aliases (list_documents, find_files, find_symbol) are superseded by search_documents — still functional, no longer recommended.
The Skill + MCP Pattern
Most retrieval tools hand the agent a search box and hope for the best. doctree-mcp hands it a tree, and the bundled skills teach it how to walk one.
- MCP = structural primitives.
search_documents,get_tree,navigate_tree,get_node_content,lookup_rowreturn tree positions the agent reasons over — not finished answers. - Skills = procedural knowledge.
/doc-read,/doc-write,/doc-lintencode breadcrumb drill-down: search → outline → navigate → retrieve. The agent learns the policy, not just the API.
That pairing doesn't exist cleanly elsewhere:
| Approach | Primitive | Skill teaches | Gap |
|---|---|---|---|
| Managed hybrid RAG (Cloudflare AI Search, Nia) | Flat chunks + similarity | — | Black-box score, no audit trail |
| Tool-returns-answer (Context7) | 2 tools returning answers | Query shape | Agent can't reason about skipped content |
| Skill-over-CLI (QMD) | CLI over flat search | Query expansion | No tree to navigate |
doctree-mcp + /doc-read | Navigable tree | Breadcrumbs, multi-instance routing, wiki compilation | — |
Why iterative retrieval wins:
- Context rot. Stuffing a 1M-token window with chunks degrades output. Breadcrumb navigation keeps working memory small.
- Auditability.
search_documents → get_tree → navigate_tree → get_node_contentis a replayable trail. A cosine score is not. Regulated domains can ship the former. - Progressive disclosure. Fewer navigable primitives beat tool sprawl (cf. Cloudflare Code Mode).
Multi-instance = client-side federation. Register several doctree servers under different names; the /doc-read skill encodes the routing policy. Add or remove instances without touching the skill. See Client setup → Multi-instance routing.
The LLM Wiki Pattern
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Raw Sources │ │ The Wiki │ │ The Schema │
│ (immutable) │ ──→ │ (LLM-maintained)│ ←── │ (you define) │
│ notes · logs │ │ runbooks · refs │ │ CLAUDE.md rules │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Inspired by Karpathy's LLM Wiki. Full walkthrough: docs/LLM-WIKI-GUIDE.md.
Configuration (summary)
---
title: "Descriptive Title"
description: "One-line summary — boosts ranking"
tags: [relevant, terms]
type: runbook # runbook | guide | reference | tutorial | architecture | adr
category: auth
---
All non-reserved frontmatter fields become filter facets:
search_documents("auth", filters: { type: "runbook", tags: ["production"] })
Common env vars:
| Variable | Default | Description |
|---|---|---|
DOCS_ROOT | ./docs | Docs folder |
DOCS_GLOB | **/*.md | Comma-separated globs (**/*.md,**/*.csv,**/*.jsonl) |
DOCS_ROOTS | — | Weighted multi-collection (./wiki:1.0,./rfcs:0.5) |
PORT | 3100 | HTTP mode port |
WIKI_WRITE | (unset) | 1 enables write tools |
GLOSSARY_PATH | $DOCS_ROOT/glossary.json | Query-expansion glossary |
Full reference: docs/CONFIGURATION.md.
Glossary — place glossary.json in docs root for bidirectional query expansion:
{ "CLI": ["command line interface"], "K8s": ["kubernetes"] }
Acronym definitions like "TLS (Transport Layer Security)" are also auto-extracted.
Structured data — CSV/JSONL files become documents where each row is a tree node. Column roles (id, title, description, facets, URL) are auto-detected from headers. See docs/STRUCTURED-DATA.md.
Running from Source
git clone https://github.com/joesaby/doctree-mcp.git
cd doctree-mcp && bun install
DOCS_ROOT=./docs bun run serve # stdio
DOCS_ROOT=./docs bun run serve:http # HTTP (port 3100)
DOCS_ROOT=./docs bun run index # CLI: inspect indexed output
bun test
Performance
| Operation | Time | Token cost |
|---|---|---|
| Full index (900 docs) | 2–5s | 0 |
| Incremental re-index | ~50ms | 0 |
| Search | 5–30ms | ~300–1K tokens |
| Tree outline | <1ms | ~200–800 tokens |
Docs
Setup & operation
- Operation Modes — stdio · HTTP · CLI
- Client Setup — Claude Code · Cursor · Windsurf · Codex · OpenCode · Claude Desktop
- Deployment — Railway · Fly.io · Render · Cloudflare Containers · Docker
- Configuration — env vars, frontmatter, ranking tuning
Patterns & concepts
- LLM Wiki Guide — agent-maintained knowledge base walkthrough
- Structured Data — CSV / JSONL indexing
- Architecture & Design — BM25 internals, tree navigation
- Competitive Analysis — PageIndex, QMD, GitMCP, Context7, managed RAG
Source
- Prompts — MCP prompt templates
- Skills:
/doc-read·/doc-write·/doc-lint
Standing on Shoulders
- PageIndex — hierarchical tree navigation
- Pagefind by CloudCannon — BM25 scoring, positional index, facets
- Bun.markdown by Oven — native CommonMark parser
- Karpathy's LLM Wiki — the LLM-maintained wiki pattern
License
MIT
İlgili Sunucular
Tavily
Search engine for AI agents (search + extract) powered by Tavily
招投标大数据服务
Query comprehensive enterprise information from e-commerce platforms, including store details, sales data, and product statistics.
RAG Documentation
Retrieve and process documentation using vector search to provide context for AI assistants.
AgentBridge
A specialized gateway for AI agents to fetch and parse Chinese web content into clean Markdown. Pay-per-fetch via x402.
CarsXE
A powerful, easy-to-use API for vehicle data including specs, market value, license plate decoding, and more.
Obsidian Omnisearch
Search your Obsidian vault using the Omnisearch plugin via a REST API.
AWS Documentation
Fetch, convert, and search AWS documentation pages, with recommendations for related content.
EntRoute MCP
MCP Server for AI agents to discover and call pay-per-request APIs via EntRoute. Give Claude, Cursor, Windsurf, or any MCP-compatible agent access to 350+ verified x402 API endpoints across 110+ capabilities — DeFi prices, web search, prediction markets, news, and more.
Sci-Hub MCP Server
Search and access academic papers from Sci-Hub by DOI, title, or keyword.
Perplexity
Interacting with Perplexity