doctree-mcp
BM25 search + tree navigation over markdown docs for AI agents. No embeddings, no LLM calls at index time.
Agentic document retrieval over markdown, CSV, and JSONL. BM25 + tree navigation via MCP — no vector DB, no embeddings, no LLM calls at index time.
The pitch: MCP provides the structural primitives (a navigable tree, BM25, glossary, row lookup). The bundled skills provide the procedural knowledge (how to walk that tree). Together the agent behaves like a trained research librarian — not a one-shot searcher. See The Skill + MCP Pattern.
Quick Start
Have docs already? Point a client at them:
In your AI tool's MCP config (see docs/CLIENTS.md for per-tool snippets):
{
  "mcpServers": {
    "doctree": {
      "command": "bunx",
      "args": ["doctree-mcp"],
      "env": { "DOCS_ROOT": "./docs", "WIKI_WRITE": "1" }
    }
  }
}
Restart the tool → ask "search the docs for X" or invoke the doc-read prompt.
Starting fresh? Scaffold a Karpathy-style LLM wiki:
bunx doctree-mcp init # configure current tool
bunx doctree-mcp init --all # configure every supported client
bunx doctree-mcp init --dry-run # preview the changes without writing them
Creates docs/wiki/ (LLM-maintained) + docs/raw-sources/ (your inputs), writes the MCP config, installs a post-write lint hook, appends wiki conventions to CLAUDE.md / AGENTS.md / .cursor/rules/.
Operation Modes
| Mode | Use when | Guide |
|---|---|---|
| stdio (default) | Local dev, agent on your machine | Client setup |
| HTTP (Streamable HTTP) | Teams, CI, hosted agents | Deployment — Railway · Fly · Render · Cloudflare Containers · Docker |
| CLI | init, lint, debug-index | Operation modes |
Full decision tree: Operation Modes.
How It Works — Retrieve · Curate · Add
Agent: "How does token refresh work?"
→ search_documents("token refresh")
#1 auth/middleware.md § Token Refresh Flow score: 12.4
#2 auth/oauth.md § Refresh Token Lifecycle score: 8.7
→ get_tree("docs:auth:middleware")
[n1] # Auth Middleware
[n4] ## Token Refresh Flow
[n5] ### Automatic Refresh
→ navigate_tree("docs:auth:middleware", "n4") ← n4 + descendants
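The "section plus descendants" step in the trace above can be sketched over a small in-memory outline. This is an illustration only, not doctree-mcp's implementation: the `TreeNode` shape and the `subtree` helper are hypothetical, and nodes `n2` and `n6` are invented to make the example non-trivial.

```typescript
// Hypothetical node shape; doctree-mcp's real index format may differ.
interface TreeNode {
  id: string;
  level: number; // heading depth: 1 for "#", 2 for "##", ...
  title: string;
}

// Headings in document order. n1, n4 and n5 come from the trace above;
// n2 and n6 are invented for illustration.
const outline: TreeNode[] = [
  { id: "n1", level: 1, title: "Auth Middleware" },
  { id: "n2", level: 2, title: "Overview" },
  { id: "n4", level: 2, title: "Token Refresh Flow" },
  { id: "n5", level: 3, title: "Automatic Refresh" },
  { id: "n6", level: 2, title: "Error Handling" },
];

// Return a node plus all of its descendants: every following heading that
// is deeper than the start node, stopping at the next sibling or ancestor.
function subtree(nodes: TreeNode[], id: string): TreeNode[] {
  const start = nodes.findIndex((n) => n.id === id);
  if (start === -1) return [];
  const result = [nodes[start]];
  for (let i = start + 1; i < nodes.length; i++) {
    if (nodes[i].level <= nodes[start].level) break;
    result.push(nodes[i]);
  }
  return result;
}

const ids = subtree(outline, "n4").map((n) => n.id);
console.log(ids.join(", ")); // n4, n5
```

Storing headings in document order with a depth field keeps the subtree lookup to a single linear scan, with no parent pointers needed.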
Core read tools (always on):
| Tool | Purpose |
|---|---|
| search_documents | BM25 keyword search + facet filters + glossary expansion (markdown · CSV · JSONL) |
| get_tree | Table of contents: headings, word counts, summaries |
| get_node_content | Full text of a specific section by node ID |
| navigate_tree | A section plus all descendants in one call |
| lookup_row | O(1) exact-key lookup for structured data rows (e.g. PROJ-44) |
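For readers unfamiliar with BM25, the scoring behind a keyword search tool like search_documents can be sketched as follows. This is the textbook Okapi BM25 formula with the conventional defaults `k1 = 1.2` and `b = 0.75`. The tokenizer and the parameter values are assumptions, not doctree-mcp's actual settings, and the sketch omits facets, glossary expansion, and positional data.

```typescript
// Minimal Okapi BM25 over an in-memory corpus. K1 and B are the
// conventional defaults, not values taken from doctree-mcp.
const K1 = 1.2;
const B = 0.75;

// Naive tokenizer (an assumption): lowercase ASCII alphanumeric runs.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(docs.length, 1);

  // Document frequency: how many documents contain each term.
  const df = new Map<string, number>();
  for (const doc of tokenized) {
    for (const term of Array.from(new Set(doc))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = tokenize(query);
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const n = df.get(term) ?? 0;
      // Rare terms get a higher inverse document frequency.
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Term frequency saturates via K1 and is length-normalized via B.
      const norm = tf + K1 * (1 - B + B * (doc.length / avgLen));
      score += (idf * tf * (K1 + 1)) / norm;
    }
    return score;
  });
}

const docs = [
  "token refresh flow for expired token",
  "refresh token lifecycle",
  "deploying with docker",
];
console.log(bm25Scores("token refresh", docs));
```

Scoring of this kind needs only term counts, so indexing requires no embeddings or model calls.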
Wiki write tools (opt-in with WIKI_WRITE=1):
| Tool | Purpose |
|---|---|
| find_similar | Duplicate detection with overlap ratios |
| draft_wiki_entry | Scaffold: suggested path, inferred frontmatter, glossary hits |
| write_wiki_entry | Validated write: path containment, schema, duplicate guards, dry-run |
Safety: path containment · frontmatter validation · duplicate detection · dry-run · overwrite protection.
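As an illustration of what a path containment guard involves, the sketch below resolves a requested path against a root and rejects anything that escapes it. The function name `resolveInside` and its rules are hypothetical, not doctree-mcp's code. It assumes POSIX-style `/` separators and an already-normalized root, and it does not address symlinks.

```typescript
// Hypothetical containment check: resolve `requested` relative to `root`
// and return null if the result would fall outside the root.
function resolveInside(root: string, requested: string): string | null {
  // Absolute paths are rejected outright in this sketch.
  if (requested.startsWith("/")) return null;

  const prefix = root.startsWith("/") ? "/" : "";
  const rootParts = root.split("/").filter((p) => p !== "" && p !== ".");
  const parts = [...rootParts];

  for (const segment of requested.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      // Popping at or below the root depth means the path escapes.
      if (parts.length <= rootParts.length) return null;
      parts.pop();
    } else {
      parts.push(segment);
    }
  }
  return prefix + parts.join("/");
}

console.log(resolveInside("docs/wiki", "auth/../runbooks/rotate.md")); // docs/wiki/runbooks/rotate.md
console.log(resolveInside("docs/wiki", "../secrets.md")); // null
```

Checking the segment depth on every `..` catches paths such as `a/../../x` that dip outside the root partway through.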
Deprecated aliases (list_documents, find_files, find_symbol) are superseded by search_documents — still functional, no longer recommended.
The Skill + MCP Pattern
Most retrieval tools hand the agent a search box and hope for the best. doctree-mcp hands it a tree, and the bundled skills teach it how to walk one.
- MCP = structural primitives. search_documents, get_tree, navigate_tree, get_node_content, and lookup_row return tree positions the agent reasons over, not finished answers.
- Skills = procedural knowledge. /doc-read, /doc-write, and /doc-lint encode breadcrumb drill-down: search → outline → navigate → retrieve. The agent learns the policy, not just the API.
That pairing doesn't exist cleanly elsewhere:
| Approach | Primitive | Skill teaches | Gap |
|---|---|---|---|
| Managed hybrid RAG (Cloudflare AI Search, Nia) | Flat chunks + similarity | — | Black-box score, no audit trail |
| Tool-returns-answer (Context7) | 2 tools returning answers | Query shape | Agent can't reason about skipped content |
| Skill-over-CLI (QMD) | CLI over flat search | Query expansion | No tree to navigate |
| doctree-mcp + /doc-read | Navigable tree | Breadcrumbs, multi-instance routing, wiki compilation | None |
Why iterative retrieval wins:
- Context rot. Stuffing a 1M-token window with chunks degrades output. Breadcrumb navigation keeps working memory small.
- Auditability. search_documents → get_tree → navigate_tree → get_node_content is a replayable trail. A cosine score is not. Regulated domains can ship the former.
- Progressive disclosure. Fewer navigable primitives beat tool sprawl (cf. Cloudflare Code Mode).
Multi-instance = client-side federation. Register several doctree servers under different names; the /doc-read skill encodes the routing policy. Add or remove instances without touching the skill. See Client setup → Multi-instance routing.
The LLM Wiki Pattern
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Raw Sources │ │ The Wiki │ │ The Schema │
│ (immutable) │ ──→ │ (LLM-maintained)│ ←── │ (you define) │
│ notes · logs │ │ runbooks · refs │ │ CLAUDE.md rules │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Inspired by Karpathy's LLM Wiki. Full walkthrough: docs/LLM-WIKI-GUIDE.md.
Configuration (summary)
---
title: "Descriptive Title"
description: "One-line summary — boosts ranking"
tags: [relevant, terms]
type: runbook # runbook | guide | reference | tutorial | architecture | adr
category: auth
---
All non-reserved frontmatter fields become filter facets:
search_documents("auth", filters: { type: "runbook", tags: ["production"] })
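The filter call above can be read as a predicate over each document's frontmatter. The sketch below is one plausible interpretation, not the tool's confirmed behavior. The `Doc` and `Facets` types are hypothetical, and treating an array-valued filter as "at least one shared value" is an assumption.

```typescript
type FacetValue = string | string[];
type Facets = Record<string, FacetValue>;

// Hypothetical document record: a path plus its frontmatter facets.
interface Doc {
  path: string;
  facets: Facets;
}

// A document matches when every filter key matches. For array values this
// sketch requires at least one shared value (an assumption; the real tool
// may require all values instead).
function matchesFilters(doc: Doc, filters: Facets): boolean {
  return Object.keys(filters).every((key) => {
    const wanted = ([] as string[]).concat(filters[key]);
    const actual = ([] as string[]).concat(doc.facets[key] ?? []);
    return wanted.some((w) => actual.indexOf(w) !== -1);
  });
}

const corpus: Doc[] = [
  { path: "auth/rotate-keys.md", facets: { type: "runbook", tags: ["production", "auth"] } },
  { path: "auth/overview.md", facets: { type: "guide", tags: ["auth"] } },
];

const hits = corpus.filter((d) =>
  matchesFilters(d, { type: "runbook", tags: ["production"] })
);
console.log(hits.map((d) => d.path)); // [ "auth/rotate-keys.md" ]
```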
Common env vars:
| Variable | Default | Description |
|---|---|---|
| DOCS_ROOT | ./docs | Docs folder |
| DOCS_GLOB | **/*.md | Comma-separated globs (**/*.md,**/*.csv,**/*.jsonl) |
| DOCS_ROOTS | (unset) | Weighted multi-collection (./wiki:1.0,./rfcs:0.5) |
| PORT | 3100 | HTTP mode port |
| WIKI_WRITE | (unset) | 1 enables write tools |
| GLOSSARY_PATH | $DOCS_ROOT/glossary.json | Query-expansion glossary |
Full reference: docs/CONFIGURATION.md.
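To make the DOCS_ROOTS format concrete, here is a sketch of parsing a value such as `./wiki:1.0,./rfcs:0.5` into path and weight pairs. The `parseDocsRoots` helper is hypothetical, and defaulting a missing weight to 1.0 is an assumption. Consult docs/CONFIGURATION.md for the actual rules.

```typescript
interface WeightedRoot {
  path: string;
  weight: number;
}

// Parse "path:weight,path:weight". Entries without a numeric weight fall
// back to 1.0 here, which is an assumption rather than documented behavior.
function parseDocsRoots(value: string): WeightedRoot[] {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const sep = entry.lastIndexOf(":");
      const tail = sep === -1 ? "" : entry.slice(sep + 1);
      const weight = Number(tail);
      // Treat the text after the last ":" as a weight only when it is
      // numeric, so paths that contain ":" themselves stay intact.
      if (tail === "" || Number.isNaN(weight)) {
        return { path: entry, weight: 1.0 };
      }
      return { path: entry.slice(0, sep), weight };
    });
}

console.log(parseDocsRoots("./wiki:1.0,./rfcs:0.5"));
```

One plausible use of the weight is as a multiplier on each collection's search scores, so that lower-weighted collections rank below equally relevant results from higher-weighted ones. That usage is also an assumption.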
Glossary — place glossary.json in docs root for bidirectional query expansion:
{ "CLI": ["command line interface"], "K8s": ["kubernetes"] }
Acronym definitions like "TLS (Transport Layer Security)" are also auto-extracted.
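Bidirectional expansion and acronym auto-extraction can be sketched as below. The function names, the regular expression, and the choice to return a list of query variants are all assumptions for illustration. The real matching rules may be stricter or looser.

```typescript
type Glossary = Record<string, string[]>;

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a lowercase lookup that links terms in both directions:
// acronym -> expansions and expansion -> acronym.
function buildBidirectional(glossary: Glossary): Map<string, string[]> {
  const lookup = new Map<string, string[]>();
  const link = (from: string, to: string): void => {
    const key = from.toLowerCase();
    const list = lookup.get(key) ?? [];
    if (list.indexOf(to.toLowerCase()) === -1) list.push(to.toLowerCase());
    lookup.set(key, list);
  };
  for (const abbr of Object.keys(glossary)) {
    for (const expansion of glossary[abbr]) {
      link(abbr, expansion);
      link(expansion, abbr);
    }
  }
  return lookup;
}

// Return the original query plus equivalents for each glossary term found.
function expandQuery(query: string, lookup: Map<string, string[]>): string[] {
  const q = query.toLowerCase();
  const terms = [q];
  lookup.forEach((equivalents, term) => {
    // Word-boundary match so "cli" does not fire inside "client".
    const pattern = new RegExp("\\b" + escapeRegExp(term) + "\\b");
    if (!pattern.test(q)) return;
    for (const e of equivalents) {
      if (terms.indexOf(e) === -1) terms.push(e);
    }
  });
  return terms;
}

// Pull "ACRONYM (Long Form)" definitions out of body text.
function extractAcronyms(text: string): Glossary {
  const found: Glossary = {};
  const re = /\b([A-Z][A-Za-z0-9]{1,9})\s+\(([A-Z][A-Za-z ]+)\)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found[m[1]] = [m[2].toLowerCase()];
  }
  return found;
}

const lookup = buildBidirectional({
  CLI: ["command line interface"],
  K8s: ["kubernetes"],
});
console.log(expandQuery("deploy to K8s", lookup)); // [ "deploy to k8s", "kubernetes" ]
console.log(extractAcronyms("TLS (Transport Layer Security) protects traffic."));
```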
Structured data — CSV/JSONL files become documents where each row is a tree node. Column roles (id, title, description, facets, URL) are auto-detected from headers. See docs/STRUCTURED-DATA.md.
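The row-as-node idea and the exact-key lookup can be illustrated with a small CSV. Everything below is a hypothetical sketch: the `RowNode` shape, the header candidates used for role detection, and the naive CSV parser (no quoted fields) are assumptions. See docs/STRUCTURED-DATA.md for the real detection rules.

```typescript
// Hypothetical node produced from one structured-data row.
interface RowNode {
  id: string;
  title: string;
  fields: Record<string, string>;
}

// Naive CSV parser: no quoted fields or embedded commas.
function parseCsv(text: string): Record<string, string>[] {
  const lines = text.trim().split("\n");
  const headers = lines[0].split(",").map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const row: Record<string, string> = {};
    headers.forEach((h, i) => {
      row[h] = (cells[i] ?? "").trim();
    });
    return row;
  });
}

// Pick the first header whose lowercase name is in the candidate list.
function findColumn(headers: string[], candidates: string[]): string | undefined {
  return headers.find((h) => candidates.indexOf(h.toLowerCase()) !== -1);
}

// Build a Map keyed by row ID so exact-key lookups are O(1) on average.
function indexRows(rows: Record<string, string>[]): Map<string, RowNode> {
  const byId = new Map<string, RowNode>();
  if (rows.length === 0) return byId;
  const headers = Object.keys(rows[0]);
  const idCol = findColumn(headers, ["id", "key", "ticket"]);
  const titleCol = findColumn(headers, ["title", "summary", "name"]);
  if (idCol === undefined) return byId;
  for (const row of rows) {
    const id = row[idCol];
    byId.set(id, { id, title: titleCol ? row[titleCol] : id, fields: row });
  }
  return byId;
}

const csv = `id,title,status
PROJ-43,Rotate signing keys,done
PROJ-44,Fix token refresh race,open`;

const rowIndex = indexRows(parseCsv(csv));
console.log(rowIndex.get("PROJ-44")?.title); // Fix token refresh race
```

A hash map keyed on the detected ID column is what makes a lookup like PROJ-44 constant-time, independent of how many rows the file contains.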
Running from Source
git clone https://github.com/joesaby/doctree-mcp.git
cd doctree-mcp && bun install
DOCS_ROOT=./docs bun run serve # stdio
DOCS_ROOT=./docs bun run serve:http # HTTP (port 3100)
DOCS_ROOT=./docs bun run index # CLI: inspect indexed output
bun test
Performance
| Operation | Time | Token cost |
|---|---|---|
| Full index (900 docs) | 2–5s | 0 |
| Incremental re-index | ~50ms | 0 |
| Search | 5–30ms | ~300–1K tokens |
| Tree outline | <1ms | ~200–800 tokens |
Docs
Setup & operation
- Operation Modes — stdio · HTTP · CLI
- Client Setup — Claude Code · Cursor · Windsurf · Codex · OpenCode · Claude Desktop
- Deployment — Railway · Fly.io · Render · Cloudflare Containers · Docker
- Configuration — env vars, frontmatter, ranking tuning
Patterns & concepts
- LLM Wiki Guide — agent-maintained knowledge base walkthrough
- Structured Data — CSV / JSONL indexing
- Architecture & Design — BM25 internals, tree navigation
- Competitive Analysis — PageIndex, QMD, GitMCP, Context7, managed RAG
Source
- Prompts — MCP prompt templates
- Skills: /doc-read · /doc-write · /doc-lint
Standing on Shoulders
- PageIndex — hierarchical tree navigation
- Pagefind by CloudCannon — BM25 scoring, positional index, facets
- Bun.markdown by Oven — native CommonMark parser
- Karpathy's LLM Wiki — the LLM-maintained wiki pattern
License
MIT