mcp-codebase-index
Structural codebase indexer with 17 query tools. 87% token reduction. Zero dependencies.
mcp-codebase-index
A structural codebase indexer with an MCP server for AI-assisted development. Zero runtime dependencies — uses Python's ast module for Python analysis and regex-based parsing for TypeScript/JS, Go, Rust, and C#. Requires Python 3.11+.
What It Does
Indexes codebases by parsing source files into structural metadata -- functions, classes, imports, dependency graphs, and cross-file call chains -- then exposes 18 query tools via the Model Context Protocol, enabling Claude Code and other MCP clients to navigate codebases efficiently without reading entire files.
Automatic incremental re-indexing: In git repositories, the index stays up to date automatically. Before every query, the server checks git diff and git status (~1-2ms). If files changed, only those files are re-parsed and the dependency graph is rebuilt. No need to manually call reindex after edits, branch switches, or pulls.
Persistent disk cache: The index is saved to a pickle cache file (.codebase-index-cache.pkl) after every build. On subsequent server starts, the cache is loaded and validated against the current git HEAD — if the ref matches, startup is instant. If a small number of files changed (≤20), the cached index is loaded and incrementally updated instead of rebuilt from scratch. This eliminates the cold-start penalty when restarting Claude Code sessions, restarting the MCP server, or resuming work after context compaction.
Language Support
| Language | Method | Extracts |
|---|---|---|
Python (.py) | AST parsing | Functions, classes, methods, imports, dependency graph |
TypeScript/JS (.ts, .tsx, .js, .jsx) | Regex-based | Functions, arrow functions, classes, interfaces, type aliases, imports |
Go (.go) | Regex-based | Functions, methods (receiver-based), structs, interfaces, type aliases, imports, doc comments |
Rust (.rs) | Regex-based | Functions (pub/async/const/unsafe), structs, enums, traits, impl blocks, use statements, attributes, doc comments, macro_rules |
C# (.cs) | Regex-based | Classes, interfaces, structs, enums, records, methods, constructors, using directives, [Attributes], /// XML doc comments |
Markdown/Text (.md, .txt, .rst) | Heading detection | Sections (# headings, underlines, numbered, ALL-CAPS) |
| Other | Generic | Line counts only |
Installation
pip install "mcp-codebase-index[mcp]"
The [mcp] extra includes the MCP server dependency. Omit it if you only need the programmatic API.
For development (from a local clone):
pip install -e ".[dev,mcp]"
MCP Server
Running
# As a console script
PROJECT_ROOT=/path/to/project mcp-codebase-index
# As a Python module
PROJECT_ROOT=/path/to/project python -m mcp_codebase_index.server
PROJECT_ROOT specifies which directory to index. Defaults to the current working directory.
Persistent Cache
In git repositories, the server automatically caches the index to .codebase-index-cache.pkl in the project root. On startup:
- Cache hit (exact match): If the cached git ref matches the current HEAD, the index loads instantly from disk — no parsing, no file walking.
- Cache hit (small changeset): If ≤20 files changed since the cached ref, the cached index is loaded and incrementally updated on the first query.
- Cache miss: If the changeset is large or no cache exists, a full rebuild runs and saves a new cache.
Add .codebase-index-cache.pkl to your .gitignore — it's a local-only build artifact.
Configuring with OpenClaw
Install the package on the machine where OpenClaw is running:
# Local install
pip install "mcp-codebase-index[mcp]"
# Or inside a Docker container / remote VPS
docker exec -it openclaw bash
pip install "mcp-codebase-index[mcp]"
Add the MCP server to your OpenClaw agent config (openclaw.json):
{
"agents": {
"list": [{
"id": "main",
"mcp": {
"servers": [
{
"name": "codebase-index",
"command": "mcp-codebase-index",
"env": {
"PROJECT_ROOT": "/path/to/project"
}
}
]
}
}]
}
}
Restart OpenClaw and verify the connection:
openclaw mcp list
All 18 tools will be available to your agent.
Performance note: The server automatically detects file changes via git diff before every query (~1-2ms) and incrementally re-indexes only what changed. However, OpenClaw's default MCP integration via mcporter spawns a fresh server process per tool call, which discards the in-memory index and forces a full rebuild each time (~1-2s for small projects, longer for large ones). With persistent caching, these cold starts are now significantly faster — the server loads from the disk cache instead of re-parsing the entire codebase. For persistent connections (avoiding even the cache load overhead), use the openclaw-mcp-adapter plugin, which connects once at startup and keeps the server running:
pip install openclaw-mcp-adapter
Configuring with Claude Code
Add to your project's .mcp.json:
{
"mcpServers": {
"codebase-index": {
"command": "mcp-codebase-index",
"env": {
"PROJECT_ROOT": "/path/to/project"
}
}
}
}
Or using the Python module directly (useful if installed in a virtualenv):
{
"mcpServers": {
"codebase-index": {
"command": "/path/to/.venv/bin/python3",
"args": ["-m", "mcp_codebase_index.server"],
"env": {
"PROJECT_ROOT": "/path/to/project"
}
}
}
}
Reinforcing Tool Usage with Hooks
Claude Code tends to default to built-in Glob/Grep/Read tools even when codebase-index is available. In addition to CLAUDE.md instructions (see below), you can add hooks that fire on every prompt to reinforce the behavior. Add this to .claude/settings.local.json:
{
"hooks": {
"SessionStart": [
{
"hooks": [
{
"type": "command",
"command": "echo 'CRITICAL REMINDER: Use codebase-index MCP tools FIRST for ALL code navigation (find_symbol, get_function_source, search_codebase, get_dependencies, etc). Only fall back to Glob/Grep/Read for non-code files.'"
}
]
}
],
"UserPromptSubmit": [
{
"hooks": [
{
"type": "command",
"command": "echo 'Use codebase-index MCP tools first for code navigation.'"
}
]
}
]
}
}
Hook stdout is injected as context Claude sees before responding. SessionStart fires on startup, resume, and context compaction. UserPromptSubmit fires on every turn.
Important: Make the AI Actually Use Indexed Tools
By default, AI assistants will ignore the indexed tools and fall back to reading entire files with Glob/Grep/Read. Soft language like "prefer" gets rationalized away. Add this to your project's CLAUDE.md (or equivalent instructions file) with mandatory language:
## Codebase Navigation — MANDATORY
You MUST use codebase-index MCP tools FIRST when exploring or navigating the codebase. This is not optional.
- ALWAYS start with: get_project_summary, find_symbol, get_function_source, get_class_source,
get_structure_summary, get_dependencies, get_dependents, get_change_impact, get_call_chain, search_codebase
- Only fall back to Read/Glob/Grep when codebase-index tools genuinely don't have what you need
(e.g. reading non-code files, config, frontmatter)
- If you catch yourself reaching for Glob/Grep/Read to find or understand code, STOP and use
codebase-index instead
The word "prefer" is too weak — models treat it as a suggestion and default to familiar tools. Mandatory language with explicit fallback criteria is what actually changes behavior.
Available Tools (18)
| Tool | Description |
|---|---|
get_project_summary | File count, packages, top classes/functions |
list_files | List indexed files with optional glob filter |
get_structure_summary | Structure of a file or the whole project |
get_functions | List functions with name, lines, params |
get_classes | List classes with name, lines, methods, bases |
get_imports | List imports with module, names, line |
get_function_source | Full source of a function/method |
get_class_source | Full source of a class |
find_symbol | Find where a symbol is defined (file, line, type) |
get_dependencies | What a symbol calls/uses |
get_dependents | What calls/uses a symbol |
get_change_impact | Direct + transitive dependents |
get_call_chain | Shortest dependency path (BFS) |
get_file_dependencies | Files imported by a given file |
get_file_dependents | Files that import from a given file |
search_codebase | Regex search across all files (max 100 results) |
reindex | Force full re-index (rarely needed — incremental updates happen automatically in git repos) |
get_usage_stats | Session efficiency stats: tool calls, characters returned vs total source, estimated token savings |
Benchmarks
Tested across four real-world projects on an M-series MacBook Pro, from a small project to CPython itself (1.1 million lines):
Index Build Performance
| Project | Files | Lines | Functions | Classes | Index Time | Peak Memory |
|---|---|---|---|---|---|---|
| RMLPlus | 36 | 7,762 | 237 | 55 | 0.9s | 2.4 MB |
| FastAPI | 2,556 | 332,160 | 4,139 | 617 | 5.7s | 55 MB |
| Django | 3,714 | 707,493 | 29,995 | 7,371 | 36.2s | 126 MB |
| CPython | 2,464 | 1,115,334 | 59,620 | 9,037 | 55.9s | 197 MB |
With persistent caching, subsequent startups bypass the full build entirely. Cache load time is negligible compared to parsing — a cache hit on CPython restores the full index in under a second instead of 56s.
Query Response Size vs Total Source
Querying CPython — 41 million characters of source code:
| Query | Response | Total Source | Reduction |
|---|---|---|---|
find_symbol("TestCase") | 67 chars | 41,077,561 chars | 99.9998% |
get_dependencies("compile") | 115 chars | 41,077,561 chars | 99.9997% |
get_change_impact("TestCase") | 16,812 chars | 41,077,561 chars | 99.96% |
get_function_source("compile") | 4,531 chars | 41,077,561 chars | 99.99% |
get_function_source("run_unittest") | 439 chars | 41,077,561 chars | 99.999% |
find_symbol returns 54-67 characters regardless of whether the project is 7K lines or 1.1M lines. Response size scales with the answer, not the codebase.
get_change_impact("TestCase") on CPython found 154 direct dependents and 492 transitive dependents in 0.45ms — the kind of query that's impossible without a dependency graph. Use max_direct and max_transitive to cap output to your token budget.
Query Response Time
All targeted queries return in sub-millisecond time, even on CPython's 1.1M lines:
| Query | RMLPlus | FastAPI | Django | CPython |
|---|---|---|---|---|
find_symbol | 0.01ms | 0.01ms | 0.03ms | 0.08ms |
get_dependencies | 0.00ms | 0.00ms | 0.00ms | 0.01ms |
get_change_impact | 0.02ms | 0.00ms | 2.81ms | 0.45ms |
get_function_source | 0.01ms | 0.02ms | 0.03ms | 0.10ms |
Run the benchmarks yourself: python benchmarks/benchmark.py
How Is This Different from LSP?
LSP answers "where is this function?" — mcp-codebase-index answers "what happens if I change it?" LSP is point queries: one symbol, one file, one position. It can tell you where LLMClient is defined and who references it. But ask "what breaks transitively if I refactor LLMClient?" and LSP has nothing. This tool returns 11 direct dependents and 31 transitive impacts in a single call — 204 characters. To get the same answer from LSP, the AI would need to chain dozens of find-reference calls recursively, reading files at every step, burning thousands of tokens to reconstruct what the dependency graph already knows.
LSP also requires you to install a separate language server for every language in your project — pyright for Python, vtsls for TypeScript, gopls for Go. Each one is a heavyweight binary with its own dependencies and configuration. mcp-codebase-index is zero dependencies, handles Python + TypeScript/JS + Go + Rust + C# + Markdown out of the box, and every response has built-in token budget controls (max_results, max_lines). LSP was built for IDEs. This was built for AI.
Programmatic Usage
from mcp_codebase_index.project_indexer import ProjectIndexer
from mcp_codebase_index.query_api import create_project_query_functions
indexer = ProjectIndexer("/path/to/project", include_patterns=["**/*.py"])
index = indexer.index()
query_funcs = create_project_query_functions(index)
# Use query functions
print(query_funcs["get_project_summary"]())
print(query_funcs["find_symbol"]("MyClass"))
print(query_funcs["get_change_impact"]("some_function"))
Development
pip install -e ".[dev,mcp]"
pytest tests/ -v
ruff check src/ tests/
References
The structural indexer was originally developed as part of the RMLPlus project, an implementation of the Recursive Language Models framework.
License
This project is dual-licensed:
- AGPL-3.0 for open-source use — see LICENSE
- Commercial License for proprietary use — see COMMERCIAL-LICENSE.md
If you're using mcp-codebase-index as a standalone MCP server for development, the AGPL-3.0 license applies at no cost. If you're embedding it in a proprietary product or offering it as part of a hosted service, you'll need a commercial license. See COMMERCIAL-LICENSE.md for details.
Похожие серверы
Alpha Vantage MCP Server
спонсорAccess financial market data: realtime & historical stock, ETF, options, forex, crypto, commodities, fundamentals, technical indicators, & more
llm-mcp
A Ruby gem for integrating Large Language Models (LLMs) via the Model Context Protocol (MCP) into development workflows.
Terminal MCP Server
Execute commands on local or remote hosts via SSH. Supports session persistence and environment variables.
Hoverfly MCP Server
An MCP server exposing Hoverfly as a programmable API simulation tool for AI assistants.
Unity Code MCP Server
Powerful tool for the Unity Editor that gives AI Agents ability to perform any action using Unity Editor API, like modification of scripts, scenes, prefabs, assets, configuration and more.
iOS Device Control
An MCP server to control iOS simulators and real devices, enabling AI assistant integration on macOS.
Ollama MCP Bridge
A bridge API service connecting Ollama with Model Context Protocol (MCP) servers.
Remote Terminal MCP for Cursor
A remote terminal tool for Cursor to manage and connect to remote servers via SSH, jump hosts, and Docker containers.
stdout-mcp-server
Captures and manages stdout logs from multiple processes via a named pipe system for real-time debugging and analysis.
WordPress Standalone
Integrates AI assistants with WordPress sites using the WordPress REST API.
Game Asset Generator
Generate 2D and 3D game assets using AI models hosted on Hugging Face Spaces.