semcode
An MCP (Model Context Protocol) server providing hybrid semantic search over code across a set of
GitHub repositories that you list in config.yaml. It parses symbols
with Tree-sitter and indexes both code and git commit history, so AI clients can query them by
natural language or by symbol name.
Hybrid retrieval combines dense embeddings with BM25, so both natural-language queries
("where do we publish order events?") and symbol-name lookups (PlaceOrderRequest) work well.
How it works
- Fetches source files from configured GitHub repositories
- Parses code symbols (functions, classes, methods, components) using Tree-sitter
- Generates two embeddings per symbol — a dense semantic vector (pluggable provider: Jina Code V2 by default, or Voyage / OpenAI / Ollama) and a BM25 sparse vector keyed on code-identifier tokens (camelCase / snake_case split into subwords)
- Stores both in Qdrant and retrieves them with hybrid search — Reciprocal Rank Fusion (RRF, sketched after this list) over the dense and sparse results — so natural-language queries and symbol-name lookups both work well
- Optionally indexes commit history into a separate Qdrant collection (dense-only)
- Exposes search and indexing tools through the MCP protocol (and a small HTTP API)
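Qdrant performs the fusion server-side, but the scoring rule is small enough to sketch. A minimal illustration of RRF, assuming the commonly used constant k = 60 (the constant semcode actually uses is not specified here):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: a hit found at rank r in any result list
    contributes 1 / (k + r) to its fused score, so items ranked highly by
    either the dense or the sparse search float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# fuse a dense result list with a sparse (BM25) result list
fused = rrf_fuse([["sym_a", "sym_b", "sym_c"], ["sym_b", "sym_d"]])
```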
Indexing is incremental — files are skipped when their Git blob SHA matches the last indexed version.
Files that no longer exist (or parse to zero symbols) are cleaned up automatically. Pass force: true
to re-embed everything.
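This check is cheap and exact because blob SHAs are content hashes: GitHub's tree listing already includes them, and they only change when the file bytes change. For reference, a blob SHA is the SHA-1 of a short header plus the file contents (a general Git fact, not semcode-specific code):

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Compute the Git blob SHA-1 for raw file contents,
    matching `git hash-object <file>`."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

assert git_blob_sha(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```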
Supported languages
Language is detected automatically from file extension or filename — no configuration needed.
Go, Java, Python, TypeScript / JavaScript (React), Rust, C#, C, C++, Ruby, PHP, Kotlin, Scala, Swift, Dart, Bash, SQL, Lua, R, Dockerfile, Docker Compose, Markdown, JSON, HTML, CSS, XML.
Most parsers are framework-aware where it matters — Spring stereotypes and HTTP routes for Java/Kotlin, FastAPI/Pydantic
for Python, ASP.NET for C#, Rails for Ruby, Laravel/Symfony for PHP, React/SwiftUI/Flutter widgets, etc. See
server/parser/ for the per-language extraction details.
Setup
Prerequisites: Python 3.12+, Docker, GitHub token
# Install dependencies
uv sync
# Copy environment file, then edit .env to set GITHUB_TOKEN
# (a fine-grained PAT with Contents: read on the target repos is sufficient)
cp .env.example .env
# Copy services config, then list the repositories you want indexed
cp config.example.yaml config.yaml
Configure which repositories to index in config.yaml:
services:
  - name: my-service
    github_repo: owner/repo
    github_ref: main        # optional, defaults to "main" — branch, tag, or commit SHA
    root: src/main/java     # optional — limit indexing to this subdirectory (useful for monorepos)
    exclude:                # optional — skip matching paths
      - "**/vendor/**"
      - "**/node_modules/**"
The indexer automatically discovers and indexes all files with recognised extensions. Use root to scope a service to a
subdirectory within a shared repo, and exclude to skip paths you don't want indexed (tests, build artifacts, generated
code, etc.).
Running
There are two ways to run, depending on whether you want embeddings to come from a local container or a hosted provider. Pick one:
Path A — local Jina via TEI (default, no API key required):
make docker-up-jina
# or: docker-compose --profile jina up
Path B — hosted provider (Voyage / OpenAI) or local Ollama:
# 1. In .env, set EMBEDDINGS_PROVIDER=voyage|openai|ollama and the relevant API key.
# 2. Then start without the jina profile:
make docker-up
# or: docker-compose up
⚠ The default EMBEDDINGS_PROVIDER is jina. If you start without --profile jina but leave the provider on the default, semcode will boot (the Jina service is required: false in compose) but the first embedding call will fail with a connection error — there's no auto-fallback.
Services started with health checks and persistent volumes:
| Service | Profile | Port | Volume | Purpose |
|---|---|---|---|---|
| Qdrant | always | 6333 (HTTP), 6334 (gRPC) | qdrant_data | Vector DB |
| Jina Embeddings (TEI) | jina | 8087 | embeddings_cache | Embedding model server |
| semcode MCP | always | 8090 | mounts ./config.yaml read-only | MCP + HTTP server |
The MCP server starts with empty collections — trigger an initial index by calling the reindex MCP tool
or POST /reindex (see below).
Connecting AI clients
Once the server is running, point your AI client at http://localhost:8090/mcp.
Claude Code (CLI)
claude mcp add --transport http semcode http://localhost:8090/mcp
Other MCP clients (Claude Desktop, Cursor, etc.) — add an entry to the client's MCP config:
{
"mcpServers": {
"semcode": {
"transport": "http",
"url": "http://localhost:8090/mcp"
}
}
}
Indexing
The indexing pipeline is symbol-oriented: each function, class, method, or component becomes its own chunk with a vector embedding and a rich payload.
- Discovery — lists all files in the repo at github_ref, applying root and exclude filters
- Change detection — compares the file's Git blob SHA to the last indexed value; unchanged files are skipped
- Parsing — Tree-sitter walks the AST and emits CodeSymbol objects per language
- Dense embedding text — language label, symbol kind, parent class, package, framework extras (Spring stereotype, HTTP route, Lombok, React memo), docstring, signature, and source (source truncated at ~6000 chars)
- Sparse (BM25) embedding text — signature, docstring, and source. Code identifiers are split into subwords (camelCase, snake_case) before tokenization, so getUserById indexes as get, user, by, and id as well as the full token (see the sketch after this list)
- Batching — dense provider batches at 32 (Jina/TEI, Ollama) or 128 (Voyage, OpenAI); BM25 runs in-process
- Upsert — both vectors stored under one point in Qdrant, keyed by a deterministic UUID (per service / file / symbol / line)
- Cleanup — entries for files no longer in the repo (or that now parse to zero symbols) are deleted
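A minimal sketch of the identifier splitting described above (the real tokenizer lives in server/embeddings/; the function name here is illustrative):

```python
import re

def split_identifier(token: str) -> list[str]:
    """Split camelCase / snake_case identifiers into lowercase subwords."""
    parts = token.replace("-", "_").split("_")  # snake/kebab case first
    words: list[str] = []
    for part in parts:
        # split camelCase and PascalCase boundaries, keeping acronyms intact
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words if w]

assert split_identifier("getUserById") == ["get", "user", "by", "id"]
```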
Git history indexing is a separate, optional pipeline that embeds commit messages and changed file
paths into the git_commits collection. Full unified diffs are stored in the payload and retrievable
via the get_commit tool. The number of commits per service is capped by GIT_HISTORY_MAX_COMMITS
(default 500).
Tests
uv sync --group dev
uv run pytest
Tests live under tests/:
- tests/parser/test_*.py — one file per language; snapshots parser behavior against canonical fixtures in tests/fixtures/<language>/
- tests/test_pipeline.py, tests/test_store.py, tests/test_git_history.py — integration tests for the indexing pipeline and Qdrant store
- tests/test_reindex_route.py — HTTP route tests
MCP Tools
| Tool | Description |
|---|---|
search_code | Hybrid (dense + BM25) search by query, with optional filters for language, service, symbol type |
find_symbol | Look up a symbol by name — exact match, or case-insensitive substring when exact=false |
find_usages | Find code that references a given symbol name (semantic search, then excludes the definition itself) |
get_code_context | Fetch the full source of a file — or a specific symbol within it — directly from GitHub |
reindex | Trigger code indexing of one or all services (incremental by default; force to re-embed) |
index_history | Index git commit history; automatically fetches diffs for commits missing them |
search_commits | Search git commit history with natural language |
get_commit | Get full details for a specific commit including changed files and diffs |
list_indexed_services | List indexed services with chunk and file counts, languages, and last-indexed time |
index_stats | Show Qdrant collection statistics and configured services |
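As a quick smoke test, the tools can also be called programmatically. A sketch using the official MCP Python SDK's streamable-HTTP client (the "query" argument name is an assumption based on the table above, not a confirmed schema):

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # connect to the running semcode server over streamable HTTP
    async with streamablehttp_client("http://localhost:8090/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_code",
                {"query": "where do we publish order events?"},
            )
            print(result.content)

asyncio.run(main())
```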
MCP Prompts
| Prompt | Arguments | Description |
|---|---|---|
service_overview | service | Walks the client through producing an architectural overview of a service: HTTP entry points, domain types, and notable framework conventions |
system_design_overview | (none) | Walks the client through producing a complete system design overview: service inventory, communication topology, shared data stores, and cross-cutting concerns — includes Mermaid diagrams |
HTTP API
In addition to the MCP tools, the server exposes two HTTP endpoints for triggering indexing from CI/CD or external schedulers:
| Endpoint | Body | Description |
|---|---|---|
POST /reindex | {"service": "<name>"?, "force": <bool>?} | Reindex one or all services — returns NDJSON |
POST /reindex-history | {"service": "<name>"?, "force": <bool>?} | Index git commit history — returns NDJSON |
All bodies are optional — omit service to act on all services, omit force for incremental indexing.
Both endpoints stream newline-delimited JSON (one frame per line) so you can consume progress
in real time from CI/CD pipelines or any other client.
Frame shapes:
// in-flight progress
{"type": "progress", "phase": "discovery|upserting|cleanup", "current": 12, "total": 200, "percentage": 6.0, "service": "my-service"}
// final summary (one per request)
{"type": "done", "result": {"files": 42, "chunks": 318, "skipped": 5}}
// emitted instead of "done" on failure
{"type": "error", "message": "..."}
For /reindex-history the phase value is discovery|embedding|upserting and the done result is
{"new": int, "skipped": int, "diff_updated": int}.
Environment variables
| Variable | Default | Description |
|---|---|---|
GITHUB_TOKEN | (required) | GitHub token with repo read access |
QDRANT_URL | http://localhost:6333 | Qdrant connection URL |
QDRANT_COLLECTION | code_symbols | Collection name for code symbol vectors |
QDRANT_COMMITS_COLLECTION | git_commits | Collection name for commit message vectors |
EMBEDDINGS_PROVIDER | jina | One of jina, jina-api, voyage, openai, ollama — see Embedding providers below |
GIT_HISTORY_MAX_COMMITS | 500 | Max commits indexed per service |
MCP_TRANSPORT | streamable-http | One of streamable-http, sse, stdio |
MCP_HOST / MCP_PORT | 127.0.0.1 / 8090 | Server bind address |
CONFIG_PATH | ./config.yaml | Path to the services config file |
Embedding providers
The embedding backend is selectable via EMBEDDINGS_PROVIDER. Default is jina so existing
deployments keep working unchanged. Each provider derives its own vector dimensions from the
configured model — no need to set dimensions manually unless you want to override.
| Variable | Default | Applies to | Description |
|---|---|---|---|
JINA_URL | http://localhost:8087 | jina | TEI base URL |
JINA_MODEL | jinaai/jina-embeddings-v2-base-code | jina | Informational only — the TEI container's --model-id flag is what actually loads. Edit docker-compose.yaml to change models. |
JINA_DIMENSIONS | 768 | jina | Vector dimensions of the TEI model |
JINA_API_KEY | (required if provider=jina-api) | jina-api | Jina AI API key (hosted endpoint at api.jina.ai) |
JINA_API_MODEL | jina-embeddings-v2-base-code | jina-api | Hosted Jina model — also supports jina-code-embeddings-0.5b, jina-code-embeddings-1.5b |
JINA_API_DIMENSIONS | (native) | jina-api | Optional Matryoshka override (code-embeddings models support shrinking); required for models without a native default |
VOYAGE_API_KEY | (required if provider=voyage) | voyage | Voyage AI API key |
VOYAGE_MODEL | voyage-code-3 | voyage | Voyage embedding model |
VOYAGE_DIMENSIONS | (native) | voyage | Optional override — Voyage code-3 supports 256 / 512 / 1024 / 2048 |
OPENAI_API_KEY | (required if provider=openai) | openai | OpenAI API key |
OPENAI_EMBEDDING_MODEL | text-embedding-3-large | openai | OpenAI embedding model |
OPENAI_DIMENSIONS | (native) | openai | Optional override (text-embedding-3-* models support shrinking) |
OLLAMA_URL | http://localhost:11434 | ollama | Ollama base URL |
OLLAMA_MODEL | nomic-embed-text | ollama | Ollama embedding model |
OLLAMA_DIMENSIONS | (native) | ollama | Required if using a model not in the built-in dimensions table |
voyage-code-3 outperforms jinaai/jina-embeddings-v2-base-code on most code-retrieval benchmarks,
so switching to Voyage can be a quality upgrade, not just a deployment preference.
Switching providers against an existing index: if the new provider's vector size differs from
the existing Qdrant collection, the server fails fast at startup with a clear error pointing at the
offending collection. To switch, drop both collections (code_symbols and git_commits) via the
Qdrant UI or API, then reindex. There is no automatic migration.
Hosted-only setup (no local TEI container): set EMBEDDINGS_PROVIDER and the relevant API key
in .env, then start without the jina profile (docker-compose up / make docker-up). The
jina-embeddings container will not start.
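For example, a hosted Voyage setup needs only two settings in .env (values illustrative):

```
EMBEDDINGS_PROVIDER=voyage
VOYAGE_API_KEY=<your-voyage-key>
# VOYAGE_MODEL defaults to voyage-code-3; VOYAGE_DIMENSIONS is optional
```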
Qdrant collections
code_symbols — one point per parsed symbol, carrying two named vectors:
- text-dense — cosine distance, HNSW (m=16, ef_construct=128), size determined by the embedding provider
- text-sparse — BM25 over code-identifier subword tokens, in-memory sparse index
search_code queries both via a Qdrant query_points call with FusionQuery(fusion=RRF). Indexed
payload fields (language, service, symbol_type, chunk_tier, parent_name, file_path) are
usable as filters. The full payload also includes signature, docstring, annotations, package,
start_line, end_line, file_hash, indexed_at, and language-specific extras (http_method,
http_route, spring_stereotype, lombok_annotations, is_async, uses_memo, …).
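A sketch of what such a fused query looks like with the qdrant-client Python SDK (the vector values, prefetch limits, and filter are illustrative; semcode's actual call may differ):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

dense_query = [0.1] * 768  # the dense embedding of the user's query
sparse_query = models.SparseVector(indices=[17, 42], values=[1.2, 0.8])

hits = client.query_points(
    collection_name="code_symbols",
    prefetch=[
        models.Prefetch(query=dense_query, using="text-dense", limit=50),
        models.Prefetch(query=sparse_query, using="text-sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse both result sets
    query_filter=models.Filter(must=[
        models.FieldCondition(key="language", match=models.MatchValue(value="java")),
    ]),
    limit=10,
)
```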
git_commits — one dense-only vector per commit (cosine, HNSW m=16 / ef_construct=128).
Payload includes sha, service, message, author_name, author_email, committed_at,
indexed_at, has_diff, diff_truncated, and files (array of changed files with filename,
status, additions, deletions, patch). sha, service, author_name, and has_diff are
indexed payload fields.
Project structure
server/
├── main.py # MCP server entry point + lifespan
├── config.py # Settings and service configuration
├── state.py # Shared store singletons
├── parser/ # Tree-sitter parsers (Go, Java, Python, TypeScript, Rust, C#, C, C++, Ruby, PHP, Kotlin, Scala, Swift, Dart, Bash, SQL, Lua, R, Dockerfile, Compose, Markdown, JSON, HTML, CSS, XML)
├── embeddings/ # Pluggable dense providers (Jina/Voyage/OpenAI/Ollama) + BM25 sparse + code identifier tokenizer
├── indexer/ # GitHub fetcher, code indexing pipeline, git history pipeline
├── store/ # Qdrant vector stores (code_symbols hybrid + git_commits dense)
├── tools/ # MCP tool implementations (search, index, history, admin)
├── prompts/ # MCP prompt templates (service_overview, system_design_overview)
└── routes/ # HTTP routes (reindex, reindex-history)