# hadith-mcp

Model Context Protocol (MCP) server and data pipeline for serving canonical hadith text (Arabic and English) to assistants in a citation-safe way. Like quran-mcp, the goal is grounding: fetch from a real corpus instead of quoting from model memory.
This repository provides a FastMCP server over `data/hadith.db` plus a data pipeline to build that database: normalized SQLite, OpenAI embeddings (`text-embedding-3-large`), cross-collection references (cosine similarity + narrator-aware scoring), and provenance-style tags (e.g. muttafaq-style links between Sahih al-Bukhari and Sahih Muslim).
## Data sources and credits
- Hadith text comes from the community hadith-json dataset (scraped from Sunnah.com), which aligns with the broader sunnah-com / Quran Foundation ecosystem, the same family of sources behind quran-mcp.
- Architecture and patterns are inspired by quran-mcp (FastMCP, grounding mindset, tooling layout).
If you ship a product or paper, keep upstream attribution visible (dataset authors, Sunnah.com, and the scholarly collections themselves).
## Repository layout
| Path | Purpose |
|---|---|
| `scripts/build_db.py` | Load hadith-json `db/by_book` JSON → SQLite schema; optional embed, cross-ref, provenance |
| `scripts/embed_hadith.py` | Resume-only embedding for rows with `embedding IS NULL` (slow, checkpoint-friendly) |
| `scripts/merge_embedding_checkpoints.py` | Replay JSONL embedding checkpoints into `hadith.db` after crashes or restores |
| `scripts/compute_crossref.py` | Recompute `cross_references` + provenance only (does not re-import JSON; safe after embed) |
| `src/hadith_mcp/pipeline/` | Loaders, schema, embed, cross-reference, provenance logic |
| `src/hadith_mcp/server.py` | FastMCP app: MCP tools + a small REST surface (`/api/collections`, `/api/hadith/{id}`, `/api/hadith/{slug}/{n}`, `/api/search`) reusing the same store and embedding index |
| `search/` | Static search frontend (HTML/CSS/JS) deployed standalone (e.g. search.hadith-mcp.org) |
| `site/` | Static landing page for the main domain |
| `config.yml` | Optional default DB path (overridden by `HADITH_MCP_DB_PATH`) |
Large reference trees `hadith-json-main/` and `quran-mcp-master/` are listed in `.gitignore`. Clone or unpack hadith-json locally (for example as `hadith-json-main/`) or pass `--data-dir` to `build_db.py`.
## Quick start

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env  # set OPENAI_API_KEY for embedding steps
```
### 1) Build the database (without calling OpenAI)

Point `--data-dir` at your local hadith-json `db/by_book` directory.

```sh
python scripts/build_db.py --fresh --skip-embed --skip-cross --skip-provenance \
  --data-dir ./hadith-json-main/db/by_book
```
### 2) Embeddings (long run; use a separate machine if you prefer)

Safe defaults: batch size 1, commit every 10 rows, a short sleep between calls, and an optional JSONL checkpoint for safety.

```sh
python scripts/embed_hadith.py \
  --db-path ./data/hadith.db \
  --checkpoint ./data/embeddings_checkpoint.jsonl \
  --batch-size 1 \
  --commit-every 10 \
  --sleep-between-batches 0.15
```
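Conceptually, the resume-only loop looks like the following minimal sketch. The `hadiths` table and column names here are assumptions, and `embed_text` is a stand-in for the real OpenAI call; the actual script adds pacing, batching, and error handling.

```python
import json
import sqlite3


def embed_text(text: str) -> list[float]:
    # Hypothetical stand-in for the OpenAI embedding call.
    return [float(len(text)), 0.0, 0.0]


def embed_missing(db_path: str, checkpoint_path: str, commit_every: int = 10) -> int:
    """Embed only rows where embedding IS NULL, committing in small batches and
    appending each result to a JSONL checkpoint so a crash loses nothing."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, english_text FROM hadiths WHERE embedding IS NULL ORDER BY id"
    ).fetchall()
    done = 0
    with open(checkpoint_path, "a", encoding="utf-8") as ckpt:
        for hadith_id, text in rows:
            vector = embed_text(text or "")
            # Checkpoint first, then write to the DB, so the JSONL is never behind.
            ckpt.write(json.dumps({"id": hadith_id, "embedding": vector}) + "\n")
            conn.execute(
                "UPDATE hadiths SET embedding = ? WHERE id = ?",
                (json.dumps(vector), hadith_id),
            )
            done += 1
            if done % commit_every == 0:
                conn.commit()
    conn.commit()
    conn.close()
    return done
```

Because the query selects only `embedding IS NULL` rows, re-running after an interruption picks up exactly where the last commit left off.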
Replay checkpoints into the DB when needed:

```sh
python scripts/merge_embedding_checkpoints.py --db-path ./data/hadith.db \
  ./data/embeddings_checkpoint.jsonl --only-missing
```
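The replay step amounts to roughly the following sketch, assuming one `{"id": ..., "embedding": [...]}` record per JSONL line and a hypothetical `hadiths(id, embedding)` schema; the real script's record format may differ.

```python
import json
import sqlite3


def replay_checkpoint(db_path: str, checkpoint_path: str, only_missing: bool = True) -> int:
    """Replay a JSONL embedding checkpoint into the database.
    With only_missing=True, rows that already have an embedding are left alone,
    mirroring the --only-missing flag."""
    conn = sqlite3.connect(db_path)
    clause = " AND embedding IS NULL" if only_missing else ""
    applied = 0
    with open(checkpoint_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            cur = conn.execute(
                "UPDATE hadiths SET embedding = ? WHERE id = ?" + clause,
                (json.dumps(record["embedding"]), record["id"]),
            )
            applied += cur.rowcount  # counts only rows actually updated
    conn.commit()
    conn.close()
    return applied
```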
### 3) Cross-references and provenance (local CPU)

Do not re-run `build_db.py` without `--fresh` after embedding unless you intend to re-import JSON (that path can overwrite rows and clear `embedding`). Instead:

```sh
python scripts/compute_crossref.py --db-path ./data/hadith.db
```

For a single-machine full build (import + embed + cross + provenance), run `build_db.py` once without `--skip-embed` / `--skip-cross` / `--skip-provenance`, and pass embedding pacing flags as needed (`python scripts/build_db.py --help`).
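The "cosine similarity + narrator-aware scoring" idea can be sketched as follows. The 0.05 bonus, the cap, and the narrator comparison are illustrative assumptions, not the script's actual formula.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def crossref_score(vec_a, vec_b, narrator_a: str, narrator_b: str) -> float:
    """Cosine similarity with a small bonus when the top-level narrator matches,
    so parallel narrations score above merely similar-sounding ones."""
    score = cosine(vec_a, vec_b)
    if narrator_a and narrator_a.strip().lower() == narrator_b.strip().lower():
        score += 0.05  # assumed bonus; the real pipeline may weight differently
    return min(score, 1.0)
```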
### 4) MCP server (stdio for Cursor / Claude Desktop)

From the repo root with `data/hadith.db` present (or set `HADITH_MCP_DB_PATH`):

```sh
hadith-mcp --transport stdio
# or: python -m hadith_mcp --transport stdio
# or: fastmcp run hadith_mcp.server:mcp
```
Optional `--config config.yml` sets `database.path` relative to the config file. `HADITH_MCP_DB_PATH` overrides both.
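The resolution precedence reads roughly like this sketch (function name and the fallback default are assumptions; the real logic lives in the server):

```python
import os
from pathlib import Path
from typing import Optional


def resolve_db_path(config_path: Optional[str], config_db_path: Optional[str]) -> Path:
    """Resolve the database path: HADITH_MCP_DB_PATH wins, then config.yml's
    database.path (relative to the config file), then the repo default."""
    env = os.environ.get("HADITH_MCP_DB_PATH")
    if env:
        return Path(env)
    if config_path and config_db_path:
        # database.path is relative to the config file, not the CWD.
        return (Path(config_path).parent / config_db_path).resolve()
    return Path("data/hadith.db")
```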
HTTP / SSE / streamable HTTP (see FastMCP docs for host/port env vars):

```sh
hadith-mcp --transport http
```
Tools (summary):

- `fetch_grounding_rules` returns the full text once per MCP session (then a short repeat unless `force_full=True`); pass the returned nonce only when you need to disambiguate errors.
- `fetch_hadith` accepts a global `hadith_id`, or `collection` (slug or English name) + `hadith_number` (int, or a string range like `1-5`), with optional `include_cross_references`.
- `search_hadith` defaults to semantic search (loads all embeddings at startup and embeds the query with the configured query embedding model); use `mode=keyword` for SQL substring search, or `mode=both`. Semantic search needs `OPENAI_API_KEY` and a database whose rows include embeddings. If OpenAI returns quota/billing/rate-limit errors (or the query vector size does not match the DB), the server falls back to keyword search instead of failing.
- Optional per-client rate limits and an LRU query cache reduce cost and abuse (see `config.yml` / `.env.example`).
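The semantic-with-keyword-fallback behavior can be sketched as follows; the in-memory corpus shape and function names are illustrative, not the server's API.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query, corpus, embed_query, limit=5):
    """Semantic search that degrades gracefully: if embedding the query fails
    (quota, rate limit) or the vector dimension does not match the corpus,
    fall back to a plain substring match instead of erroring."""
    try:
        qvec = embed_query(query)
        if any(len(vec) != len(qvec) for _, vec in corpus.values()):
            raise ValueError("query vector dimension mismatch")
        scored = sorted(
            ((cosine(qvec, vec), hid) for hid, (_, vec) in corpus.items()),
            reverse=True,
        )
        return [hid for _, hid in scored[:limit]], "semantic"
    except Exception:
        hits = [hid for hid, (text, _) in corpus.items()
                if query.lower() in text.lower()]
        return hits[:limit], "keyword"
```

Returning the mode alongside the hits lets the caller (or the REST `note` field) report which path was taken.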
**Citation URLs.** `fetch_hadith`, `search_hadith`, and `fetch_cross_references` attach a `url` field to each hadith or cross-reference row pointing at the search frontend (`https://search.hadith-mcp.org/?id=<db_id>` by default, overridable via `HADITH_SEARCH_APP_URL`). The server's MCP instructions tell assistants to always surface this link alongside citations and never to fabricate links to external hadith sites (sunnah.com, etc.).
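Building that `url` field amounts to roughly the following (a sketch; the function name is an assumption):

```python
import os
from urllib.parse import urlencode


def citation_url(db_id: int) -> str:
    """Build the url attached to each result row: the search frontend base
    (overridable via HADITH_SEARCH_APP_URL) plus the row's database id."""
    base = os.environ.get("HADITH_SEARCH_APP_URL", "https://search.hadith-mcp.org/")
    return f"{base}?{urlencode({'id': db_id})}"
```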
### 5) Public search frontend and REST API

The repo ships a small static search app in `search/` and an HTTP REST surface on the same FastMCP process, intended to be deployed as two subdomains (e.g. search.hadith-mcp.org and api.hadith-mcp.org) with nginx / Caddy proxying `/api/*` to the FastMCP port.

- Frontend (`search/`): plain HTML/CSS/JS, no build step. Bootstraps from `?id=<db_id>` or `?q=<query>` on load, so the URLs MCP tools emit resolve directly. The API base defaults to `https://api.hadith-mcp.org`; override in the browser via `window.HADITH_API_BASE` (set before `script.js` loads) for local or staging deployments.
- REST endpoints (same process, mounted via `@mcp.custom_route`):
  - `GET /api/collections` → `{collections: [...]}`
  - `GET /api/hadith/{hadith_id}` → `{hadith: {...}}`
  - `GET /api/hadith/{slug}/{id_in_book}` → `{hadith: {...}}`
  - `GET /api/search?q=&limit=&collection=` → `{results, mode, note}`; semantic by default with the same keyword fallback behavior as the MCP tool. Shares `HADITH_MCP_RATE_LIMIT_SEARCH_RPM` and the query cache with MCP clients, so one budget covers both surfaces.
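A client can call the search endpoint with plain urllib; a minimal sketch (the `API_BASE` host is an assumption, so point it at your own deployment, and the request is built but not sent here to keep the sketch offline):

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.hadith-mcp.org"  # assumed deployment; override for local use


def search_request(query: str, limit: int = 5,
                   collection: Optional[str] = None) -> Request:
    """Build a GET request for /api/search; send it with
    urllib.request.urlopen(...) and json-decode the body."""
    params = {"q": query, "limit": limit}
    if collection:
        params["collection"] = collection
    return Request(f"{API_BASE}/api/search?{urlencode(params)}")
```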
## Configuration

- Secrets: `.env` is gitignored; see `.env.example` for `OPENAI_API_KEY` and MCP tuning (`HADITH_MCP_QUERY_EMBEDDING_MODEL`, `HADITH_MCP_RATE_LIMIT_SEARCH_RPM`, `HADITH_MCP_SEARCH_CACHE_MAX`).
- Hosted MCP: Put `OPENAI_API_KEY` on the server only if you accept paying for query embeddings; tune `HADITH_MCP_RATE_LIMIT_SEARCH_RPM` (e.g. 30–120) and the cache size. The query model must match the dimension of the vectors stored in `hadith.db` (this repo's build uses `text-embedding-3-large` / 3072). A cheaper OpenAI model generally means rebuilding the database with that model so dimensions align.
- Artifacts: Other `data/*.db` files and embedding checkpoint globs are gitignored; this repo tracks `data/hadith.db` (Git LFS) plus `data/SHA256SUMS` for verification (`cd data && sha256sum -c SHA256SUMS`).
- Embeddings: Rows with an empty English narrator and text still embed using the Arabic text when present. Long inputs are clipped with tiktoken (`cl100k_base`) to stay under the 8192-token API limit, with a further shrink ladder if a row still hits length errors.
- Count rows without the `sqlite3` CLI:

  ```sh
  python -c "import sqlite3; c=sqlite3.connect('data/hadith.db'); print(c.execute('SELECT COUNT(*) FROM hadiths WHERE embedding IS NULL').fetchone()[0])"
  ```
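The "shrink ladder" idea above can be sketched dependency-free (the real pipeline tokenizes with tiktoken's `cl100k_base`; here `embed_fn` is a stand-in for the embedding call, and `ValueError` stands in for the API's length error):

```python
def clip_with_shrink_ladder(tokens, embed_fn, limit=8192,
                            ladder=(1.0, 0.75, 0.5, 0.25)):
    """Try the full token budget first; each time the call still fails for
    length, step down the ladder to a smaller fraction of the limit."""
    for fraction in ladder:
        budget = int(limit * fraction)
        try:
            return embed_fn(tokens[:budget])
        except ValueError:  # stand-in for the API's length error
            continue
    raise ValueError("input could not be shrunk enough to embed")
```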
## License

- Software in this repository (Python, scripts, and documentation we added) is licensed under the GNU General Public License v3.0 only (SPDX: `GPL-3.0-only`).
- Hadith text and other upstream material remain under their original terms (hadith-json, Sunnah.com). Our GPL applies to our code, not to a relicensing of that content; keep attribution and follow upstream rules when you redistribute data or excerpts.
## Releases and data integrity

- Checksum: `data/SHA256SUMS` lists `hadith.db`. After cloning or downloading the database, run `cd data && sha256sum -c SHA256SUMS`.
- Signing (optional): A detached GPG or Sigstore signature over the checksum file or the database proves who published the bytes and that they were not altered afterward. Signing does not certify the scholarly accuracy of every narration or automated cross-reference.
- Reproducibility: For audits or rebuilds, record the hadith-json revision, this repo's git revision, the embedding model id, and the script versions you used.
## Contributing

Issues and PRs welcome. Please keep diffs focused and match the existing style (ruff / pytest when present).