# hadith-mcp
Model Context Protocol (MCP) server and data pipeline for serving canonical hadith text (Arabic and English) to assistants in a citation-safe way—similar in spirit to quran-mcp: fetch from a real corpus instead of quoting from model memory.
This repository provides a FastMCP server over `data/hadith.db` plus a data pipeline to build that database: normalized SQLite, OpenAI embeddings (`text-embedding-3-large`), cross-collection references (cosine similarity plus narrator-aware scoring), and provenance-style tags (e.g. muttafaq-style links between Sahih al-Bukhari and Sahih Muslim).
## Data sources and credits
- Hadith text comes from the community hadith-json dataset (scraped from Sunnah.com), which aligns with the broader sunnah-com / Quran Foundation ecosystem—the same family of sources behind quran-mcp.
- Architecture and patterns are inspired by quran-mcp (FastMCP, grounding mindset, tooling layout).
If you ship a product or paper, keep upstream attribution visible (dataset authors, Sunnah.com, and the scholarly collections themselves).
## Repository layout

| Path | Purpose |
|---|---|
| `scripts/build_db.py` | Load hadith-json `db/by_book` JSON into the SQLite schema; optional embed, cross-ref, and provenance steps |
| `scripts/embed_hadith.py` | Resume-only embedding for rows with `embedding IS NULL` (slow, checkpoint-friendly) |
| `scripts/merge_embedding_checkpoints.py` | Replay JSONL embedding checkpoints into `hadith.db` after crashes or restores |
| `scripts/compute_crossref.py` | Recompute `cross_references` + provenance only (does not re-import JSON; safe after embed) |
| `scripts/fetch_ext_apps.py` | Vendor / refresh `@modelcontextprotocol/ext-apps` as a classic script (sets `window.__hadithMcpSdk`) used by the interactive reader |
| `scripts/generate_search_sitemap.py` | Regenerate `search/sitemap.xml` (index) + `search/sitemaps/*.xml` (~50k `?id=` URLs) from `data/hadith.db` for SEO after DB changes |
| `src/hadith_mcp/pipeline/` | Loaders, schema, embed, cross-reference, and provenance logic |
| `src/hadith_mcp/server.py` | FastMCP app: MCP tools + a small REST surface (`/api/collections`, `/api/hadith/{id}`, `/api/hadith/{slug}/{n}`, `/api/search`) reusing the same store and embedding index |
| `src/hadith_mcp/assets/hadith_app.html` | Self-contained MCP App UI template (inline CSS + app logic, system fonts only) served at `ui://hadith.html` for the `show_hadith` tool |
| `src/hadith_mcp/assets/ext-apps.bundle.js` | Vendored ext-apps SDK (zero external imports) inlined into `hadith_app.html` at resource-render time |
| `search/` | Static search frontend (HTML/CSS/JS) deployed standalone (e.g. search.hadith-mcp.org) |
| `site/` | Static landing page for the main domain |
| `config.yml` | Optional default DB path (overridden by `HADITH_MCP_DB_PATH`) |
Large reference trees `hadith-json-main/` and `quran-mcp-master/` are listed in `.gitignore`. Clone or unpack hadith-json locally (for example as `hadith-json-main/`) or pass `--data-dir` to `build_db.py`.
## Quick start
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env  # set OPENAI_API_KEY for embedding steps
```
### 1) Build the database (without calling OpenAI)

Point `--data-dir` at your local hadith-json `db/by_book` directory.

```shell
python scripts/build_db.py --fresh --skip-embed --skip-cross --skip-provenance \
    --data-dir ./hadith-json-main/db/by_book
```
### 2) Embeddings (long run; use a separate machine if you prefer)

Safe defaults: batch size 1, commit every 10 rows, sleep between calls, and an optional JSONL checkpoint for safety.

```shell
python scripts/embed_hadith.py \
    --db-path ./data/hadith.db \
    --checkpoint ./data/embeddings_checkpoint.jsonl \
    --batch-size 1 \
    --commit-every 10 \
    --sleep-between-batches 0.15
```
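The resume-only loop the script implements (checkpoint each vector to JSONL before writing the row, commit every N rows, sleep between batches) can be sketched roughly as below. This is a minimal sketch, not the script itself: the `hadiths` / `english_text` / `embedding` names, the JSON storage format, and the `embed_batch` stub standing in for the real OpenAI call are all assumptions.

```python
import json
import sqlite3
import time

def embed_batch(texts):
    # Hypothetical stand-in for the real OpenAI embeddings call: one vector per input.
    return [[float(len(t))] for t in texts]

def embed_missing(db_path, checkpoint_path, batch_size=1, commit_every=10, sleep=0.15):
    """Resume-only: embed rows where embedding IS NULL, checkpointing to JSONL first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, english_text FROM hadiths WHERE embedding IS NULL ORDER BY id"
    ).fetchall()
    done = 0
    with open(checkpoint_path, "a", encoding="utf-8") as ckpt:
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            vectors = embed_batch([text or "" for _, text in batch])
            for (row_id, _), vec in zip(batch, vectors):
                # Write the checkpoint line before touching the DB, so a crash
                # mid-commit loses nothing that cannot be replayed later.
                ckpt.write(json.dumps({"id": row_id, "embedding": vec}) + "\n")
                conn.execute(
                    "UPDATE hadiths SET embedding = ? WHERE id = ?",
                    (json.dumps(vec), row_id),
                )
                done += 1
                if done % commit_every == 0:
                    conn.commit()
            time.sleep(sleep)  # pace API calls between batches
    conn.commit()
    conn.close()
    return done
```

Because only `embedding IS NULL` rows are selected, re-running after a crash picks up where the last commit left off.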
Replay checkpoints into the DB when needed:

```shell
python scripts/merge_embedding_checkpoints.py --db-path ./data/hadith.db \
    ./data/embeddings_checkpoint.jsonl --only-missing
```
### 3) Cross-references and provenance (local CPU)

Do not re-run `build_db.py` without `--fresh` after embedding unless you intend to re-import JSON (that path can overwrite rows and clear `embedding`). Instead:

```shell
python scripts/compute_crossref.py --db-path ./data/hadith.db
```
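A cross-reference score combining cosine similarity with narrator-aware weighting might look like the sketch below. The Jaccard blend and the `narrator_weight` value are illustrative assumptions, not the repository's actual scoring.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def crossref_score(vec_a, vec_b, narrators_a, narrators_b, narrator_weight=0.1):
    """Blend embedding similarity with narrator overlap; the weighting is illustrative."""
    sim = cosine(vec_a, vec_b)
    shared = set(narrators_a) & set(narrators_b)
    union = set(narrators_a) | set(narrators_b)
    jaccard = len(shared) / len(union) if union else 0.0
    return sim + narrator_weight * jaccard
```

Narrator overlap acts as a tie-breaker: two textually similar narrations with the same companion narrator rank above equally similar ones with disjoint chains.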
For a single-machine full build (import + embed + cross + provenance), run `build_db.py` once without `--skip-embed` / `--skip-cross` / `--skip-provenance`, and pass embedding pacing flags as needed (`python scripts/build_db.py --help`).
### 4) MCP server (stdio for Cursor / Claude Desktop)

From the repo root with `data/hadith.db` present (or set `HADITH_MCP_DB_PATH`):

```shell
hadith-mcp --transport stdio
# or: python -m hadith_mcp --transport stdio
# or: fastmcp run hadith_mcp.server:mcp
```
Optional `--config config.yml` sets `database.path` relative to the config file. `HADITH_MCP_DB_PATH` overrides both.
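The precedence just described (env var over config file over built-in default) can be sketched as follows; `resolve_db_path` and the config dict shape are hypothetical names for illustration.

```python
import os

def resolve_db_path(config_path=None, config=None):
    """Sketch of the precedence: HADITH_MCP_DB_PATH > config.yml database.path > default."""
    env = os.environ.get("HADITH_MCP_DB_PATH")
    if env:
        return env
    path = (config or {}).get("database", {}).get("path")
    if path:
        if os.path.isabs(path):
            return path
        # Relative database.path resolves against the config file's directory.
        base = os.path.dirname(os.path.abspath(config_path)) if config_path else os.getcwd()
        return os.path.join(base, path)
    return os.path.join("data", "hadith.db")
```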
HTTP / SSE / streamable HTTP (see the FastMCP docs for host/port env vars):

```shell
hadith-mcp --transport http
```
Tools (summary):

- `fetch_grounding_rules` returns the full text once per MCP session (then a short repeat unless `force_full=True`); pass the returned nonce only when you need to disambiguate errors.
- `fetch_hadith` accepts a global `hadith_id` or `collection` + `hadith_number` (int, or a string range like `1-5`), with optional `include_cross_references`. The `collection` argument is resolved through a forgiving slug matcher: canonical slugs (`bukhari`), common variants (`sahih-bukhari`, `Sahih al-Bukhari`, `sahih_bukhari`), and human names (`Sunan Abu Dawud`, `Musnad Ahmad`, `40 Hadith Nawawi`) all map to the same row.
- `search_hadith` defaults to semantic search (loads all embeddings at startup and embeds the query with the configured query embedding model); use `mode=keyword` for SQL substring search, or `mode=both`. Semantic search needs `OPENAI_API_KEY` and a database whose rows include embeddings. If OpenAI returns quota/billing/rate-limit errors (or the query vector size does not match the DB), the server falls back to keyword search instead of failing.
- `fetch_cross_references` returns algorithmic similarity matches across collections for a given hadith.
- `show_hadith` opens an interactive Hadith Reader MCP App in supported hosts (ChatGPT Developer Apps, Claude with app support, etc.). Prefer calling it with the canonical `hadith_id` returned by `fetch_hadith` / `search_hadith` / `fetch_cross_references`; `collection` + `hadith_number` and free-text `query` are supported fallbacks, and the tool always returns a plain-text fallback so non-App hosts still get a readable answer with the same citation URLs.

The top-level MCP instructions nudge assistants toward the two-step flow (look up first, then `show_hadith(hadith_id=…)`) to avoid guessing slugs or numbers from memory. Optional per-client rate limits and an LRU query cache reduce cost and abuse (see `config.yml` / `.env.example`).
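A forgiving slug matcher of the kind described could be sketched like this; the specific normalization rules are an illustrative guess, not the server's actual matcher.

```python
import re

def normalize_collection(name):
    """Illustrative forgiving matcher: fold case/punctuation, drop honorific prefixes."""
    # Lowercase and collapse any non-alphanumeric run into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # Drop honorific prefixes so "sahih-bukhari" and "bukhari" collapse together.
    slug = re.sub(r"^(sahih|sunan|jami|musnad)-", "", slug)
    # Fold the Arabic article: "al-bukhari" -> "bukhari".
    slug = slug.replace("-al-", "-")
    if slug.startswith("al-"):
        slug = slug[3:]
    return slug
```

With rules like these, `bukhari`, `sahih-bukhari`, `Sahih al-Bukhari`, and `sahih_bukhari` all normalize to one lookup key.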
**Citation URLs.** `fetch_hadith`, `search_hadith`, `fetch_cross_references`, and `show_hadith` attach a `url` field to each hadith or cross-reference row pointing at the search frontend (`https://search.hadith-mcp.org/?id=<db_id>` by default, overridable via `HADITH_SEARCH_APP_URL`). The server's MCP instructions tell assistants to always surface this link alongside citations and never to fabricate links to external hadith sites (sunnah.com, etc.).
**Interactive reader (`show_hadith`).** The tool binds to a `ui://hadith.html` resource served as `text/html;profile=mcp-app`. The HTML template lives at `src/hadith_mcp/assets/hadith_app.html` and is fully self-contained: inline CSS, inline app logic, system fonts only, zero CDN fetches, zero cross-origin iframes. The `@modelcontextprotocol/ext-apps` SDK is vendored under `src/hadith_mcp/assets/ext-apps.bundle.js` (rewritten to a classic script that attaches to `window.__hadithMcpSdk`) and spliced into the template at startup, so the whole widget ships as a single HTML document. Refresh the pinned SDK version with `python3 scripts/fetch_ext_apps.py --version <x.y.z>`.

The resource meta sets `ui.csp.resourceDomains = []` (no external origins at runtime) and intentionally omits `ui.domain`, because ChatGPT and Claude require incompatible formats for that field (ChatGPT wants any `https://…` URL; Claude requires a sha256-derived `*.claudemcpcontent.com` subdomain and errors with "App domain configuration is invalid" on anything else); both hosts work correctly when the field is omitted. Once mounted, the embedded app calls `fetch_hadith` and `search_hadith` over the MCP bridge (no extra HTTPS) to let users open cross-references and switch between detail and search views without LLM round-trips. A single-hadith `show_hadith` call renders pure card chrome (no search bar); calls with a `query` or no arguments render the search-bar + results UI.
### 5) Public search frontend and REST API

The repo ships a small static search app in `search/` and an HTTP REST surface on the same FastMCP process, intended to be deployed as two subdomains (e.g. search.hadith-mcp.org and api.hadith-mcp.org) with nginx / Caddy proxying `/api/*` to the FastMCP port.
- Frontend (`search/`): plain HTML/CSS/JS, no build step. Bootstraps from `?id=<db_id>` or `?q=<query>` on load, so the URLs MCP tools emit resolve directly. The API base defaults to `https://api.hadith-mcp.org`; override in the browser via `window.HADITH_API_BASE` (set before `script.js` loads) for local or staging deployments. `search/sitemap.xml` is a sitemap index; per-collection URL lists live under `search/sitemaps/`. Regenerate with `python3 scripts/generate_search_sitemap.py` after rebuilding the database.
- REST endpoints (same process, mounted via `@mcp.custom_route`):
  - `GET /api/collections` → `{collections: [...]}`
  - `GET /api/hadith/{hadith_id}` → `{hadith: {...}}`
  - `GET /api/hadith/{slug}/{id_in_book}` → `{hadith: {...}}`
  - `GET /api/search?q=&limit=&collection=` → `{results, mode, note}`; semantic by default, with the same keyword fallback behavior as the MCP tool. Shares `HADITH_MCP_RATE_LIMIT_SEARCH_RPM` and the query cache with MCP clients, so one budget covers both surfaces.
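A client can build a search request against the documented `/api/search` parameters like this; the `search_url` helper name is hypothetical, while the endpoint and query parameters come from the list above.

```python
from urllib.parse import urlencode

API_BASE = "https://api.hadith-mcp.org"  # default; point at staging/local as needed

def search_url(query, limit=10, collection=None):
    """Build a GET /api/search URL using the documented q / limit / collection params."""
    params = {"q": query, "limit": limit}
    if collection:
        params["collection"] = collection
    return f"{API_BASE}/api/search?{urlencode(params)}"
```

For example, `search_url("intention", limit=5, collection="bukhari")` yields a URL you can fetch with any HTTP client and parse as `{results, mode, note}`.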
## Configuration
- Secrets: `.env` is gitignored; see `.env.example` for `OPENAI_API_KEY` and MCP tuning (`HADITH_MCP_QUERY_EMBEDDING_MODEL`, `HADITH_MCP_RATE_LIMIT_SEARCH_RPM`, `HADITH_MCP_SEARCH_CACHE_MAX`).
- Hosted MCP: put `OPENAI_API_KEY` on the server only if you accept paying for query embeddings; tune `HADITH_MCP_RATE_LIMIT_SEARCH_RPM` (e.g. 30–120) and the cache size. The query model must match the dimension of the vectors stored in `hadith.db` (this repo's build uses `text-embedding-3-large` / 3072). A cheaper OpenAI model generally means rebuilding the database with that model so dimensions align.
- Artifacts: other `data/*.db` files and embedding checkpoint globs are gitignored; this repo tracks `data/hadith.db` (Git LFS) plus `data/SHA256SUMS` for verification (`cd data && sha256sum -c SHA256SUMS`).
- Embeddings: rows with an empty English narrator and text still embed using the Arabic text when present. Long inputs are clipped with tiktoken (`cl100k_base`) to stay under the 8192-token API limit, with a further shrink ladder if a row still hits length errors.
- Count rows without the `sqlite3` CLI:

  ```shell
  python -c "import sqlite3; c=sqlite3.connect('data/hadith.db'); print(c.execute('SELECT COUNT(*) FROM hadiths WHERE embedding IS NULL').fetchone()[0])"
  ```
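The clip-then-shrink behavior described under Embeddings can be sketched generically. The whitespace tokenizer here is a toy stand-in for tiktoken's `cl100k_base`, and the halving ladder is an assumption about how the "further shrink" step might work.

```python
def shrink_ladder(text, encode, decode, limit=8192, floor=512):
    """Yield progressively shorter clips of `text`: limit, limit // 2, ... down to floor.

    A caller would try the embedding API on each clip and stop at the first success.
    """
    tokens = encode(text)
    budget = limit
    while budget >= floor:
        yield decode(tokens[:budget])
        budget //= 2

# Toy whitespace tokenizer standing in for tiktoken's cl100k_base encode/decode.
def ws_encode(text):
    return text.split()

def ws_decode(tokens):
    return " ".join(tokens)
```

With the real tokenizer, the first clip already respects the 8192-token API limit; the halved retries only matter for rows the API still rejects on length grounds.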
## License

- Software in this repository (Python, scripts, and documentation we added) is licensed under the GNU General Public License v3.0 only (SPDX: `GPL-3.0-only`).
- Hadith text and other upstream material remain under their original terms (hadith-json, Sunnah.com). Our GPL applies to our code, not to a relicensing of that content; keep attribution and follow upstream rules when you redistribute data or excerpts.
## Releases and data integrity

- Checksum: `data/SHA256SUMS` lists `hadith.db`. After cloning or downloading the database, run `cd data && sha256sum -c SHA256SUMS`.
- Signing (optional): a detached GPG or Sigstore signature over the checksum file or the database proves who published the bytes and that they were not altered afterward. Signing does not certify the scholarly accuracy of every narration or automated cross-reference.
- Reproducibility: for audits or rebuilds, record the hadith-json revision, this repo's git revision, the embedding model id, and the script versions you used.
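On platforms without the `sha256sum` CLI, the same check can be done with a few lines of Python (reading in chunks so the database file is never loaded into memory at once); `verify_sha256` is a helper name for illustration.

```python
import hashlib

def verify_sha256(file_path, expected_hex):
    """Recompute SHA-256 in 1 MiB chunks and compare against the recorded digest."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

Pass it the path to `data/hadith.db` and the hex digest recorded for that file in `data/SHA256SUMS`.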
## Contributing

Issues and PRs welcome. Please keep diffs focused and match the existing style (ruff / pytest when present).