Lojban Semantic Search
local-first semantic search in Lojban dictionaries
Semantic Local MCP
A local-first MCP (Model Context Protocol) server for semantic search over your documents. Index text files (e.g. TSV, CSV, TXT) line-by-line, then search or filter by meaning using embeddings—all on your machine, no API keys required.
Use it in Cursor, Claude Code, or any IDE that supports MCP to search through dictionaries, glossaries, and corpora by semantic similarity.
Use cases
- Lojban (or any) dictionary: Index a TSV where each line is a word/definition. Find entries similar to a phrase or concept, or discover gaps—word combinations or concepts your dictionary doesn't cover yet.
- Glossaries & term bases: "Find entries that mean something like …" without exact keyword match.
- Corpora & line-based data: Any file where each line is a record (TSV, CSV, one-sentence-per-line TXT). Index once, query by meaning.
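To make the "each line is a record" idea concrete, here is a minimal sketch of splitting a line-based file into records. The `LineRecord` type and `splitIntoRecords` function are hypothetical names for illustration, not the server's actual code; the sample entries are ordinary Lojban gismu definitions used only as test data.

```typescript
// Illustrative only: one record per non-empty line, keeping the 1-based
// line number so a search hit can point back into the source file.
interface LineRecord {
  line: number;    // 1-based line number in the file
  content: string; // the line text, unchanged
}

function splitIntoRecords(text: string): LineRecord[] {
  const records: LineRecord[] = [];
  // Accept both Unix and Windows line endings.
  const lines = text.split(/\r?\n/);
  lines.forEach((raw, index) => {
    // Skip blank or whitespace-only lines; they carry no record.
    if (raw.trim().length > 0) {
      records.push({ line: index + 1, content: raw });
    }
  });
  return records;
}

// A tiny TSV sample with a blank line between the two entries.
const sample =
  "klama\tx1 comes/goes to destination x2 from origin x3\n" +
  "\n" +
  "gleki\tx1 is happy about x2\n";

console.log(splitIntoRecords(sample));
```

Blank lines are skipped but still counted, so the line numbers of later records match what you see in an editor.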
How it works
- Indexing: On startup, the server indexes content in the background. If SEMANTIC_SEARCH_INDEX_DIRS is set (comma-separated paths), it scans those directories. If it is not set, the server downloads the lojban/sampu_vlaste repository from GitHub and indexes that instead. In both cases, the server looks for .txt, .md, .tsv, and .csv files.
  - .txt, .tsv, .csv: each non-empty line is one record.
  - .md: chunked by paragraphs and blocks: merged multi-line > blockquotes (e.g. Lojban + glosses), whole HTML <table>...</table> blocks, and blank-line-separated prose (including consecutive list items). The latest ## / ### titles are prepended as Context: … on each chunk for better retrieval.

  Each chunk gets one embedding (via Hugging Face Transformers.js, model Xenova/all-MiniLM-L6-v2) and is stored in a local SQLite database with @dao-xyz/sqlite3-vec (SQLite + sqlite-vec for Node and browser). The line field in search results is the start line of that chunk in the file. After upgrading to a version that changes chunking, restart the server so files are re-indexed (mtime/content hash refresh).
- Search: You send a natural-language query; the server embeds it and returns the closest lines by cosine similarity.
- Storage: The index is stored in your project's .semantic-search/data/ (or set SEMANTIC_SEARCH_DATA_DIR). No cloud, no API keys.
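The search step above ranks stored chunks by cosine similarity to the query embedding. The sketch below shows that ranking as a brute-force scan over in-memory vectors. This is a stand-in for illustration: the server itself stores vectors in SQLite via sqlite-vec, and real embeddings from all-MiniLM-L6-v2 have 384 dimensions rather than the toy 3-dimensional vectors used here. The names `cosineSimilarity`, `EmbeddedChunk`, and `topK` are hypothetical.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory shape for an indexed chunk.
interface EmbeddedChunk {
  line: number;
  content: string;
  vector: number[];
}

// Score every chunk against the query and keep the `limit` best matches.
function topK(query: number[], chunks: EmbeddedChunk[], limit = 10) {
  return chunks
    .map((chunk) => ({
      line: chunk.line,
      content: chunk.content,
      score: cosineSimilarity(query, chunk.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Toy vectors; real ones would come from the embedding model.
const chunks: EmbeddedChunk[] = [
  { line: 1, content: "entry A", vector: [0.9, 0.1, 0] },
  { line: 2, content: "entry B", vector: [0, 1, 0] },
  { line: 3, content: "entry C", vector: [0.5, 0.5, 0] },
];
console.log(topK([1, 0, 0], chunks, 2));
```

A linear scan like this is fine for small dictionaries; a vector extension such as sqlite-vec exists so the same ranking can run inside the database.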
Requirements
- Node.js 18+ (20+ recommended)
- npm or pnpm
First run will download the embedding model (~80MB) and cache it locally.
Use in Cursor IDE
There is no build step and no need to run npm install yourself. The server runs only via npx tsx (TypeScript is run directly). Add a single command to MCP; on first run, npx will download the package and its dependencies, and the server will download the embedding model (~80MB) when you first index or search.
The package is published as @lojban/semantic-search-mcp. (To run from source before/without publishing, see the From source setup in the Development section.)
- Add the MCP server in Cursor:
  - Open Settings → Cursor Settings → MCP (or edit ~/.cursor/mcp.json).
  - Add:
{
  "mcpServers": {
    "semantic-search": {
      "command": "npx",
      "args": ["-y", "@lojban/semantic-search-mcp"]
    }
  }
}
No cwd needed: the server stores its index in your project directory (.semantic-search/data/), so open your project in Cursor and the index is per-workspace. To use a fixed data directory instead, add "env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }. To have the server index specific directories on startup, set "env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" } (comma-separated paths). If you omit SEMANTIC_SEARCH_INDEX_DIRS, the server will download and index the lojban/sampu_vlaste repo automatically.
- Restart Cursor (or reload the window). Indexing starts automatically in the background: from your configured SEMANTIC_SEARCH_INDEX_DIRS, or from the downloaded sampu_vlaste repo if that variable is not set.
- In chat or Composer, ask the AI to use the tools:
  - Search: "Use semantic-search tool: find combinations of words that can express the concept of …", "Use semantic-search tool: search the index for …" or "Use semantic-search tool: Find entries similar to …"
  - Stats: "use semantic-search mcp. run get_index_stats". Stats include progress and start time (locale-formatted) when indexing is in progress.
The AI will call search and get_index_stats for you.
Use in other AI IDEs (Claude Code, etc.)
Any environment that supports MCP over stdio can use this server. Run:
- One-liner: npx -y @lojban/semantic-search-mcp. Dependencies are installed on first run; the index is stored in the current working directory's .semantic-search/data/. Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS (comma-separated paths) to index those directories on startup; if unset, the server downloads and indexes lojban/sampu_vlaste from GitHub. Tools: search, get_index_stats.
- From source: Clone the repo, run npm install once, then use "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp" or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (no cwd needed with the latter). See MCP_SETUP.md for details.
MCP tools
| Tool | Description |
|---|---|
| search | Semantic search: query (string), optional limit (default 10). Returns file path, line number, content, and similarity score. |
| get_index_stats | Returns total number of indexed files and lines. When indexing is running in the background, also returns progress: indexing.started_at (locale-formatted), lines_indexed_so_far, files_indexed_so_far, and in_progress. |
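For client code that consumes these tools, the result shapes described in the table could be typed roughly as follows. This is a sketch under stated assumptions: only line, indexing.started_at, lines_indexed_so_far, files_indexed_so_far, and in_progress are named in this README. Every other field name, and the exact nesting of the progress fields, is a guess made for illustration and should be checked against real tool output.

```typescript
// One search hit. Only `line` is named in the README; the rest are assumed.
interface SearchHit {
  file: string;    // assumed name: path of the indexed file
  line: number;    // start line of the matched chunk in that file
  content: string; // assumed name: text of the matched chunk
  score: number;   // assumed name: similarity score
}

// Index statistics. `total_files` and `total_lines` are assumed names, and
// nesting all progress fields under `indexing` is also an assumption.
interface IndexStats {
  total_files: number;
  total_lines: number;
  indexing?: {
    started_at: string; // locale-formatted start time
    lines_indexed_so_far: number;
    files_indexed_so_far: number;
    in_progress: boolean;
  };
}

// Render stats as a one-line summary, adding progress only while indexing.
function describeStats(stats: IndexStats): string {
  const summary = `${stats.total_files} files, ${stats.total_lines} lines`;
  const progress = stats.indexing;
  if (!progress || !progress.in_progress) return summary;
  return (
    `${summary} (indexing since ${progress.started_at}: ` +
    `${progress.files_indexed_so_far} files, ` +
    `${progress.lines_indexed_so_far} lines so far)`
  );
}

console.log(describeStats({ total_files: 2, total_lines: 100 }));
```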
Indexing on startup
- With your own dirs: Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
- Default (no env set): If SEMANTIC_SEARCH_INDEX_DIRS is not set, the server downloads the lojban/sampu_vlaste repository from GitHub (as a zip), extracts it under .semantic-search/sampu_vlaste/, and indexes that. The download is cached; subsequent starts reuse the cached copy.
The index is cleared and rebuilt each time the server starts. Use absolute paths or paths relative to the server's working directory when setting SEMANTIC_SEARCH_INDEX_DIRS. The server reads and indexes all supported .txt, .md, .tsv, .csv files under each directory recursively. Indexing uses bounded memory and yields to the event loop so the OS stays responsive.
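As a rough illustration of how a comma-separated SEMANTIC_SEARCH_INDEX_DIRS value and the supported-extension rule could be handled, consider the sketch below. The function names are hypothetical, and details such as trimming whitespace around entries or matching extensions case-insensitively are assumptions rather than documented server behavior.

```typescript
import * as path from "node:path";

// Extensions the README lists as supported.
const SUPPORTED_EXTENSIONS = new Set([".txt", ".md", ".tsv", ".csv"]);

// Split the env value on commas and resolve each entry against `cwd`,
// so both absolute and relative paths work. Empty entries are dropped.
function parseIndexDirs(raw: string | undefined, cwd: string): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => path.resolve(cwd, entry));
}

// Decide whether a file should be indexed, based on its extension.
// Lowercasing the extension is an assumption made for this sketch.
function isSupportedFile(fileName: string): boolean {
  return SUPPORTED_EXTENSIONS.has(path.extname(fileName).toLowerCase());
}

console.log(parseIndexDirs("./dictionary, ./glossary", process.cwd()));
console.log(isSupportedFile("jbo-eng.tsv"));
```

Resolving against the working directory mirrors the note above: relative entries are interpreted relative to wherever the server process was started.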
Example: Lojban dictionary gaps
- Put your dictionary TSV (e.g. jbo-eng.tsv) in a folder (e.g. ./dictionary).
- Set SEMANTIC_SEARCH_INDEX_DIRS=./dictionary in your MCP config (or in the environment). Restart the server; indexing runs in the background.
- In Cursor: "Search for entries similar to 'to cause to become warm' and limit 20."
- Or: "Search for 'emotional state of joy' and show me what we have; then suggest word combinations the dictionary might be missing."
The index is stored in .semantic-search/data/vectors.db under your project root (or under SEMANTIC_SEARCH_DATA_DIR if you set it). Restart the server to re-index when you add or change files.
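For the gap-finding workflow in this example, one simple heuristic is to treat a concept as a possible gap when even its best search hit scores low. The sketch below is not a feature of the server: `looksLikeGap` is a hypothetical helper, and the 0.5 threshold is an arbitrary illustrative value that would need calibrating against real scores from your own dictionary.

```typescript
// Minimal shape of a hit for this heuristic; field names are assumed.
interface ScoredHit {
  content: string;
  score: number; // similarity score, higher means closer in meaning
}

// A query "looks like a gap" when there are no hits at all, or when the
// best hit still scores below the chosen threshold.
function looksLikeGap(hits: ScoredHit[], threshold = 0.5): boolean {
  if (hits.length === 0) return true;
  const best = Math.max(...hits.map((hit) => hit.score));
  return best < threshold;
}

// Made-up hits for illustration only; not real server output.
const exampleHits: ScoredHit[] = [
  { content: "hypothetical entry 1", score: 0.31 },
  { content: "hypothetical entry 2", score: 0.27 },
];
console.log(looksLikeGap(exampleHits));
```

A low best score only suggests a gap. The concept may still be covered by an entry whose wording the embedding model handles poorly, so candidates are worth reviewing by hand.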
Development
The server is not built to JavaScript; it runs via npx tsx src/index.ts or node run.mjs. No tsc or node dist/ usage.
From source (e.g. before publishing to npm):
- Run npm install once in the repo.
- In MCP config use either:
  - "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp", or
  - "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (run.mjs sets cwd automatically; see MCP_SETUP.md).
To run the server from the repo: npm run dev or npx tsx src/index.ts.
Run tests: npm test.
License
MIT