Lojban Semantic Search
local-first semantic search in Lojban dictionaries
Semantic Local MCP
A local-first MCP (Model Context Protocol) server for semantic search over your documents. Index text files (e.g. TSV, CSV, TXT) line-by-line, then search or filter by meaning using embeddings—all on your machine, no API keys required.
Use it in Cursor, Claude Code, or any IDE that supports MCP to search through dictionaries, glossaries, and corpora by semantic similarity.
Use cases
- Lojban (or any) dictionary: Index a TSV where each line is a word/definition. Find entries similar to a phrase or concept, or discover gaps—word combinations or concepts your dictionary doesn't cover yet.
- Glossaries & term bases: "Find entries that mean something like …" without exact keyword match.
- Corpora & line-based data: Any file where each line is a record (TSV, CSV, one-sentence-per-line TXT). Index once, query by meaning.
How it works
- Indexing: On startup, the server indexes content in the background. If SEMANTIC_SEARCH_INDEX_DIRS is set (comma-separated paths), it scans those directories. If it is not set, the server downloads the lojban/sampu_vlaste repository from GitHub and indexes that instead. In both cases, the server looks for .txt, .md, .tsv, .csv files.
  - .txt, .tsv, .csv: each non-empty line is one record.
  - .md: chunks by paragraphs and blocks: merged multi-line > blockquotes (e.g. Lojban + glosses), whole HTML <table>...</table> blocks, and blank-line-separated prose (including consecutive list items). Latest ##/### titles are prepended as Context: … on each chunk for better retrieval.

  Each chunk gets one embedding (via Hugging Face Transformers.js, model Xenova/all-MiniLM-L6-v2) and is stored in a local SQLite database with @dao-xyz/sqlite3-vec (SQLite + sqlite-vec for Node and browser). The line field in search results is the start line of that chunk in the file. After upgrading to a version that changes chunking, restart the server so files are re-indexed (mtime/content hash refresh).
- Search: You send a natural-language query; the server embeds it and returns the closest lines by cosine similarity.
- Storage: Index is stored in your project's .semantic-search/data/ (or set SEMANTIC_SEARCH_DATA_DIR). No cloud, no API keys.
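The Markdown chunking described above can be sketched roughly as follows. This is a simplified illustration, not the server's actual code: the Chunk shape and the chunkMarkdown name are hypothetical, HTML table blocks get no special handling, and multi-line blockquotes stay merged only because they contain no blank lines.

```typescript
// Hypothetical chunk shape; the real server's types may differ.
interface Chunk {
  line: number;    // 1-based start line of the chunk in the file
  content: string; // chunk text, prefixed with "Context: ..." when a heading is known
}

function chunkMarkdown(text: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";          // latest ## / ### title seen so far
  let buffer: string[] = []; // lines of the chunk being built
  let startLine = 0;         // 1-based line where the current chunk began

  // Emit the buffered lines as one chunk, then reset the buffer.
  const flush = (): void => {
    if (buffer.length === 0) return;
    const body = buffer.join("\n");
    chunks.push({
      line: startLine,
      content: heading ? `Context: ${heading}\n${body}` : body,
    });
    buffer = [];
  };

  text.split(/\r?\n/).forEach((raw, index) => {
    const headingMatch = /^#{2,3}\s+(.*)$/.exec(raw);
    if (headingMatch) {
      flush(); // a heading closes the previous chunk
      heading = headingMatch[1].trim();
    } else if (raw.trim() === "") {
      flush(); // a blank line separates chunks
    } else {
      if (buffer.length === 0) startLine = index + 1;
      buffer.push(raw);
    }
  });
  flush();
  return chunks;
}

const sample = "## gismu\n> mi klama\n> I go\n\nplain text";
console.log(chunkMarkdown(sample));
```

Each returned chunk would then receive a single embedding, and its line value is what a search result reports.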
Requirements
- Node.js 18+ (20+ recommended)
- npm or pnpm
First run will download the embedding model (~80MB) and cache it locally.
Use in Cursor IDE
There is no build step and no need to run npm install yourself. The server runs only via npx tsx (TypeScript is run directly). Add a single command to MCP; on first run, npx will download the package and its dependencies, and the server will download the embedding model (~80MB) when you first index or search.
The package is published as @lojban/semantic-search-mcp. (To run from source before/without publishing, see the From source setup in the Development section.)
- Add the MCP server in Cursor:
  - Open Settings → Cursor Settings → MCP (or edit ~/.cursor/mcp.json).
  - Add:
{
"mcpServers": {
"semantic-search": {
"command": "npx",
"args": ["-y", "@lojban/semantic-search-mcp"]
}
}
}
No cwd needed: the server stores its index in your project directory (.semantic-search/data/), so open your project in Cursor and the index is per-workspace. To use a fixed data directory instead, add "env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }. To have the server index specific directories on startup, set "env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" } (comma-separated paths). If you omit SEMANTIC_SEARCH_INDEX_DIRS, the server will download and index the lojban/sampu_vlaste repo automatically.
- Restart Cursor (or reload the window). Indexing starts automatically in the background: from your configured SEMANTIC_SEARCH_INDEX_DIRS, or from the downloaded sampu_vlaste repo if that env is not set.
- In chat or Composer, ask the AI to use the tools:
- Search: "Use semantic-search tool: find combinations of words that can express the concept of …", "Use semantic-search tool: search the index for …" or "Use semantic-search tool: Find entries similar to …"
- Stats: "use semantic-search mcp. run get_index_stats" — stats include progress and start time (locale-formatted) when indexing is in progress.
The AI will call search and get_index_stats for you.
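If you prefer to generate the config entry rather than hand-edit it, a small script along these lines could produce the JSON shown above with the optional env block filled in. The buildServerEntry helper and the McpServerEntry type are hypothetical names for illustration; only the command, args, and the two environment variable names come from this README.

```typescript
// Shape of one server entry in mcp.json, as used in this README.
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

// Hypothetical helper: builds the entry, adding env only when something is set.
function buildServerEntry(indexDirs: string[], dataDir?: string): McpServerEntry {
  const env: Record<string, string> = {};
  if (indexDirs.length > 0) env.SEMANTIC_SEARCH_INDEX_DIRS = indexDirs.join(",");
  if (dataDir) env.SEMANTIC_SEARCH_DATA_DIR = dataDir;

  const entry: McpServerEntry = {
    command: "npx",
    args: ["-y", "@lojban/semantic-search-mcp"],
  };
  if (Object.keys(env).length > 0) entry.env = env;
  return entry;
}

const config = {
  mcpServers: {
    "semantic-search": buildServerEntry(["./dictionary", "./glossary"]),
  },
};

// Paste the printed JSON into ~/.cursor/mcp.json.
console.log(JSON.stringify(config, null, 2));
```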
Use in other AI IDEs (Claude Code, etc.)
Any environment that supports MCP over stdio can use this server. Run:
- One-liner: npx -y @lojban/semantic-search-mcp
  Dependencies are installed on first run; the index is stored in the current working directory's .semantic-search/data/. Set env SEMANTIC_SEARCH_INDEX_DIRS (comma-separated paths) to index those directories on startup; if unset, the server downloads and indexes lojban/sampu_vlaste from GitHub. Tools: search, get_index_stats.
- From source: Clone the repo, run npm install once, then use "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp" or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (no cwd needed with the latter). See MCP_SETUP.md for details.
MCP tools
| Tool | Description |
|---|---|
| search | Semantic search: query (string), optional limit (default 10). Returns file path, line number, content, and similarity score. |
| get_index_stats | Returns total number of indexed files and lines. When indexing is running in the background, also returns progress: indexing.started_at (locale-formatted), lines_indexed_so_far, files_indexed_so_far, and in_progress. |
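To make the search tool's behaviour concrete, here is a minimal brute-force sketch of ranking records by cosine similarity with the default limit of 10. It is an illustration under assumptions: the IndexedRecord and SearchResult field names and the rankBySimilarity function are hypothetical, the toy vectors stand in for real model embeddings, and the actual server stores vectors in SQLite via sqlite-vec rather than scanning an in-memory array.

```typescript
// Hypothetical record and result shapes; field names are illustrative.
interface IndexedRecord {
  file: string;
  line: number;
  content: string;
  embedding: number[];
}

interface SearchResult {
  file: string;
  line: number;
  content: string;
  score: number; // cosine similarity in [-1, 1]
}

// cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every record against the query and keep the `limit` best matches.
function rankBySimilarity(
  queryEmbedding: number[],
  records: IndexedRecord[],
  limit = 10,
): SearchResult[] {
  return records
    .map(({ file, line, content, embedding }) => ({
      file,
      line,
      content,
      score: cosineSimilarity(queryEmbedding, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Toy 2-dimensional "embeddings" in place of real model output.
const records: IndexedRecord[] = [
  { file: "jbo-eng.tsv", line: 1, content: "entry A", embedding: [1, 0] },
  { file: "jbo-eng.tsv", line: 2, content: "entry B", embedding: [0, 1] },
  { file: "jbo-eng.tsv", line: 3, content: "entry C", embedding: [1, 1] },
];
console.log(rankBySimilarity([1, 0], records, 2));
```

With the toy data, entry A ranks first (identical direction to the query) and entry C second; entry B is orthogonal to the query and falls outside the limit of 2.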
Indexing on startup
- With your own dirs: Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
- Default (no env set): If SEMANTIC_SEARCH_INDEX_DIRS is not set, the server downloads the lojban/sampu_vlaste repository from GitHub (as a zip), extracts it under .semantic-search/sampu_vlaste/, and indexes that. The download is cached; subsequent starts reuse the cached copy.
The index is cleared and rebuilt each time the server starts. Use absolute paths or paths relative to the server's working directory when setting SEMANTIC_SEARCH_INDEX_DIRS. The server reads and indexes all supported .txt, .md, .tsv, .csv files under each directory recursively. Indexing uses bounded memory and yields to the event loop so the OS stays responsive.
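The directory scan described above might look roughly like this. It is a simplified, synchronous sketch: the helper names (parseIndexDirs, listSupportedFiles, lineRecords) are hypothetical, and unlike the real server it neither bounds memory nor yields to the event loop. Markdown files are listed but not split here, since they are chunked by paragraph rather than by line.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { extname, join } from "node:path";

const SUPPORTED = new Set([".txt", ".md", ".tsv", ".csv"]);

// Split the comma-separated env value into trimmed, non-empty paths.
function parseIndexDirs(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Recursively collect files with a supported extension.
function listSupportedFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...listSupportedFiles(full));
    } else if (entry.isFile() && SUPPORTED.has(extname(entry.name).toLowerCase())) {
      found.push(full);
    }
  }
  return found;
}

// One record per non-empty line, keeping the 1-based line number.
function lineRecords(text: string): { line: number; content: string }[] {
  return text
    .split(/\r?\n/)
    .map((content, i) => ({ line: i + 1, content }))
    .filter((r) => r.content.trim() !== "");
}

// Does nothing when SEMANTIC_SEARCH_INDEX_DIRS is unset.
for (const dir of parseIndexDirs(process.env.SEMANTIC_SEARCH_INDEX_DIRS)) {
  for (const file of listSupportedFiles(dir)) {
    if (extname(file).toLowerCase() === ".md") continue; // chunked differently
    const records = lineRecords(readFileSync(file, "utf8"));
    console.log(`${file}: ${records.length} records`);
  }
}
```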
Example: Lojban dictionary gaps
- Put your dictionary TSV (e.g. jbo-eng.tsv) in a folder (e.g. ./dictionary).
- Set SEMANTIC_SEARCH_INDEX_DIRS=./dictionary in your MCP config (or in the environment). Restart the server; indexing runs in the background.
- In Cursor: "Search for entries similar to 'to cause to become warm' and limit 20."
- Or: "Search for 'emotional state of joy' and show me what we have; then suggest word combinations the dictionary might be missing."
The index is stored in .semantic-search/data/vectors.db (or your project root). Restart the server to re-index when you add or change files.
Development
The server is not built to JavaScript; it runs via npx tsx src/index.ts or node run.mjs. No tsc or node dist/ usage.
From source (e.g. before publishing to npm):
- Run npm install once in the repo.
- In MCP config use either: "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp", or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (run.mjs sets cwd automatically; see MCP_SETUP.md).
To run the server from the repo: npm run dev or npx tsx src/index.ts.
Run tests: npm test.
License
MIT