arxiv-mcp-server Server
arXiv paper search and full-text reading
@cyanheads/arxiv-mcp-server
Search arXiv, fetch paper metadata, and read full-text content via MCP. STDIO or Streamable HTTP.
Public Hosted Server: https://arxiv.caseyjhand.com/mcp
Tools
Four tools for searching and reading arXiv papers:
| Tool Name | Description |
|---|---|
| `arxiv_search` | Search arXiv papers by query with category and sort filters. |
| `arxiv_get_metadata` | Get full metadata for one or more arXiv papers by ID. |
| `arxiv_read_paper` | Fetch the full text content of an arXiv paper from its HTML rendering. |
| `arxiv_list_categories` | List arXiv category taxonomy, optionally filtered by group. |
arxiv_search
Search for papers using free-text queries with field prefixes and boolean operators.
- Field prefixes: `ti:` (title), `au:` (author), `abs:` (abstract), `cat:` (category), `all:` (all fields)
- Boolean operators: `AND`, `OR`, `ANDNOT`
- Optional category filter, sorting (relevance, submitted, updated), and pagination
- Returns up to 50 results per request with full metadata including abstract
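The query syntax above can be sketched as a small string builder. This is only an illustration: `buildQuery`, `formatTerm`, and the `Term` type are hypothetical names, not part of the server, and quoting multi-word values as phrases is an assumption based on the arXiv API's phrase syntax.

```typescript
// Hypothetical helper: composes a query string for arxiv_search
// from field-prefixed terms joined by boolean operators.
type Field = "ti" | "au" | "abs" | "cat" | "all";
type Operator = "AND" | "OR" | "ANDNOT";

interface Term {
  field: Field;
  value: string;
}

// Wrap multi-word values in double quotes so they read as one phrase (assumption).
function formatTerm({ field, value }: Term): string {
  const trimmed = value.trim();
  return /\s/.test(trimmed) ? `${field}:"${trimmed}"` : `${field}:${trimmed}`;
}

// Joins the first term with each (operator, term) pair, left to right.
function buildQuery(first: Term, rest: Array<[Operator, Term]> = []): string {
  return rest.reduce(
    (query, [op, term]) => `${query} ${op} ${formatTerm(term)}`,
    formatTerm(first),
  );
}

const query = buildQuery({ field: "ti", value: "diffusion models" }, [
  ["AND", { field: "cat", value: "cs.LG" }],
  ["ANDNOT", { field: "au", value: "smith" }],
]);
console.log(query);
// ti:"diffusion models" AND cat:cs.LG ANDNOT au:smith
```

The resulting string would be passed as the tool's free-text query; the category filter and sort order remain separate tool parameters.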
arxiv_get_metadata
Fetch full metadata for one or more papers by known arXiv ID.
- Batch fetch up to 10 papers in a single request
- Accepts both versioned (`2401.12345v2`) and unversioned (`2401.12345`) IDs
- Legacy ID format supported (`hep-th/9901001`)
- Reports not-found IDs separately from found papers
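To make the accepted ID shapes concrete, here is a minimal sketch of client-side ID handling. The regular expressions and the `parseArxivId` and `toBatches` helpers are assumptions for illustration; the server's own validation may differ.

```typescript
// Hypothetical helpers: classify arXiv IDs and split them into batches of 10.
interface ParsedArxivId {
  base: string; // ID without the version suffix
  version: number | undefined; // undefined when the ID is unversioned
  legacy: boolean; // true for pre-2007 IDs such as hep-th/9901001
}

// Modern IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
const MODERN_ID = /^(\d{4}\.\d{4,5})(?:v(\d+))?$/;
// Legacy IDs: archive[.SC]/YYMMNNN, optionally followed by vN.
const LEGACY_ID = /^([a-z-]+(?:\.[A-Z]{2})?\/\d{7})(?:v(\d+))?$/;

function parseArxivId(raw: string): ParsedArxivId | null {
  const id = raw.trim();
  const modern = MODERN_ID.exec(id);
  const match = modern ?? LEGACY_ID.exec(id);
  if (!match) return null;
  const version = match[2] === undefined ? undefined : Number(match[2]);
  return { base: match[1] ?? id, version, legacy: modern === null };
}

// Split a list into chunks no larger than the per-request limit.
function toBatches<T>(items: readonly T[], size = 10): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

console.log(parseArxivId("2401.12345v2")); // base "2401.12345", version 2, legacy false
console.log(parseArxivId("hep-th/9901001")); // base "hep-th/9901001", version undefined, legacy true
console.log(parseArxivId("not-an-id")); // null
```

Validating IDs before calling `arxiv_get_metadata` keeps malformed input out of the batch; IDs that are well-formed but unknown are still reported by the tool as not found.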
arxiv_read_paper
Read the full HTML content of an arXiv paper.
- Tries native arXiv HTML first, falls back to ar5iv for broader coverage
- Strips HTML head/boilerplate and collapses MathML to dollar-delimited LaTeX (`$…$` inline, `$$…$$` block) so the character budget targets paper content
- `max_characters` defaults to 100,000; raw HTML can be 500KB-3MB+ for math-heavy papers
- Returns raw HTML with no parsing or extraction; the LLM interprets content directly
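The MathML collapsing step can be illustrated with a rough sketch. It assumes each `<math>` element carries its LaTeX source in an `alttext` attribute and a `display` attribute of `inline` or `block`, which is how LaTeXML-generated HTML is typically structured. The `collapseMath` and `truncate` functions are hypothetical and regex-based for brevity; they are not the server's implementation.

```typescript
// Matches one <math ...>...</math> element and captures its attribute string.
const MATH_ELEMENT = /<math\b([^>]*)>[\s\S]*?<\/math>/g;

// Reads a double-quoted attribute value from a raw attribute string.
function readAttr(attrs: string, name: string): string | undefined {
  const match = new RegExp(`\\b${name}="([^"]*)"`).exec(attrs);
  return match?.[1];
}

// Replaces each <math> element with its LaTeX source in dollar delimiters.
// Note: HTML entities inside alttext (e.g. &lt;) are not decoded in this sketch.
function collapseMath(html: string): string {
  return html.replace(MATH_ELEMENT, (whole: string, attrs: string) => {
    const tex = readAttr(attrs, "alttext");
    if (tex === undefined) return whole; // no LaTeX source: leave the element as-is
    return readAttr(attrs, "display") === "block" ? `$$${tex}$$` : `$${tex}$`;
  });
}

// Applies the character budget after the markup has been reduced.
function truncate(text: string, maxCharacters = 100_000): string {
  return text.length <= maxCharacters ? text : text.slice(0, maxCharacters);
}

const sample =
  '<p>Energy <math alttext="E=mc^{2}" display="inline"><mi>E</mi></math> holds.</p>';
console.log(truncate(collapseMath(sample)));
// <p>Energy $E=mc^{2}$ holds.</p>
```

Collapsing before truncating is the point of the design described above: a single MathML expression can span hundreds of characters of markup, so reducing it first lets the budget cover more of the paper.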
arxiv_list_categories
List arXiv category codes and names for discovery.
- ~155 categories across 8 top-level groups (cs, math, physics, q-bio, q-fin, stat, eess, econ)
- Optional group filter to narrow results
- Static data — always succeeds
Resources
| URI Pattern | Description |
|---|---|
| `arxiv://paper/{paperId}` | Paper metadata by arXiv ID. |
| `arxiv://categories` | Full arXiv category taxonomy. |
Features
Built on @cyanheads/mcp-ts-core:
- Declarative tool definitions — single file per tool, framework handles registration and validation
- Unified error handling across all tools
- Pluggable auth (`none`, `jwt`, `oauth`)
- Structured logging with optional OpenTelemetry tracing
- Runs locally (stdio/HTTP) from the same codebase
arXiv-specific:
- Read-only, no authentication required — arXiv API is free, metadata is CC0
- Rate-limited request queue enforcing arXiv's 3-second crawl delay
- Adaptive cooldown on rate-limit (5s → 10s → 20s → 30s), honors `Retry-After`
- Retry with exponential backoff for transient failures
- HTML content fallback chain: native arXiv HTML → ar5iv
- Full arXiv category taxonomy embedded as static data
- Optional local OAI-PMH metadata mirror (SQLite + FTS5): opt-in, eliminates rate-limit exposure for `arxiv_search` and `arxiv_get_metadata`. See Optional: Local Mirror.
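As an illustration of the adaptive cooldown listed above, the following sketch maps consecutive rate-limit responses onto the 5s → 10s → 20s → 30s schedule. How the server combines the schedule with `Retry-After` is not documented here; taking the larger of the two is an assumption, and `cooldownMs` and `parseRetryAfterMs` are hypothetical names.

```typescript
// Cooldown schedule from the feature list, in milliseconds.
const COOLDOWN_STEPS_MS: readonly number[] = [5_000, 10_000, 20_000, 30_000];

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfterMs(
  header: string | null | undefined,
  now: number = Date.now(),
): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  return Number.isNaN(date) ? undefined : Math.max(0, date - now);
}

// consecutiveRateLimits counts rate-limit responses in a row (1 for the first).
function cooldownMs(
  consecutiveRateLimits: number,
  retryAfterHeader?: string | null,
): number {
  const step = Math.min(Math.max(consecutiveRateLimits, 1), COOLDOWN_STEPS_MS.length);
  const scheduledMs = COOLDOWN_STEPS_MS[step - 1] ?? 30_000;
  const retryAfterMs = parseRetryAfterMs(retryAfterHeader);
  // Assumption: never wait less than the server-provided Retry-After.
  return retryAfterMs === undefined ? scheduledMs : Math.max(scheduledMs, retryAfterMs);
}

console.log(cooldownMs(1)); // 5000
console.log(cooldownMs(3)); // 20000
console.log(cooldownMs(9)); // 30000 (capped at the last step)
console.log(cooldownMs(1, "12")); // 12000 (Retry-After exceeds the schedule)
```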
Getting Started
Public Hosted Instance
A public instance is available at https://arxiv.caseyjhand.com/mcp — no installation required. Point any MCP client at it via Streamable HTTP:
{
  "mcpServers": {
    "arxiv": {
      "type": "streamable-http",
      "url": "https://arxiv.caseyjhand.com/mcp"
    }
  }
}
Self-Hosted / Local
Add to your MCP client config (e.g., claude_desktop_config.json):
{
  "mcpServers": {
    "arxiv": {
      "type": "stdio",
      "command": "bunx",
      "args": ["@cyanheads/arxiv-mcp-server@latest"]
    }
  }
}
Prerequisites
- Bun v1.3.0 or higher.
Installation
- Clone the repository:
git clone https://github.com/cyanheads/arxiv-mcp-server.git
- Navigate into the directory:
cd arxiv-mcp-server
- Install dependencies:
bun install
Configuration
All configuration is optional — the server works out of the box with sensible defaults.
| Variable | Description | Default |
|---|---|---|
| `ARXIV_API_BASE_URL` | arXiv API base URL. | `https://export.arxiv.org/api` |
| `ARXIV_REQUEST_DELAY_MS` | Minimum delay between arXiv API requests (ms). | `3000` |
| `ARXIV_CONTENT_TIMEOUT_MS` | Timeout for HTML content fetches (ms). | `30000` |
| `ARXIV_API_TIMEOUT_MS` | Timeout for API search/metadata requests (ms). | `15000` |
| `ARXIV_MIRROR_ENABLED` | Enable local OAI-PMH metadata mirror for search and metadata. | `false` |
| `ARXIV_MIRROR_PATH` | SQLite path for the mirror. | `./data/arxiv-mirror.db` |
| `ARXIV_MIRROR_REFRESH_CRON` | UTC cron expression for in-process daily refresh (HTTP mode only). | unset |
| `ARXIV_MIRROR_FALLBACK_LIVE` | Fall through to live API on local ID-lookup miss. | `true` |
| `ARXIV_MIRROR_RECENT_DAYS_LIVE` | Route `sortBy=submitted` descending queries within this window to the live API. | `2` |
| `ARXIV_MIRROR_OAI_BASE_URL` | arXiv OAI-PMH endpoint base URL. | `https://oaipmh.arxiv.org/oai` |
| `ARXIV_MIRROR_OAI_REQUEST_DELAY_MS` | Minimum delay between OAI-PMH requests (ms). | `3000` |
| `MCP_TRANSPORT_TYPE` | Transport: `stdio` or `http`. | `stdio` |
| `MCP_HTTP_PORT` | Port for HTTP server. | `3010` |
| `MCP_AUTH_MODE` | Auth mode: `none`, `jwt`, or `oauth`. | `none` |
| `MCP_LOG_LEVEL` | Log level (RFC 5424). | `info` |
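To show how a few of these defaults resolve, here is a dependency-free sketch of environment parsing. The project itself validates configuration with Zod; `loadConfig`, `intFromEnv`, and the `ServerConfig` shape below are hypothetical, cover only a subset of the variables, and assume that booleans are enabled by the literal string `true`.

```typescript
// Hypothetical subset of the server configuration, using defaults from the table above.
interface ServerConfig {
  apiBaseUrl: string;
  requestDelayMs: number;
  contentTimeoutMs: number;
  apiTimeoutMs: number;
  mirrorEnabled: boolean;
  transport: "stdio" | "http";
  httpPort: number;
}

type Env = Record<string, string | undefined>;

// Parses a non-negative integer, falling back to the default when unset or blank.
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): ServerConfig {
  const transport = env["MCP_TRANSPORT_TYPE"] ?? "stdio";
  if (transport !== "stdio" && transport !== "http") {
    throw new Error(`MCP_TRANSPORT_TYPE must be "stdio" or "http", got "${transport}"`);
  }
  return {
    apiBaseUrl: env["ARXIV_API_BASE_URL"] ?? "https://export.arxiv.org/api",
    requestDelayMs: intFromEnv(env, "ARXIV_REQUEST_DELAY_MS", 3000),
    contentTimeoutMs: intFromEnv(env, "ARXIV_CONTENT_TIMEOUT_MS", 30000),
    apiTimeoutMs: intFromEnv(env, "ARXIV_API_TIMEOUT_MS", 15000),
    mirrorEnabled: env["ARXIV_MIRROR_ENABLED"] === "true", // assumption: only "true" enables
    transport,
    httpPort: intFromEnv(env, "MCP_HTTP_PORT", 3010),
  };
}

const config = loadConfig({ MCP_TRANSPORT_TYPE: "http", MCP_HTTP_PORT: "8080" });
console.log(config.transport, config.httpPort, config.requestDelayMs);
// http 8080 3000
```

In a real process the argument would be `process.env`; passing a plain object keeps the sketch easy to test.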
Running the Server
Local Development
- Build and run:
bun run build
bun run start:http # or start:stdio
- Run checks and tests:
bun run devcheck # Lint, format, typecheck, audit
bun run test # Vitest
Optional: Local Mirror
For self-hosted deployments behind a single egress IP, arXiv's ~3-second per-IP crawl delay serializes concurrent users. An optional local mirror eliminates rate-limit exposure for arxiv_search and arxiv_get_metadata by serving from a SQLite + FTS5 store harvested via OAI-PMH. arxiv_read_paper continues to use the live API — full-content harvest is forbidden by arXiv's data policy.
Disabled by default. To enable:
# 1. Cold-start harvest (~4.4h sequential, resumable from checkpoint). One-time per installation.
bun run mirror:init
# 2. Enable the mirror.
export ARXIV_MIRROR_ENABLED=true
# 3. Start the server — reads switch to the mirror once the harvest completes.
bun run start:http
Daily incremental refresh (~30s, <1 page) via:
bun run mirror:refresh # wire to cron / systemd timer / launchd, OR
# set ARXIV_MIRROR_REFRESH_CRON in HTTP mode to schedule in-process
bun run mirror:verify # PRAGMA integrity_check + quick_check
Behavior notes:
- Ranking divergence: FTS5 BM25 differs from arXiv's internal ranking, so `sortBy=relevance` against the mirror returns a different top-K than the live API.
- Recent submissions: queries sorted by submitted descending within `ARXIV_MIRROR_RECENT_DAYS_LIVE` days route to the live API to cover the nightly-update gap.
- Refresh resilience: after the initial cold harvest completes, an in-progress or failed daily refresh keeps serving the existing dataset from the mirror; `arxiv_search` and `arxiv_get_metadata` don't drop to the live API during the refresh window (#21).
- Versions: the mirror stores the latest version only; per-version reads continue to use the live API.
See #12 for the full design.
Docker
docker build -t arxiv-mcp-server .
docker run -p 3010:3010 arxiv-mcp-server
Project Structure
| Directory | Purpose |
|---|---|
| `src/mcp-server/tools/definitions/` | Tool definitions (`*.tool.ts`). |
| `src/mcp-server/resources/definitions/` | Resource definitions (`*.resource.ts`). |
| `src/services/arxiv/` | `ArxivService`: live arXiv API client (search, metadata, HTML). |
| `src/services/arxiv/mirror/` | Optional OAI-PMH mirror: harvester, SQLite + FTS5 store, query translator, runner. |
| `src/config/` | Environment variable parsing and validation with Zod. |
| `scripts/arxiv-mirror-*.ts` | Mirror lifecycle scripts (init, refresh, verify). |
| `tests/` | Unit and integration tests. |
| `docs/` | Design document and directory structure. |
Development Guide
See CLAUDE.md for development guidelines and architectural rules. The short version:
- Handlers throw, framework catches: no `try/catch` in tool logic
- Use `ctx.log` for domain-specific logging
- Rate limiting is managed by `ArxivService`; don't add per-tool delays
- arXiv API returns HTTP 200 for everything, so check content-type and response body
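The first and last rules above can be sketched together: a validation function that inspects the content-type and body, and throws instead of returning an error value. `ArxivApiError` and `assertAtomFeed` are hypothetical names, and the error-entry pattern assumes arXiv reports errors as an Atom entry whose `<id>` points at `http://arxiv.org/api/errors#...`, as described in the arXiv API manual.

```typescript
// Hypothetical error type; a framework-level handler would catch it.
class ArxivApiError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ArxivApiError";
  }
}

// Assumption: error feeds contain an entry ID under arxiv.org/api/errors#<reason>.
const ERROR_ENTRY_ID = /<id>\s*https?:\/\/arxiv\.org\/api\/errors#([^<\s]*)/;

// Returns the body unchanged when it looks like a valid Atom feed, otherwise throws.
// A 200 status alone is not treated as success.
function assertAtomFeed(contentType: string | null, body: string): string {
  if (!contentType || !/xml/i.test(contentType)) {
    throw new ArxivApiError(`Unexpected content-type: ${contentType ?? "none"}`);
  }
  if (!/<feed[\s>]/.test(body)) {
    throw new ArxivApiError("Response body is not an Atom feed");
  }
  const errorEntry = ERROR_ENTRY_ID.exec(body);
  if (errorEntry) {
    throw new ArxivApiError(`arXiv reported an error: ${errorEntry[1] ?? "unknown"}`);
  }
  return body;
}

const feed =
  '<feed xmlns="http://www.w3.org/2005/Atom"><entry><id>http://arxiv.org/abs/2401.12345v1</id></entry></feed>';
console.log(assertAtomFeed("application/atom+xml; charset=utf-8", feed) === feed);
// true
```

Tool logic calling such a function would contain no `try/catch` of its own; the thrown error propagates to the framework, which turns it into a tool error response.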
Contributing
Issues and pull requests are welcome. Run checks before submitting:
bun run devcheck
bun run test
License
Apache-2.0 — see LICENSE for details.