arxiv-mcp-server

arXiv paper search and full-text reading

@cyanheads/arxiv-mcp-server

Search arXiv, fetch paper metadata, and read full-text content via MCP. STDIO or Streamable HTTP.

4 Tools • 2 Resources

Version License Docker MCP SDK npm TypeScript

Install in Claude Desktop Install in Cursor Install in VS Code

Framework

Public Hosted Server: https://arxiv.caseyjhand.com/mcp


Tools

Four tools for searching and reading arXiv papers:

Tool NameDescription
arxiv_searchSearch arXiv papers by query with category and sort filters.
arxiv_get_metadataGet full metadata for one or more arXiv papers by ID.
arxiv_read_paperFetch the full text content of an arXiv paper from its HTML rendering.
arxiv_list_categoriesList arXiv category taxonomy, optionally filtered by group.

arxiv_search

Search for papers using free-text queries with field prefixes and boolean operators.

  • Field prefixes: ti: (title), au: (author), abs: (abstract), cat: (category), all: (all fields)
  • Boolean operators: AND, OR, ANDNOT
  • Optional category filter, sorting (relevance, submitted, updated), and pagination
  • Returns up to 50 results per request with full metadata including abstract

arxiv_get_metadata

Fetch full metadata for one or more papers by known arXiv ID.

  • Batch fetch up to 10 papers in a single request
  • Accepts both versioned (2401.12345v2) and unversioned (2401.12345) IDs
  • Legacy ID format supported (hep-th/9901001)
  • Reports not-found IDs separately from found papers

arxiv_read_paper

Read the full HTML content of an arXiv paper.

  • Tries native arXiv HTML first, falls back to ar5iv for broader coverage
  • Strips HTML head/boilerplate and collapses MathML to dollar-delimited LaTeX ($…$ inline, $$…$$ block) so the character budget targets paper content
  • max_characters defaults to 100,000; raw HTML can be 500KB-3MB+ for math-heavy papers
  • Returns raw HTML — no parsing or extraction; the LLM interprets content directly

arxiv_list_categories

List arXiv category codes and names for discovery.

  • ~155 categories across 8 top-level groups (cs, math, physics, q-bio, q-fin, stat, eess, econ)
  • Optional group filter to narrow results
  • Static data — always succeeds

Resources

URI PatternDescription
arxiv://paper/{paperId}Paper metadata by arXiv ID.
arxiv://categoriesFull arXiv category taxonomy.

Features

Built on @cyanheads/mcp-ts-core:

  • Declarative tool definitions — single file per tool, framework handles registration and validation
  • Unified error handling across all tools
  • Pluggable auth (none, jwt, oauth)
  • Structured logging with optional OpenTelemetry tracing
  • Runs locally (stdio/HTTP) from the same codebase

arXiv-specific:

  • Read-only, no authentication required — arXiv API is free, metadata is CC0
  • Rate-limited request queue enforcing arXiv's 3-second crawl delay
  • Adaptive cooldown on rate-limit (5s → 10s → 20s → 30s), honors Retry-After
  • Retry with exponential backoff for transient failures
  • HTML content fallback chain: native arXiv HTML → ar5iv
  • Full arXiv category taxonomy embedded as static data
  • Optional local OAI-PMH metadata mirror (SQLite + FTS5) — opt-in, eliminates rate-limit exposure for arxiv_search and arxiv_get_metadata. See Optional: Local Mirror.

Getting Started

Public Hosted Instance

A public instance is available at https://arxiv.caseyjhand.com/mcp — no installation required. Point any MCP client at it via Streamable HTTP:

{
  "mcpServers": {
    "arxiv": {
      "type": "streamable-http",
      "url": "https://arxiv.caseyjhand.com/mcp"
    }
  }
}

Self-Hosted / Local

Add to your MCP client config (e.g., claude_desktop_config.json):

{
  "mcpServers": {
    "arxiv": {
      "type": "stdio",
      "command": "bunx",
      "args": ["@cyanheads/arxiv-mcp-server@latest"]
    }
  }
}

Prerequisites

Installation

  1. Clone the repository:
git clone https://github.com/cyanheads/arxiv-mcp-server.git
  1. Navigate into the directory:
cd arxiv-mcp-server
  1. Install dependencies:
bun install

Configuration

All configuration is optional — the server works out of the box with sensible defaults.

VariableDescriptionDefault
ARXIV_API_BASE_URLarXiv API base URL.https://export.arxiv.org/api
ARXIV_REQUEST_DELAY_MSMinimum delay between arXiv API requests (ms).3000
ARXIV_CONTENT_TIMEOUT_MSTimeout for HTML content fetches (ms).30000
ARXIV_API_TIMEOUT_MSTimeout for API search/metadata requests (ms).15000
ARXIV_MIRROR_ENABLEDEnable local OAI-PMH metadata mirror for search and metadata.false
ARXIV_MIRROR_PATHSQLite path for the mirror../data/arxiv-mirror.db
ARXIV_MIRROR_REFRESH_CRONUTC cron expression for in-process daily refresh (HTTP mode only).unset
ARXIV_MIRROR_FALLBACK_LIVEFall through to live API on local ID-lookup miss.true
ARXIV_MIRROR_RECENT_DAYS_LIVERoute sortBy=submitted descending queries within this window to the live API.2
ARXIV_MIRROR_OAI_BASE_URLarXiv OAI-PMH endpoint base URL.https://oaipmh.arxiv.org/oai
ARXIV_MIRROR_OAI_REQUEST_DELAY_MSMinimum delay between OAI-PMH requests (ms).3000
MCP_TRANSPORT_TYPETransport: stdio or http.stdio
MCP_HTTP_PORTPort for HTTP server.3010
MCP_AUTH_MODEAuth mode: none, jwt, or oauth.none
MCP_LOG_LEVELLog level (RFC 5424).info

Running the Server

Local Development

  • Build and run:

    bun run build
    bun run start:http   # or start:stdio
    
  • Run checks and tests:

    bun run devcheck     # Lint, format, typecheck, audit
    bun run test         # Vitest
    

Optional: Local Mirror

For self-hosted deployments behind a single egress IP, arXiv's ~3-second per-IP crawl delay serializes concurrent users. An optional local mirror eliminates rate-limit exposure for arxiv_search and arxiv_get_metadata by serving from a SQLite + FTS5 store harvested via OAI-PMH. arxiv_read_paper continues to use the live API — full-content harvest is forbidden by arXiv's data policy.

Disabled by default. To enable:

# 1. Cold-start harvest (~4.4h sequential, resumable from checkpoint). One-time per installation.
bun run mirror:init

# 2. Enable the mirror.
export ARXIV_MIRROR_ENABLED=true

# 3. Start the server — reads switch to the mirror once the harvest completes.
bun run start:http

Daily incremental refresh (~30s, <1 page) via:

bun run mirror:refresh   # wire to cron / systemd timer / launchd, OR
                         # set ARXIV_MIRROR_REFRESH_CRON in HTTP mode to schedule in-process
bun run mirror:verify    # PRAGMA integrity_check + quick_check

Behavior notes. Ranking divergence: FTS5 BM25 differs from arXiv's internal ranking, so sortBy=relevance against the mirror returns a different top-K than the live API. Queries sorted by submitted descending within ARXIV_MIRROR_RECENT_DAYS_LIVE days route to the live API to cover the nightly-update gap. The mirror stores the latest version only; per-version reads continue to use the live API. See #12 for the full design.

Docker

docker build -t arxiv-mcp-server .
docker run -p 3010:3010 arxiv-mcp-server

Project Structure

DirectoryPurpose
src/mcp-server/tools/definitions/Tool definitions (*.tool.ts).
src/mcp-server/resources/definitions/Resource definitions (*.resource.ts).
src/services/arxiv/ArxivService — live arXiv API client (search, metadata, HTML).
src/services/arxiv/mirror/Optional OAI-PMH mirror — harvester, SQLite + FTS5 store, query translator, runner.
src/config/Environment variable parsing and validation with Zod.
scripts/arxiv-mirror-*.tsMirror lifecycle scripts (init, refresh, verify).
tests/Unit and integration tests.
docs/Design document and directory structure.

Development Guide

See CLAUDE.md for development guidelines and architectural rules. The short version:

  • Handlers throw, framework catches — no try/catch in tool logic
  • Use ctx.log for domain-specific logging
  • Rate limiting is managed by ArxivService — don't add per-tool delays
  • arXiv API returns HTTP 200 for everything — check content-type and response body

Contributing

Issues and pull requests are welcome. Run checks before submitting:

bun run devcheck
bun test

License

Apache-2.0 — see LICENSE for details.

関連サーバー