LLM Router
Multi-LLM routing MCP server — route text, image, video, and audio tasks to 20+ providers (OpenAI, Gemini, Perplexity, Anthropic, fal, ElevenLabs, Runway) with automatic complexity-based model selection, budget control, and provider failover.
A local control plane for AI coding tools.
Routes tasks to the cheapest model that can do the job well.
Protects quota. Enforces policy. Tracks spend. Falls back on failure.
Why This Exists
AI coding assistants route every task — simple questions, complex architecture — to the same expensive model. You pay full price for work that a cheaper model handles equally well.
llm-router sits between your AI tool and the LLM providers. It classifies each task by complexity, picks the cheapest capable model, and falls back through a provider chain on failure. You don't change your workflow. The router handles model selection automatically.
Use this if:
- You use Claude Code, Codex CLI, Gemini CLI, or Pi and want to reduce spend
- You want automatic fallback when a provider is down or rate-limited
- You want local Ollama models tried first (free) before paid APIs
- You want visibility into token spend across providers
Don't use this if:
- You always want the best possible model regardless of cost
- You don't use MCP-compatible tools
- You need guaranteed latency (routing adds classification overhead)
Quick Start
1. Install
pip install llm-routing
llm-router install
Package name: llm-routing on PyPI. CLI command: llm-router.
2. Add providers (optional)
export OPENAI_API_KEY="sk-..." # GPT-4o, o3
export GEMINI_API_KEY="AIza..." # Gemini Flash/Pro (free tier available)
export OLLAMA_BASE_URL="http://localhost:11434" # Local models (free)
Works with zero API keys on Claude Code Pro/Max subscriptions — routing uses MCP tools that call external models only when beneficial.
3. Verify
llm-router install --check # Preview what will be installed
llm-router health # Check provider connectivity
In Claude Code, ask a simple question. The session-end summary shows routing decisions and savings.
How It Works
User prompt
│
▼
┌──────────────────────┐
│ Complexity Classifier │ ← Heuristic (free, instant) or Ollama/Flash ($0.0001)
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Free-First Router │ ← Tries cheapest model first, walks up the chain
│ │
│ Ollama (free) │
│ → Codex (prepaid) │
│ → Gemini Flash │
│ → GPT-4o / Claude │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Guards (parallel) │ ← Circuit breaker, budget pressure, quality check
└──────────┬───────────┘
│
▼
Response + cost logged to local SQLite
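In code, the free-first loop amounts to something like the following minimal sketch. The Provider shape, the cost figures, and the breaker threshold are illustrative placeholders, not the router's actual internals:

# Minimal sketch of the free-first fallback loop (illustrative only;
# the real router's provider interface and guard logic are internal).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    cost_per_1k: float            # $ per 1K tokens; 0.0 for local models (placeholder values)
    call: Callable[[str], str]    # in the real router this wraps an API client
    failures: int = 0

def route(prompt: str, chain: list[Provider], max_failures: int = 3) -> str:
    """Try the cheapest capable provider first; walk up the chain on failure."""
    for p in sorted(chain, key=lambda p: p.cost_per_1k):
        if p.failures >= max_failures:
            continue              # circuit breaker: skip providers that keep failing
        try:
            return p.call(prompt)
        except Exception:
            p.failures += 1       # record the failure, fall through to the next provider
    raise RuntimeError("all providers in the chain failed")

# Example chain mirroring the diagram above (stub callables):
chain = [
    Provider("ollama", 0.0, lambda q: f"[ollama] {q}"),
    Provider("gemini-flash", 0.0003, lambda q: f"[flash] {q}"),
    Provider("gpt-4o", 0.01, lambda q: f"[gpt-4o] {q}"),
]
print(route("What does this error mean?", chain))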
Routing examples
| Task | Complexity | Chain |
|---|---|---|
| "What does this error mean?" | Simple | Ollama → Codex → Gemini Flash → Groq |
| "Implement OAuth" | Moderate | Ollama → Codex → GPT-4o → Gemini Pro |
| "Design distributed tracing" | Complex | Ollama → Codex → o3 → Claude Opus |
Classification is free (regex heuristics catch ~70% of tasks) or near-free (local Ollama / Gemini Flash for ambiguous cases).
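As an illustration of the heuristic tier, a minimal sketch of keyword-based classification. The patterns below are placeholders; the router's actual regexes and coverage are internal:

import re

# Illustrative keyword patterns only; not the router's real regexes.
SIMPLE = re.compile(r"\b(what does|what is|explain|meaning of|typo)\b", re.I)
COMPLEX = re.compile(r"\b(design|architecture|distributed|migrate|refactor)\b", re.I)

def classify(prompt: str) -> str:
    """Return 'simple', 'complex', or 'ambiguous' (escalated to an LLM)."""
    if SIMPLE.search(prompt) and not COMPLEX.search(prompt):
        return "simple"
    if COMPLEX.search(prompt):
        return "complex"
    return "ambiguous"   # falls through to Ollama / Gemini Flash for classification

print(classify("What does this error mean?"))   # simple
print(classify("Design distributed tracing"))   # complex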
Host Support
| Host | Auto-Routing | MCP Tools | Savings Potential |
|---|---|---|---|
| Claude Code | Full (hooks) | 60 tools | 60–80% |
| Codex CLI | Full (hooks) | 60 tools | 60–80% |
| Gemini CLI | Full (hooks) | 60 tools | 50–70% |
| VS Code / Cursor | Manual | 60 tools | 30–50% |
| Any MCP client | Manual | 60 tools | Varies |
Full = hooks intercept prompts and route automatically. No workflow change needed.
Manual = MCP tools are available; you invoke them explicitly (e.g., call llm_query).
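For manual hosts, calling a router tool is an ordinary MCP invocation. A minimal sketch with the Python MCP SDK; the server launch arguments and the prompt argument name are assumptions, so check the tool reference for exact schemas:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch args are an assumption; use whatever your host's MCP config registers.
    params = StdioServerParameters(command="llm-router", args=["serve"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # "prompt" is an assumed argument name; verify against the tool schema.
            result = await session.call_tool("llm_query", {"prompt": "What does EADDRINUSE mean?"})
            print(result)

asyncio.run(main())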
llm-router install # Claude Code (default)
llm-router install --host codex # Codex CLI
llm-router install --host gemini-cli # Gemini CLI
llm-router install --host vscode # VS Code
llm-router install --host cursor # Cursor
See docs/HOST_SUPPORT_MATRIX.md for full details on each host.
What You Can Do
| Use case | How |
|---|---|
| Route simple questions to free local models | Auto (hooks) or llm_query |
| Protect Claude subscription quota | Budget pressure monitoring + auto-downgrade |
| Fall back across providers on failure | Automatic chain with circuit breakers |
| Track token spend and savings | llm_usage, llm_savings, session-end reports |
| Enforce routing policy for your team | LLM_ROUTER_POLICY=aggressive |
| Generate images/video/audio | llm_image, llm_video, llm_audio |
| Run multi-step research pipelines | llm_orchestrate with templates |
| Bulk-edit files with cheap models | llm_fs_edit_many |
Providers
Routing chains are built from your configured providers. You only need one.
Text LLM Providers
| Provider | Models | Cost | Setup |
|---|---|---|---|
| Ollama | gemma4, qwen3.5, llama3, etc. | Free (local) | OLLAMA_BASE_URL |
| OpenAI | GPT-4o, o3, GPT-4o-mini | Paid API | OPENAI_API_KEY |
| Gemini | Flash, Pro | Free tier + paid | GEMINI_API_KEY |
| Anthropic | Claude Sonnet, Opus, Haiku | Paid API or subscription | ANTHROPIC_API_KEY or subscription |
| xAI | Grok-3 | Paid API | XAI_API_KEY |
| DeepSeek | DeepSeek Chat, Reasoner | Paid API (ultra-cheap) | DEEPSEEK_API_KEY |
| Mistral | Mistral Large, Small | Paid API | MISTRAL_API_KEY |
| Cohere | Command R+ | Paid API | COHERE_API_KEY |
| Perplexity | Sonar Pro (web-grounded) | Paid API | PERPLEXITY_API_KEY |
| Groq | Fast inference (Llama, Mixtral) | Free tier | GROQ_API_KEY |
| Together | Open-source models | Paid API | TOGETHER_API_KEY |
| HuggingFace | Open-source models | Free tier + paid | HF_TOKEN |
| Codex | GPT-5.4, o3 (prepaid desktop) | Included with Codex CLI | Auto-detected |
Media Providers
| Provider | Type | Setup |
|---|---|---|
| fal | Image (Flux), Video (Kling) | FAL_KEY |
| Stability | Image (Stable Diffusion 3) | STABILITY_API_KEY |
| ElevenLabs | Audio / TTS | ELEVENLABS_API_KEY |
| Runway | Video (Gen-3) | RUNWAY_API_KEY |
| Replicate | Various open-source models | REPLICATE_API_TOKEN |
See docs/PROVIDERS.md for setup instructions and model recommendations.
Routing Policies
Control how aggressively the router offloads to cheap models.
| Policy | Confidence Threshold | Typical Savings | Best For |
|---|---|---|---|
| Aggressive | 2 | 60–75% | Maximum cost reduction |
| Balanced (default) | 4 | 35–45% | Cost/quality tradeoff |
| Conservative | 6 | 10–15% | Quality over cost |
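Conceptually, the policy sets a confidence bar the classifier must clear before a task is offloaded to a cheaper model. A minimal sketch of that gate, assuming a 0-10 confidence scale (the scale and scoring are assumptions):

THRESHOLDS = {"aggressive": 2, "balanced": 4, "conservative": 6}

def should_offload(confidence: int, policy: str = "balanced") -> bool:
    """Offload to a cheaper model only when classification confidence
    meets the policy's threshold (0-10 scale assumed here)."""
    return confidence >= THRESHOLDS[policy]

print(should_offload(3, "aggressive"))     # True: low bar, offload eagerly
print(should_offload(3, "conservative"))   # False: keep on the premium model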
export LLM_ROUTER_POLICY=aggressive # Or: balanced, conservative
export LLM_ROUTER_ENFORCE=smart # smart | hard | soft | off
export LLM_ROUTER_PROFILE=balanced # budget | balanced | premium
LLM_ROUTER_ENFORCE controls how strictly the auto-route hook blocks direct model use:
- smart — route when confident, pass through when uncertain
- hard — always route, block unrouted tool calls
- soft — suggest routing, never block
- off — disable hook enforcement
MCP Tools (60)
llm-router exposes 60 MCP tools organized by function:
| Category | Tools | Examples |
|---|---|---|
| Routing & classification | 7 | llm_route, llm_classify, llm_auto, llm_stream |
| Text generation | 6 | llm_query, llm_code, llm_analyze, llm_research |
| Media generation | 3 | llm_image, llm_video, llm_audio |
| Pipeline orchestration | 2 | llm_orchestrate, llm_pipeline_templates |
| Admin & monitoring | 20+ | llm_usage, llm_budget, llm_health, llm_savings |
| Filesystem operations | 4 | llm_fs_find, llm_fs_edit_many |
| Subscription tracking | 3 | llm_check_usage, llm_refresh_claude_usage |
Slim mode (LLM_ROUTER_SLIM=routing or core) reduces registered tools to save context tokens in constrained environments.
Savings: How It Works
Savings are calculated by comparing actual spend against a baseline of routing every task to Claude Sonnet/Opus.
Methodology:
- Each routed task logs: model used, tokens consumed, estimated cost
- A baseline cost is computed as if the same tokens were processed by the most expensive model in the chain
- Savings = (baseline - actual) / baseline
Assumptions and limitations:
- Baseline assumes you would have used Opus/Sonnet for everything (worst case)
- Token estimates use a len(text) / 4 approximation, not exact tokenizer counts
- Cost data comes from LiteLLM's pricing tables (may lag provider price changes)
- Savings vary significantly by workload — code-heavy sessions route more to cheap models
- The router itself adds small overhead (classification costs ~$0.0001 per ambiguous task)
Observed range: 35–80% savings depending on policy and task mix. The "87%" figure in some docs represents a single-user peak over a specific development period, not a guaranteed outcome.
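A minimal sketch of the calculation described above; the per-token prices are placeholders, not current provider pricing:

def estimate_tokens(text: str) -> int:
    return len(text) // 4   # the same rough approximation noted above

# Placeholder prices in $ per 1K tokens -- not current provider pricing.
PRICE_PER_1K = {"ollama": 0.0, "gemini-flash": 0.0003, "claude-opus": 0.075}
BASELINE_MODEL = "claude-opus"   # "what if everything went to the priciest model"

def savings(tasks: list[tuple[str, str]]) -> float:
    """tasks = [(model_used, text), ...]; returns fractional savings."""
    actual = sum(PRICE_PER_1K[m] * estimate_tokens(t) / 1000 for m, t in tasks)
    baseline = sum(PRICE_PER_1K[BASELINE_MODEL] * estimate_tokens(t) / 1000 for _, t in tasks)
    return (baseline - actual) / baseline

print(savings([("ollama", "x" * 4000), ("claude-opus", "y" * 4000)]))  # 0.5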
Trust, Privacy, and Local-First Design
llm-router runs entirely on your machine. There is no hosted proxy, no telemetry, no account required.
| What | Where | Details |
|---|---|---|
| Your prompts | Sent to configured providers | Exactly like using those providers directly |
| API keys | .env or ~/.llm-router/config.yaml | Local files, never transmitted |
| Usage logs | ~/.llm-router/usage.db | Unencrypted SQLite (protected only by filesystem permissions) |
| Classification cache | In-memory | Cleared on process restart |
| Hook scripts | ~/.claude/hooks/ | Local shell scripts, inspectable |
What we do:
- Scrub API keys from structured logs
- Detect hook deadlocks before installation
- Store all data locally in ~/.llm-router/
- Respect provider rate limits and TOS
What you should know:
- Prompts are sent to whichever provider the router selects — review your provider's privacy policy
- Usage logs (SQLite) are not encrypted at rest — use full-disk encryption if needed
- The router cannot prevent model jailbreaks or prompt injection at the provider level
See SECURITY.md for responsible disclosure policy and docs/SECURITY_DESIGN.md for the full threat model.
Configuration
Minimal setup — only configure what you have:
# Provider keys (set any combination)
export OPENAI_API_KEY="sk-proj-..."
export GEMINI_API_KEY="AIza..."
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_BUDGET_MODELS="gemma4:latest,qwen3.5:latest"
# Routing behavior
export LLM_ROUTER_PROFILE="balanced" # budget | balanced | premium
export LLM_ROUTER_POLICY="balanced" # aggressive | balanced | conservative
export LLM_ROUTER_ENFORCE="smart" # smart | hard | soft | off
For teams or environments where .env is restricted:
# User-level config (no project .env needed)
mkdir -p ~/.llm-router && chmod 700 ~/.llm-router
cat > ~/.llm-router/config.yaml << 'EOF'
openai_api_key: "sk-proj-..."
gemini_api_key: "AIza..."
ollama_base_url: "http://localhost:11434"
llm_router_profile: "balanced"
EOF
chmod 600 ~/.llm-router/config.yaml
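A sketch of reading that file, assuming environment variables take precedence over the YAML config (the precedence order is an assumption; verify against the docs if it matters):

import os
from pathlib import Path
import yaml   # pip install pyyaml

def load_setting(name: str) -> str | None:
    """Env var first, then ~/.llm-router/config.yaml (precedence is an assumption)."""
    if value := os.environ.get(name.upper()):
        return value
    config_path = Path.home() / ".llm-router" / "config.yaml"
    if config_path.exists():
        with config_path.open() as f:
            return (yaml.safe_load(f) or {}).get(name.lower())
    return None

print(load_setting("openai_api_key"))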
Documentation
| Document | Purpose |
|---|---|
| Quick Start (2 min) | Fastest path to working routing |
| Getting Started | Full setup walkthrough |
| Host Support Matrix | Per-host feature comparison |
| Providers | Provider setup and model recommendations |
| Tool Reference | All 60 MCP tools with examples |
| Architecture | Internal design and module structure |
| Troubleshooting | Common issues and fixes |
| Security Design | Threat model and data handling |
Contributing
Contributions welcome. See CONTRIBUTING.md for full guidelines.
git clone https://github.com/ypollak2/llm-router.git
cd llm-router
uv sync --extra dev
uv run pytest tests/ -q # Run tests (1700+)
uv run ruff check src/ tests/ # Lint
Package Names
| Name | What it is |
|---|---|
| llm-routing | Current PyPI package (pip install llm-routing) |
| llm-router | CLI command and GitHub repo name |
| claude-code-llm-router | Deprecated legacy package (redirects to llm-routing) |
MIT License