LLM Router

Multi-LLM routing MCP server — route text, image, video, and audio tasks to 20+ providers (OpenAI, Gemini, Perplexity, Anthropic, fal, ElevenLabs, Runway) with automatic complexity-based model selection, budget control, and provider failover.


One MCP server. Every AI model. Smart routing.

Route text, image, video, and audio tasks to 20+ AI providers — automatically picking the best model for the job based on your budget and active profile.

Quick Start · How It Works · Providers · Profiles · Budget Control · Provider Setup



The Problem

You use Claude Code (or any MCP client). You also have access to GPT-4o, Gemini, Perplexity, DALL-E, Runway, ElevenLabs — but switching between them is manual, slow, and expensive.

LLM Router gives your AI assistant one unified interface to all of them — and it automatically picks the right one based on what you're doing and what you can afford.

You:     "Research the latest AI funding rounds"
Router:  → Perplexity Sonar Pro (search-augmented, best for current facts)

You:     "Generate a hero image for the landing page"
Router:  → Flux Pro via fal.ai (best quality/cost for images)

You:     "Write unit tests for the auth module"
Router:  → Claude Sonnet (top coding model, within budget)

You:     "Create a 5-second product demo clip"
Router:  → Kling 2.0 via fal.ai (best value for short video)

How It Saves You Real Money

Here's the key insight: not every task needs the same model.

When you use Claude Code without a router, every single request — whether it's "what does this function do?" or "redesign this entire architecture" — goes to the same expensive model. That's like hiring a surgeon to change a lightbulb.

LLM Router classifies each task automatically and sends it to the cheapest model that can handle it well:

"What does os.path.join do?"     → Gemini Flash    ($0.000001 — literally free)
"Refactor the auth module"       → Claude Sonnet   ($0.003)
"Design the full system arch"    → Claude Opus     ($0.015)

Task Distribution

| Task type | Without Router | With Router | Savings |
|---|---|---|---|
| Simple queries (60% of work) | Opus — $0.015 | Haiku/Gemini Flash — $0.0001 | 99% |
| Moderate tasks (30% of work) | Opus — $0.015 | Sonnet — $0.003 | 80% |
| Complex tasks (10% of work) | Opus — $0.015 | Opus — $0.015 | 0% |
| Blended monthly estimate | ~$50/mo | ~$8–15/mo | 70–85% |
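A quick back-of-envelope check of the blended estimate, using the illustrative per-task prices from the table (not real billing data):

```python
# Illustrative arithmetic behind the blended savings estimate above.
WITHOUT_ROUTER = 0.015  # every task goes to the premium model

# (share of work, per-task cost with routing)
MIX = [
    (0.60, 0.0001),  # simple   -> Haiku / Gemini Flash
    (0.30, 0.003),   # moderate -> Sonnet
    (0.10, 0.015),   # complex  -> Opus
]

def blended_cost(mix):
    """Average per-task cost across the task mix."""
    return sum(share * cost for share, cost in mix)

def savings(mix, baseline):
    """Fraction saved versus sending everything to the baseline model."""
    return 1 - blended_cost(mix) / baseline
```

At this mix the blended cost is $0.00246 per task versus $0.015, roughly an 84% saving, consistent with the 70–85% range above.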

💡 With Ollama: Route simple tasks to a free local model (llama3.2, qwen2.5-coder) and the savings become even more dramatic — those 60% of simple tasks cost $0.

The router pays for itself in the first hour of use.


Quick Start

Option A: PyPI (Recommended)

pip install claude-code-llm-router

Option B: Claude Code Plugin

claude plugin add ypollak2/llm-router

Option C: Manual Install

git clone https://github.com/ypollak2/llm-router.git
cd llm-router
uv sync
./scripts/install.sh    # registers as MCP server in Claude Code

Get Running in 3 Steps


Enable Global Auto-Routing

Make the router evaluate every prompt across all projects:

# From the MCP tool:
llm_setup(action='install_hooks')

# Or from the CLI:
llm-router-install-hooks

This installs hooks + rules to ~/.claude/ so every Claude Code session auto-routes tasks to the optimal model.

Start for free: Google's Gemini API has a free tier with 1M tokens/day — no credit card needed. Groq also offers a generous free tier with ultra-fast inference.

What You Get

  • 24 MCP tools — Smart routing, text, image, video, audio, streaming, setup, quality analytics, usage monitoring, cache management
  • /route skill — Smart task classification and routing in one command
  • Smart classifier — Auto-picks Claude Haiku/Sonnet/Opus based on complexity
  • Prompt classification cache — SHA-256 exact-match LRU cache (1000 entries, 1h TTL) for instant repeat classifications
  • Auto-route hook — Multi-layer UserPromptSubmit classifier: routes every prompt (including codebase questions) through Haiku/Ollama first; heuristic scoring (instant) → Ollama local LLM (free, ~1s) → cheap API (Gemini Flash/GPT-4o-mini, ~$0.0001) → auto fallback. Hooks self-update after pip upgrade — no reinstall needed.
  • Streaming responses — llm_stream tool for long-running tasks, shows output as it arrives
  • Usage auto-refresh — PostToolUse hook detects stale Claude subscription data (>15 min) and nudges for refresh
  • Savings awareness — Every 5th routed task, shows estimated Claude API costs and rate limit capacity saved
  • Rate limit detection — Catches 429/rate_limit errors with smart cooldowns (15s for rate limits vs 60s for hard failures)
  • Key validation — llm_setup(action='test') validates API keys with minimal LLM calls (~$0.0001 each)
  • Claude subscription monitoring — Live session/weekly usage from claude.ai
  • Codex desktop integration — Route tasks to local OpenAI Codex (free)
  • LLM Orchestrator agent — Autonomous multi-step task decomposition across models
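The classification cache described in the list above can be sketched roughly like this (class and method names are illustrative, not the router's actual code):

```python
import hashlib
import time
from collections import OrderedDict

class ClassificationCache:
    """Sketch of an exact-match prompt cache: SHA-256 key,
    LRU eviction, and a TTL, as described above."""

    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (timestamp, value)

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.time() - ts > self.ttl:   # expired entry
            del self._store[key]
            return None
        self._store.move_to_end(key)      # mark as recently used
        return value

    def put(self, prompt, classification):
        key = self._key(prompt)
        self._store[key] = (time.time(), classification)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```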

How It Works

Architecture


Routing Decision Flow



Benchmark-Driven Routing

Model chains are ranked using weekly-refreshed data from four authoritative sources, so the router always sends your task to the current best model for that task type.

Current Top Models by Task

| Task | 🥇 Premium | 🥈 Balanced | 🥉 Budget |
|---|---|---|---|
| 💻 Code | DeepSeek-R1, o3, Opus | DeepSeek Chat, GPT-4o, Sonnet | Flash, DeepSeek, Haiku |
| 🔍 Analyze | DeepSeek-R1, GPT-4o, Sonnet | DeepSeek-R1, GPT-4o, Gemini Pro | Flash, DeepSeek, Haiku |
| ❓ Query | DeepSeek Chat, GPT-4o, Gemini Pro | DeepSeek Chat, GPT-4o, Gemini Pro | Flash, DeepSeek, Haiku |
| ✍️ Generate | DeepSeek Chat, GPT-4o, Gemini Pro | DeepSeek Chat, GPT-4o, Gemini Pro | Flash, DeepSeek, Haiku |
| 🔎 Research | Perplexity Pro, Perplexity, GPT-4o | Perplexity Pro, Perplexity, GPT-4o | Perplexity, Flash, Haiku |

**Bold** = first model tried when Claude quota is high (> 85%) or in subscription mode. Full benchmark data, scoring weights, raw scores, and sources: docs/BENCHMARKS.md. 🔄 Updated every Monday via GitHub Actions and distributed to all users on the next pip upgrade.

How rankings are computed

Arena Hard win-rate  ──┐
Aider code pass rate ──┼── weighted by task type ──► quality score ──► quality-cost tier sort
HuggingFace MMLU/MATH──┤                                                ↓
LiteLLM pricing     ──┘                             within 5% quality band → cheapest model first

Quality-cost sorting: models within 5% quality of each other are grouped into a tier. Within that tier, the cheapest model sorts first. This means GPT-4o ($0.006/1K) leads over Sonnet ($0.009/1K) when their quality difference is under 5%, and DeepSeek Chat ($0.0007/1K) leads over everyone when it's within the top quality band.
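A minimal sketch of that quality-cost tier sort, with made-up quality scores and prices for illustration:

```python
def tier_sort(models, band=0.05):
    """Sort by quality descending, group models within `band` of each
    tier leader, then order each tier by price ascending."""
    ranked = sorted(models, key=lambda m: -m["quality"])
    out, i = [], 0
    while i < len(ranked):
        leader = ranked[i]["quality"]
        # contiguous run of models within the quality band of the leader
        tier = [m for m in ranked[i:] if leader - m["quality"] <= band * leader]
        tier.sort(key=lambda m: m["price"])  # cheapest first within the tier
        out.extend(tier)
        i += len(tier)
    return out

# Illustrative numbers only, not the router's real benchmark data.
MODELS = [
    {"name": "sonnet",   "quality": 0.97, "price": 0.009},
    {"name": "gpt-4o",   "quality": 0.95, "price": 0.006},
    {"name": "deepseek", "quality": 0.94, "price": 0.0007},
]
```

With all three models inside one 5% band, the cheapest (DeepSeek) sorts first, matching the behavior described above.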


Auto-Route Hook — Every Prompt, Cheaper Model First

The UserPromptSubmit hook intercepts all prompts — not just explicit routing requests — and classifies them before your top-tier model sees them. Simple tasks go straight to Haiku or a local Ollama model; only genuinely complex work escalates.

What gets routed

| Prompt | Classified as | Model used |
|---|---|---|
| why doesn't the router work? | analyze/moderate | Haiku |
| how does benchmarks.py work? | query/simple | Ollama / Haiku |
| fix the bug in profiles.py | code/moderate | Haiku / Sonnet |
| implement a distributed cache | code/complex | Sonnet / Opus |
| write a blog post about LLMs | generate/moderate | Haiku / Gemini Flash |
| git status (raw shell command) | (skipped — terminal op) | |

Classification chain (stops at first success)

1. Heuristic scoring    instant, free   → high-confidence patterns route immediately
2. Ollama local LLM     free, ~1s       → catches what heuristics miss
3. Cheap API            ~$0.0001        → Gemini Flash / GPT-4o-mini fallback
4. Query catch-all      instant, free   → any remaining question → Haiku
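The stop-at-first-success chain is a straightforward fallback loop. A sketch, with a stand-in heuristic layer (the real patterns and layers differ):

```python
def classify(prompt, layers):
    """Try each classification layer in order; first non-None result wins."""
    for layer in layers:
        result = layer(prompt)
        if result is not None:
            return result
    return ("query", "simple")   # layer 4: catch-all, routes to Haiku

def heuristic(prompt):
    """Stand-in layer 1: only high-confidence patterns, else defer (None)."""
    text = prompt.strip().lower()
    if text.startswith(("git ", "ls ", "cd ")):
        return ("terminal", "skip")        # raw shell command, not routed
    if text.startswith(("implement", "design")):
        return ("code", "complex")
    return None                            # defer to the next layer
```

Additional layers (Ollama, cheap API) would slot into the `layers` list between the heuristic and the catch-all.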

Self-updating hooks

Hook scripts are versioned (# llm-router-hook-version: N). On every MCP server startup, if the bundled version in the installed package is newer than what's in ~/.claude/hooks/, it's automatically overwritten. Existing users get classification improvements automatically after pip install --upgrade claude-code-llm-router — no need to re-run llm-router-install-hooks.
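The version comparison can be sketched as follows (function names are illustrative):

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r"#\s*llm-router-hook-version:\s*(\d+)")

def hook_version(text):
    """Parse the version header; missing or unreadable counts as 0."""
    match = VERSION_RE.search(text)
    return int(match.group(1)) if match else 0

def maybe_update_hook(bundled_script, installed_path):
    """Overwrite the installed hook only if the bundled copy is newer."""
    installed = installed_path.read_text() if installed_path.exists() else ""
    if hook_version(bundled_script) > hook_version(installed):
        installed_path.write_text(bundled_script)
        return True
    return False
```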


Smart Routing (Claude Code Models)

Use Claude Code's own models (Haiku/Sonnet/Opus) without extra API keys via the smart classifier:

llm_classify("What is the capital of France?")
→ [S] simple (99%) → haiku

llm_classify("Write a REST API with auth and pagination")
→ [M] moderate (98%) → sonnet

llm_classify("Design a distributed CQRS architecture")
→ [C] complex (85%) → opus

Complexity-First Routing

Complexity drives model selection — this is the real savings mechanism. You don't need opus for "what time is it?" and you don't want haiku for architecture design. Budget pressure is a late safety net, not the primary router.

# In .env
QUALITY_MODE=balanced        # best | balanced | conserve
MIN_MODEL=haiku              # floor: never route below this

| Claude usage | Effect |
|---|---|
| 0–85% | No downshift — complexity routing handles efficiency |
| 85–95% | Downshift by 1 tier + suggest external fallback |
| 95%+ | Downshift by 2 tiers + recommend external (Codex, OpenAI, Gemini) |

Budget pressure comes from real Claude subscription data (session %, weekly %) fetched live from claude.ai. The router also factors in time until session reset — if you're at 90% but the session resets in 5 minutes, no downshift needed.
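A sketch of the downshift ladder, including the time-aware reset check (thresholds from the table above; names and the 10-minute cutoff are illustrative):

```python
TIERS = ["haiku", "sonnet", "opus"]  # cheap -> premium

def downshift(model, usage_pct, minutes_to_reset, min_model="haiku"):
    """Apply the pressure ladder, but skip it when the session resets soon."""
    if minutes_to_reset <= 10:   # time-aware: reset imminent, no downshift
        return model
    if usage_pct >= 95:
        steps = 2
    elif usage_pct >= 85:
        steps = 1
    else:
        steps = 0
    floor = TIERS.index(min_model)   # MIN_MODEL: never route below this
    return TIERS[max(TIERS.index(model) - steps, floor)]
```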

External Fallback

When Claude quota is tight (85%+), the router ranks available external models:

llm_classify("Design auth architecture")
# -> complex -> sonnet (downshifted from opus)
#    pressure: [========..] 90%
#    >> fallback: codex/gpt-5.4 (free, preserves Claude quota)
  • Codex (local): Free — uses your OpenAI desktop subscription
  • OpenAI API: GPT-4o, o3 (ranked by quality, filtered by budget)
  • Gemini API: gemini-2.5-pro, gemini-2.5-flash

Per-provider budgets via LLM_ROUTER_BUDGET_OPENAI=10.00, LLM_ROUTER_BUDGET_GEMINI=5.00.

Claude Subscription Monitoring

Live usage data from your claude.ai account — no guessing:

+----------------------------------------------------------+
|                Claude Subscription (Live)                |
+----------------------------------------------------------+
|   Session      [====........]  35%  resets in 3h 7m      |
|   Weekly (all) [===.........]  23%  resets Fri 01:00 PM  |
|   Sonnet only  [===.........]  26%  resets Wed 10:00 AM  |
+----------------------------------------------------------+
|   OK 35% pressure -- full model selection                |
+----------------------------------------------------------+

Fetched via Playwright from claude.ai's internal JSON API (same data the settings page uses). One browser_evaluate call, cached in memory for routing decisions.


Providers

Text & Code LLMs

| Provider | Models | Free Tier | Best For |
|---|---|---|---|
| 🦙 Ollama | Any local model | Yes (free forever) | Privacy, zero cost, offline use |
| Google Gemini | 2.5 Pro, 2.5 Flash | Yes (1M tokens/day) | Generation, long context |
| Groq | Llama 3.3, Mixtral | Yes | Ultra-fast inference |
| OpenAI | GPT-4o, GPT-4o-mini, o3 | No | Code, analysis, reasoning |
| Perplexity | Sonar, Sonar Pro | No | Research, current events |
| Anthropic | Claude Sonnet, Haiku | No | Nuanced writing, safety |
| DeepSeek | V3, Reasoner | Yes (limited) | Cost-effective reasoning |
| Mistral | Large, Small | Yes (limited) | Multilingual |
| Together | Llama 3, CodeLlama | Yes (limited) | Open-source models |
| xAI | Grok 3 | No | Real-time information |
| Cohere | Command R+ | Yes (trial) | RAG, enterprise search |

🦙 Ollama runs models locally — no API key, no cost, no data sent externally. Full Ollama setup guide →

Image Generation

| Provider | Models | Best For |
|---|---|---|
| Google Gemini | Imagen 3 | High quality, integrated with text models |
| fal.ai | Flux Pro, Flux Dev | Quality/cost ratio, fast generation |
| OpenAI | DALL-E 3, DALL-E 2 | Prompt adherence, text in images |
| Stability AI | Stable Diffusion 3 | Fine control, open weights |

Video Generation

| Provider | Models | Best For |
|---|---|---|
| Google Gemini | Veo 2 | Integrated with Gemini ecosystem |
| Runway | Gen-3 Alpha | Professional quality, motion control |
| fal.ai | Kling, minimax | Value, fast generation |
| Replicate | Various | Open-source video models |

Audio & Voice

| Provider | Models | Best For |
|---|---|---|
| ElevenLabs | Multilingual v2 | Voice cloning, highest quality |
| OpenAI | TTS-1, TTS-1-HD | Cost-effective text-to-speech |

20+ providers and growing. See docs/PROVIDERS.md for full setup guides with API key links.


MCP Tools

Once installed, Claude Code gets these MCP tools:

| Tool | What It Does |
|---|---|
| **Smart Routing** | |
| `llm_classify` | Classify complexity + recommend model with time-aware budget pressure |
| `llm_route` | Auto-classify, then route to the best external LLM |
| `llm_track_usage` | Report Claude Code token usage for budget tracking |
| **Text & Code** | |
| `llm_query` | General questions — auto-routed to the best text LLM |
| `llm_research` | Search-augmented answers via Perplexity |
| `llm_generate` | Creative content — writing, summaries, brainstorming |
| `llm_analyze` | Deep reasoning — analysis, debugging, problem decomposition |
| `llm_code` | Coding tasks — generation, refactoring, algorithms |
| `llm_edit` | Route code-edit reasoning to a cheap model → returns exact {file, old_string, new_string} pairs for Claude to apply |
| **Media** | |
| `llm_image` | Image generation — Gemini Imagen, DALL-E, Flux, or SD |
| `llm_video` | Video generation — Gemini Veo, Runway, Kling, etc. |
| `llm_audio` | Voice/audio — TTS via ElevenLabs or OpenAI |
| **Orchestration** | |
| `llm_orchestrate` | Multi-step pipelines across multiple models |
| `llm_pipeline_templates` | List available orchestration templates |
| **Cache** | |
| `llm_cache_stats` | View cache hit rate, entries, memory estimate, evictions |
| `llm_cache_clear` | Clear the classification cache |
| **Streaming** | |
| `llm_stream` | Stream LLM responses for long-running tasks — output as it arrives |
| **Monitoring & Setup** | |
| `llm_check_usage` | Check live Claude subscription usage (session %, weekly %) |
| `llm_update_usage` | Feed live usage data from claude.ai into the router |
| `llm_codex` | Route tasks to local Codex desktop agent (free, uses OpenAI sub) |
| `llm_setup` | Discover API keys, add providers, get setup guides, validate keys, install global hooks |
| `llm_quality_report` | Routing accuracy, classifier stats, savings metrics, downshift rate |
| `llm_set_profile` | Switch routing profile (budget / balanced / premium) |
| `llm_usage` | Unified dashboard — Claude sub, Codex, APIs, savings in one view |
| `llm_health` | Check provider availability and circuit breaker status |
| `llm_providers` | List all supported and configured providers |
| **Session Memory** | |
| `llm_save_session` | Summarize + persist current session for cross-session context injection |
Context injection: text tools (llm_query, llm_research, llm_generate, llm_analyze, llm_code) automatically prepend recent conversation history and previous session summaries to every external LLM call — so GPT-4o, Gemini, and Perplexity receive the same context you have. Pass context="..." to add caller-supplied context on top. Controlled by LLM_ROUTER_CONTEXT_ENABLED (default: on).
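Conceptually, the injection is a prompt-assembly step before the external call. A sketch under assumed names (the router's real buffer and wire format may differ):

```python
from collections import deque

RING = deque(maxlen=10)   # ring buffer of recent conversation turns

def build_prompt(user_prompt, session_summary="", caller_context="", enabled=True):
    """Prepend prior-session summary, recent turns, and caller-supplied
    context to the outgoing prompt, in that order."""
    if not enabled:       # mirrors LLM_ROUTER_CONTEXT_ENABLED=off
        return user_prompt
    parts = []
    if session_summary:
        parts.append("[previous session]\n" + session_summary)
    if RING:
        parts.append("[recent turns]\n" + "\n".join(RING))
    if caller_context:
        parts.append("[caller context]\n" + caller_context)
    parts.append(user_prompt)
    return "\n\n".join(parts)
```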


Routing Profiles


Three built-in profiles map to task complexity. Model order is pressure-aware — the router dynamically reorders chains based on live Claude subscription usage.

| | Budget (simple) | Balanced (medium) | Premium (complex) |
|---|---|---|---|
| Text | Ollama → Haiku → cheap | Sonnet → DeepSeek → GPT-4o | Opus → Sonnet → o3 |
| Research | Perplexity Sonar | Perplexity Sonar Pro | Perplexity Sonar Pro |
| Code | Ollama → Haiku → DeepSeek | Sonnet → DeepSeek → GPT-4o | Opus → Sonnet → DeepSeek-R1 → o3 |
| Image | Flux Dev, Imagen Fast | Flux Pro, Imagen 3, DALL-E 3 | Imagen 3, DALL-E 3 |
| Video | minimax, Veo 2 | Kling, Veo 2, Runway Turbo | Veo 2, Runway Gen-3 |
| Audio | OpenAI TTS | ElevenLabs | ElevenLabs |

Quota-aware chain reordering

Claude Pro/Max tokens are treated as free — the router uses them first. As quota is consumed, chains automatically reorder to preserve remaining Claude budget:

| Claude usage | Chain order |
|---|---|
| 0–84% | Claude first (free under subscription) |
| 85–98% | DeepSeek/Codex → cheap externals → Claude last |
| ≥ 99% (hard cap) | DeepSeek → Codex → cheap → paid — zero Claude |
| Research (any) | Perplexity always first (web-grounded) |
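The reordering reduces to partitioning the chain by provider and swapping the halves at the thresholds above. A sketch with illustrative names:

```python
def reorder_chain(chain, claude_usage_pct, task="code"):
    """Reorder a provider chain by live Claude quota, per the table above."""
    if task == "research":
        # Perplexity always first (web-grounded); stable sort keeps the rest
        return sorted(chain, key=lambda p: p != "perplexity")
    claude = [p for p in chain if p.startswith("claude")]
    other = [p for p in chain if not p.startswith("claude")]
    if claude_usage_pct >= 99:
        return other                 # hard cap: zero Claude
    if claude_usage_pct >= 85:
        return other + claude        # preserve remaining Claude quota
    return claude + other            # subscription tokens first
```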

Claude Code subscription mode

If you use Claude Code (Pro/Max), set LLM_ROUTER_CLAUDE_SUBSCRIPTION=true in .env. The router will never route to Anthropic via API — you're already on Claude, so API routing would require a separate key and add duplicate billing. Instead, every task routes to the best non-Claude alternative:

# In .env
LLM_ROUTER_CLAUDE_SUBSCRIPTION=true   # no ANTHROPIC_API_KEY needed

At normal quota (< 85%), chains lead with the highest-quality available model. At high quota (> 85%), DeepSeek takes over — quality 1.0 benchmark score at ~1/8th the cost of GPT-4o:

| | Low quota (< 85%) | High quota (> 85%) |
|---|---|---|
| BUDGET/CODE | DeepSeek Chat | DeepSeek Chat |
| BALANCED/CODE | DeepSeek Chat | DeepSeek Chat |
| BALANCED/ANALYZE | DeepSeek Reasoner | DeepSeek Reasoner |
| PREMIUM/CODE | o3 | DeepSeek Reasoner |
| PREMIUM/ANALYZE | DeepSeek Reasoner | DeepSeek Reasoner |

Switch profile anytime:

llm_set_profile("budget")    # Development, drafts, exploration
llm_set_profile("balanced")  # Production work, client deliverables
llm_set_profile("premium")   # Critical tasks, maximum quality

Budget Control

Set a monthly budget to prevent overspending:

# In .env
LLM_ROUTER_MONTHLY_BUDGET=50   # USD, 0 = unlimited

The router:

  • Tracks real-time spend across all providers in SQLite
  • Blocks requests when the monthly budget is reached
  • Shows budget status in llm_usage

llm_usage("month")

## Usage Summary (month)
Calls: 142
Tokens: 240,000 in + 80,000 out = 320,000 total
Cost: $3.4200
Avg latency: 1200ms

### Budget Status
Monthly budget: $50.00
Spent this month: $3.4200 (6.8%)
Remaining: $46.5800
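The enforcement itself is a simple pre-dispatch check against SQLite. A sketch with an assumed `usage(ts, cost)` schema (the router's real schema may differ):

```python
import sqlite3

def month_spend(db):
    """Sum of recorded costs since the start of the current month."""
    row = db.execute(
        "SELECT COALESCE(SUM(cost), 0) FROM usage "
        "WHERE ts >= date('now', 'start of month')"
    ).fetchone()
    return row[0]

def check_budget(db, monthly_budget):
    """Raise before dispatching a request if the budget is exhausted."""
    if monthly_budget <= 0:          # 0 = unlimited
        return
    spent = month_spend(db)
    if spent >= monthly_budget:
        raise RuntimeError(
            f"monthly budget reached: ${spent:.2f} of ${monthly_budget:.2f}"
        )
```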

Multi-Step Orchestration

Chain tasks across different models in a pipeline:


llm_orchestrate("Research AI trends and write a report", template="research_report")

Built-in templates:

| Template | Steps | Pipeline |
|---|---|---|
| research_report | 3 | Research → Analyze → Write |
| competitive_analysis | 4 | Multi-source research → SWOT → Report |
| content_pipeline | 4 | Research → Draft → Review → Polish |
| code_review_fix | 3 | Review → Fix → Test |
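A pipeline template boils down to an ordered list of (tool, prompt) steps where each prompt can reference the previous step's output. A sketch with stand-in tool calls (template shape and names are illustrative):

```python
# Hypothetical encoding of the research_report template above.
RESEARCH_REPORT = [
    ("llm_research", "Research: {input}"),
    ("llm_analyze",  "Extract key findings:\n{prev}"),
    ("llm_generate", "Write a report from:\n{prev}"),
]

def run_pipeline(template, user_input, call_tool):
    """Run steps sequentially; call_tool(tool_name, prompt) -> str."""
    prev = ""
    for tool, prompt in template:
        prev = call_tool(tool, prompt.format(input=user_input, prev=prev))
    return prev
```

Each step can be routed independently, so research goes to Perplexity while the final write-up goes to a generation model.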

Configuration

Environment Variables

# Required: at least one provider
GEMINI_API_KEY=AIza...         # Free tier! https://aistudio.google.com/apikey
OPENAI_API_KEY=sk-proj-...
PERPLEXITY_API_KEY=pplx-...

# Optional: more providers (add as many as you want)
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=...
GROQ_API_KEY=gsk_...
FAL_KEY=...
ELEVENLABS_API_KEY=...

# Router config
LLM_ROUTER_PROFILE=balanced        # budget | balanced | premium
LLM_ROUTER_MONTHLY_BUDGET=0        # USD, 0 = unlimited
LLM_ROUTER_CLAUDE_SUBSCRIPTION=false  # true = you're a Claude Code Pro/Max user;
                                       # anthropic/* excluded, router uses non-Claude models

# Smart routing (Claude Code model selection)
DAILY_TOKEN_BUDGET=0               # tokens/day, 0 = unlimited
QUALITY_MODE=balanced              # best | balanced | conserve
MIN_MODEL=haiku                    # floor: haiku | sonnet | opus

See .env.example for the full list of supported providers.

Claude Code Integration

After running ./scripts/install.sh, your ~/.claude.json will include:

{
  "mcpServers": {
    "llm-router": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/llm-router", "llm-router"]
    }
  }
}

Development

# Install with dev dependencies
uv sync --extra dev

# Run tests
uv run pytest -v

# Run integration tests (requires real API keys)
uv run pytest tests/test_integration.py -v

# Lint
uv run ruff check src/

Roadmap

See ROADMAP.md for the detailed roadmap with phases and priorities.

Completed (v0.1–v0.5)

  • Core text LLM routing (10+ providers)
  • Configurable profiles (budget / balanced / premium)
  • Cost tracking with SQLite
  • Health checks with circuit breaker
  • Image generation (Gemini Imagen 3, DALL-E, Flux, SD)
  • Video generation (Gemini Veo 2, Runway, Kling, minimax)
  • Audio/voice routing (ElevenLabs, OpenAI TTS)
  • Monthly budget enforcement
  • Multi-step orchestration with pipeline templates
  • Claude Code plugin with orchestrator agent and /route skill
  • Freemium tier gating
  • CI with GitHub Actions
  • Smart complexity-first routing (simple->haiku, moderate->sonnet, complex->opus)
  • Live Claude subscription monitoring (session %, weekly %, Sonnet %)
  • Time-aware budget pressure (factors in session reset proximity)
  • External fallback ranking when Claude is tight (Codex, OpenAI, Gemini)
  • Codex desktop integration (local agent, free via OpenAI subscription)
  • Unified usage dashboard (Claude sub + Codex + APIs + savings)
  • llm_setup tool for API discovery and secure key management
  • Per-provider budget limits
  • ASCII box-drawing dashboard (terminal-friendly, no Unicode issues)
  • Prompt classification cache (SHA-256 exact-match, in-memory LRU, 1h TTL)
  • llm_cache_stats + llm_cache_clear MCP tools
  • Auto-route hook (UserPromptSubmit heuristic classifier, zero-latency)
  • Rate limit detection with smart cooldowns (15s rate limit vs 60s hard failure)
  • llm_setup(action='test') — API key validation with minimal LLM calls
  • Streaming responses (llm_stream tool + call_llm_stream() async generator)
  • Usage auto-refresh hook (PostToolUse staleness detection + usage pulse wiring)
  • Published to PyPI as claude-code-llm-router
  • Multi-layer auto-classification: scoring heuristic → Ollama local LLM (qwen3.5) → cheap API (Gemini Flash/GPT-4o-mini)
  • Savings awareness (PostToolUse hook tracks routed calls, periodic cost savings reminders)
  • Structural context compaction (5 strategies: whitespace, comments, dedup, truncation, stack traces)
  • Quality logging (routing_decisions table + llm_quality_report tool)
  • Savings persistence (JSONL + SQLite import, lifetime analytics)
  • Gemini media APIs (Imagen 3 images, Veo 2 video)
  • Global hook installer (llm_setup(action='install_hooks') + llm-router-install-hooks CLI)
  • Global routing rules (auto-installed to ~/.claude/rules/llm-router.md)
  • Session context injection (ring buffer + SQLite summaries, injected into all text tools)
  • llm_save_session MCP tool (auto-summarize + persist session for future context)
  • Cross-session memory (previous session summaries prepended to external LLM calls)
  • Auto-update routing rules (version header + silent update on MCP startup after pip upgrade)
  • Token arbitrage enforcement — routing hint override bug fixed; simple tasks now correctly route to cheap models
  • Claude Code subscription mode (LLM_ROUTER_CLAUDE_SUBSCRIPTION) — exclude Anthropic from chains; route to DeepSeek/Gemini/GPT-4o instead
  • Quality-cost tier sorting — within 5% quality band, prefer cheaper model (GPT-4o over Sonnet, DeepSeek over everyone when near-equal quality)
  • DeepSeek Reasoner in cheap tier — $0.0014/1K leads at >85% pressure (was treated as "paid" tier alongside o3 at $0.025)
  • Codex injection fix — no longer injected at position 0 when subscription mode removes Claude from chain (caused 300s timeouts)
  • Codex task filtering — excluded from RESEARCH (no web access) and QUERY (too slow) chains

Completed (v0.7)

  • Availability-aware routing — P95 latency from routing_decisions table folded into benchmark quality score. Penalty range 0.0–0.50 (<5s=0, <15s=0.03, <60s=0.10, <180s=0.30, ≥180s=0.50). 60s cache prevents repeated DB hits per routing cycle.
  • Codex cold-start defaults — _COLD_START_LATENCY_MS applies a pessimistic 60–90s P95 before any history exists, preventing Codex from being placed first in chains on a fresh install.
  • llm_edit MCP tool — Routes code-edit reasoning to a cheap CODE model. Reads files locally (32 KB cap), gets {file, old_string, new_string} JSON back, returns formatted instructions for Claude to apply mechanically. Keeps Opus out of the "what to change" loop.
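The penalty ladder in the availability-aware routing item maps directly to a small lookup. A sketch using the thresholds quoted above (function names are illustrative):

```python
def latency_penalty(p95_seconds):
    """Map observed P95 latency (seconds) to a quality-score penalty,
    per the ladder: <5s=0, <15s=0.03, <60s=0.10, <180s=0.30, else 0.50."""
    ladder = [(5, 0.0), (15, 0.03), (60, 0.10), (180, 0.30)]
    for limit, penalty in ladder:
        if p95_seconds < limit:
            return penalty
    return 0.50  # >= 180s

def effective_quality(benchmark_quality, p95_seconds):
    """Benchmark score folded together with the availability penalty."""
    return benchmark_quality - latency_penalty(p95_seconds)
```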

Next Up (v0.8 — Evaluation & Learning)

  • Classification outcome tracking (was the routed model's response good?)
  • A/B testing framework for routing decisions
  • Adaptive routing based on historical success rates

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Key areas where help is needed:

  • Adding new provider integrations
  • Improving routing intelligence
  • Testing across different MCP clients
  • Documentation and examples

License

MIT — use it however you want.


Built with LiteLLM and MCP
