LLAMA Hot Swap
MCP server for hot-swapping llama.cpp models in Claude Code - launchctl (macOS) + systemd (Linux)
mcp-llama-swap
Hot-swap llama.cpp models inside a running Claude Code session. No context loss. One command.
Plan with a reasoning model. Implement with a coding model. Same session, same context, zero manual overhead.
Supports macOS (launchctl) and Linux (systemd).
Why
Running local LLMs means choosing between a strong reasoning model and a fast coding model. You can't load both on a single machine. Manually swapping models kills your conversation context and flow.
mcp-llama-swap solves this by giving Claude Code a tool to swap the model behind llama-server via your system's service manager (launchctl on macOS, systemd on Linux), while preserving the full conversation history client-side.
Quick Start
Install
# Option A: Run directly with uvx (no install needed)
uvx mcp-llama-swap
# Option B: Install from PyPI
pip install mcp-llama-swap
Configure Claude Code
Add to ~/.claude.json:
{
"mcpServers": {
"llama-swap": {
"command": "uvx",
"args": ["mcp-llama-swap"],
"env": {
"LLAMA_SWAP_CONFIG": "/path/to/config.json"
}
}
}
}
Configure Models
Create config.json (macOS):
{
"plists_dir": "~/.llama-plists",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "qwen35-thinking.plist",
"coder": "qwen3-coder.plist",
"fast": "glm-flash.plist"
}
}
Or on Linux:
{
"services_dir": "~/.llama-services",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "llama-server-planner.service",
"coder": "llama-server-coder.service"
}
}
Use
Inside Claude Code:
You: list models
You: swap to planner
You: <discuss architecture, define interfaces>
You: swap to coder and implement the plan
That's it. Context is preserved across swaps.
You can also generate new model configs directly:
You: create a model config named "reasoning" for /models/qwen3-30b.gguf with 8192 context
How It Works
Claude Code CLI
|
| Anthropic Messages API
v
LiteLLM Proxy (:4000) <-- translates Anthropic -> OpenAI format
|
| OpenAI Chat Completions API
v
llama-server (:8000) <-- model weights swapped via service manager
^
|
mcp-llama-swap <-- this project (launchctl or systemd)
Claude Code speaks Anthropic format. LiteLLM translates to OpenAI format for llama-server. This MCP server manages which model service is loaded via launchctl (macOS) or systemd (Linux).
Conversation context survives swaps because Claude Code holds the full message history client-side and re-sends it with every request.
Model Configuration
Mapped Mode (recommended)
Define aliases for your models. Only mapped models are available. Other service configs in the directory are ignored.
macOS:
{
"plists_dir": "~/.llama-plists",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "qwen35-35b-a3b-thinking.plist",
"coder": "qwen3-coder.plist",
"fast": "glm-4-7-flash.plist"
}
}
Linux:
{
"services_dir": "~/.llama-services",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "llama-server-planner.service",
"coder": "llama-server-coder.service"
}
}
Swap using your aliases: "swap to coder", "swap to planner".
Directory Mode
Set "models": {} to auto-discover all service configs. Filenames (without extension) become the aliases.
macOS:
{
"plists_dir": "~/.llama-plists",
"models": {}
}
Linux:
{
"services_dir": "~/.llama-services",
"models": {}
}
MCP Tools
| Tool | Description |
|---|---|
list_models | Lists all configured models with load status and current mode |
get_current_model | Returns the alias of the currently loaded model |
swap_model | Unloads current model, loads the specified one, waits for health check |
create_model_config | Generates a new launchd plist (macOS) or systemd unit (Linux) for a model |
MCP Resources
| Resource | Description |
|---|---|
llama-swap://config | Current configuration as JSON |
llama-swap://status | Current model status, health, and platform info |
MCP Prompts
| Prompt | Description |
|---|---|
swap-workflow | Guided plan-then-implement workflow template |
Full Setup Guide
Prerequisites
- macOS with launchctl, or Linux with systemd
- llama-server (llama.cpp) installed
- Model configurations as service files (launchd plists or systemd units)
- Python 3.10+
- Claude Code CLI pointed at a LiteLLM proxy
1. Install mcp-llama-swap
pip install mcp-llama-swap
2. Install and start LiteLLM proxy
pip install litellm
Create litellm_config.yaml:
model_list:
- model_name: "*"
litellm_params:
model: "openai/*"
api_base: "http://localhost:8000/v1"
api_key: "sk-none"
litellm_settings:
drop_params: true
request_timeout: 300
Start it:
litellm --config litellm_config.yaml --port 4000
On macOS, you can use the included ai.litellm.proxy.plist.template to run it as a persistent launchd service (see setup.sh).
3. Point Claude Code at LiteLLM
Add to ~/.zshrc (macOS) or ~/.bashrc (Linux):
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="sk-none"
export ANTHROPIC_MODEL="local"
4. Add MCP server to Claude Code
Add to ~/.claude.json:
{
"mcpServers": {
"llama-swap": {
"command": "uvx",
"args": ["mcp-llama-swap"],
"env": {
"LLAMA_SWAP_CONFIG": "/absolute/path/to/config.json"
}
}
}
}
5. Create your config.json
Copy config.example.json (macOS) or config.example.linux.json (Linux) and edit with your model aliases and service filenames.
6. Create model service configs
You can create service configs manually, or use the create_model_config MCP tool inside Claude Code:
You: create a model config named "coder" for /path/to/model.gguf with 8192 context
This generates the appropriate launchd plist (macOS) or systemd unit file (Linux) in your services directory.
Automated Setup (macOS)
If you prefer a one-shot setup on macOS, clone this repo and run:
git clone https://github.com/oussama-kh/mcp-llama-swap.git ~/mcp-llama-swap
cd ~/mcp-llama-swap
chmod +x setup.sh
./setup.sh
The script creates a virtual environment, installs dependencies, configures the LiteLLM launchd service, and prints the exact config to add.
Configuration Reference
config.json fields:
| Field | Default | Description |
|---|---|---|
services_dir | ~/.llama-plists (macOS) / ~/.llama-services (Linux) | Directory containing model service configs |
plists_dir | — | macOS alias for services_dir (backwards compatible) |
units_dir | — | Linux alias for services_dir |
health_url | http://localhost:8000/health | llama-server health endpoint |
health_timeout | 30 | Seconds to wait for health check after loading |
models | {} | Alias-to-filename map. Empty = directory mode |
platform | auto | Service manager: auto, launchctl, or systemd |
launchctl_mode | legacy | macOS only: legacy (load/unload) or modern (bootstrap/bootout) |
Override config path via the LLAMA_SWAP_CONFIG environment variable.
Platform Details
macOS (launchctl)
Models are managed as launchd services via plist files. Two launchctl modes are available:
- Legacy (default): Uses
launchctl load/unload/list. Works on all macOS versions. - Modern: Uses
launchctl bootstrap/bootout/print. The officially supported API on newer macOS. Enable with"launchctl_mode": "modern"in config.
Linux (systemd)
Models are managed as systemd user services. Unit files in services_dir are symlinked to ~/.config/systemd/user/ and managed via systemctl --user start/stop.
Troubleshooting
LiteLLM not translating correctly: Check /tmp/litellm.stderr.log. Verify llama-server is running: curl http://localhost:8000/health.
Model swap times out: Increase health_timeout in config.json. Large models may need 30+ seconds to load weights into memory.
Claude Code cannot find the MCP server: Verify the LLAMA_SWAP_CONFIG path is absolute. Test directly: python -m mcp_llama_swap.
Mapped model not found: The service filename in models must match an actual file in your services directory.
systemd service won't start: Check journalctl --user -u llama-server-<name> for errors. Ensure llama-server is in your PATH.
launchctl modern mode issues: If bootstrap/bootout commands fail, fall back to "launchctl_mode": "legacy" in config.
Development
# Install with test dependencies
pip install -e ".[test]"
# Run tests
pytest -v
Use Case
This project enables a two-phase AI coding workflow entirely on local hardware:
- Planning phase: Load a reasoning model (e.g., Qwen3.5-35B-A3B with thinking). Discuss architecture, define interfaces, decompose requirements.
- Implementation phase: Swap to a coding model (e.g., Qwen3-Coder-30B). Execute the plan file by file with full conversation context from the planning phase.
No cloud APIs. No data leaving your machine. No context loss between phases.
License
Apache-2.0
Related Servers
Scout Monitoring MCP
sponsorPut performance and error data directly in the hands of your AI assistant.
Alpha Vantage MCP Server
sponsorAccess financial market data: realtime & historical stock, ETF, options, forex, crypto, commodities, fundamentals, technical indicators, & more
Elementor WordPress MCP Server
An MCP server for WordPress and Elementor, enabling AI assistants to manage content and build pages.
Bio-MCP FastQC Server
Provides quality control for biological sequence data using the FastQC and MultiQC tools.
Tencent Cloud Code Analysis
An official MCP server for Tencent Cloud Code Analysis (TCA) to quickly start code analysis and obtain reports.
stdout-mcp-server
Captures and manages stdout logs from multiple processes via a named pipe system for real-time debugging and analysis.
Shallow Research Code Assistant
A multi-agent AI-powered research and code assistant. Requires external API keys for LLM providers, web search, and secure code execution.
GraphQL Schema
Exposes GraphQL schema information to LLMs, allowing them to explore and understand the schema using specialized tools.
Replicate FLUX.1 Kontext [Max]
Image generation and editing using the FLUX.1 Kontext [Max] model via the Replicate API, featuring advanced text rendering and contextual understanding.
Metasploit MCP Server
An MCP server for integrating with the Metasploit Framework, enabling payload generation and management.
MCPatterns
A server for storing and retrieving personalized coding patterns from a local JSONL file.
CodeToPrompt MCP Server
An MCP server for the codetoprompt library, enabling integration with LLM agents.