LLAMA Hot Swap
MCP server for hot-swapping llama.cpp models in Claude Code - launchctl (macOS) + systemd (Linux)
mcp-llama-swap
Hot-swap llama.cpp models inside a running Claude Code session. No context loss. One command.
Plan with a reasoning model. Implement with a coding model. Same session, same context, zero manual overhead.
Supports macOS (launchctl) and Linux (systemd).
Why
Running local LLMs means choosing between a strong reasoning model and a fast coding model. You can't load both on a single machine. Manually swapping models kills your conversation context and flow.
mcp-llama-swap solves this by giving Claude Code a tool to swap the model behind llama-server via your system's service manager (launchctl on macOS, systemd on Linux), while preserving the full conversation history client-side.
Quick Start
Install
# Option A: Run directly with uvx (no install needed)
uvx mcp-llama-swap
# Option B: Install from PyPI
pip install mcp-llama-swap
Configure Claude Code
Add to ~/.claude.json:
{
"mcpServers": {
"llama-swap": {
"command": "uvx",
"args": ["mcp-llama-swap"],
"env": {
"LLAMA_SWAP_CONFIG": "/path/to/config.json"
}
}
}
}
Configure Models
Create config.json (macOS):
{
"plists_dir": "~/.llama-plists",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "qwen35-thinking.plist",
"coder": "qwen3-coder.plist",
"fast": "glm-flash.plist"
}
}
Or on Linux:
{
"services_dir": "~/.llama-services",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "llama-server-planner.service",
"coder": "llama-server-coder.service"
}
}
Use
Inside Claude Code:
You: list models
You: swap to planner
You: <discuss architecture, define interfaces>
You: swap to coder and implement the plan
That's it. Context is preserved across swaps.
You can also generate new model configs directly:
You: create a model config named "reasoning" for /models/qwen3-30b.gguf with 8192 context
How It Works
Claude Code CLI
|
| Anthropic Messages API
v
LiteLLM Proxy (:4000) <-- translates Anthropic -> OpenAI format
|
| OpenAI Chat Completions API
v
llama-server (:8000) <-- model weights swapped via service manager
^
|
mcp-llama-swap <-- this project (launchctl or systemd)
Claude Code speaks Anthropic format. LiteLLM translates to OpenAI format for llama-server. This MCP server manages which model service is loaded via launchctl (macOS) or systemd (Linux).
Conversation context survives swaps because Claude Code holds the full message history client-side and re-sends it with every request.
Model Configuration
Mapped Mode (recommended)
Define aliases for your models. Only mapped models are available. Other service configs in the directory are ignored.
macOS:
{
"plists_dir": "~/.llama-plists",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "qwen35-35b-a3b-thinking.plist",
"coder": "qwen3-coder.plist",
"fast": "glm-4-7-flash.plist"
}
}
Linux:
{
"services_dir": "~/.llama-services",
"health_url": "http://localhost:8000/health",
"health_timeout": 30,
"models": {
"planner": "llama-server-planner.service",
"coder": "llama-server-coder.service"
}
}
Swap using your aliases: "swap to coder", "swap to planner".
Directory Mode
Set "models": {} to auto-discover all service configs. Filenames (without extension) become the aliases.
macOS:
{
"plists_dir": "~/.llama-plists",
"models": {}
}
Linux:
{
"services_dir": "~/.llama-services",
"models": {}
}
MCP Tools
| Tool | Description |
|---|---|
list_models | Lists all configured models with load status and current mode |
get_current_model | Returns the alias of the currently loaded model |
swap_model | Unloads current model, loads the specified one, waits for health check |
create_model_config | Generates a new launchd plist (macOS) or systemd unit (Linux) for a model |
MCP Resources
| Resource | Description |
|---|---|
llama-swap://config | Current configuration as JSON |
llama-swap://status | Current model status, health, and platform info |
MCP Prompts
| Prompt | Description |
|---|---|
swap-workflow | Guided plan-then-implement workflow template |
Full Setup Guide
Prerequisites
- macOS with launchctl, or Linux with systemd
- llama-server (llama.cpp) installed
- Model configurations as service files (launchd plists or systemd units)
- Python 3.10+
- Claude Code CLI pointed at a LiteLLM proxy
1. Install mcp-llama-swap
pip install mcp-llama-swap
2. Install and start LiteLLM proxy
pip install litellm
Create litellm_config.yaml:
model_list:
- model_name: "*"
litellm_params:
model: "openai/*"
api_base: "http://localhost:8000/v1"
api_key: "sk-none"
litellm_settings:
drop_params: true
request_timeout: 300
Start it:
litellm --config litellm_config.yaml --port 4000
On macOS, you can use the included ai.litellm.proxy.plist.template to run it as a persistent launchd service (see setup.sh).
3. Point Claude Code at LiteLLM
Add to ~/.zshrc (macOS) or ~/.bashrc (Linux):
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="sk-none"
export ANTHROPIC_MODEL="local"
4. Add MCP server to Claude Code
Add to ~/.claude.json:
{
"mcpServers": {
"llama-swap": {
"command": "uvx",
"args": ["mcp-llama-swap"],
"env": {
"LLAMA_SWAP_CONFIG": "/absolute/path/to/config.json"
}
}
}
}
5. Create your config.json
Copy config.example.json (macOS) or config.example.linux.json (Linux) and edit with your model aliases and service filenames.
6. Create model service configs
You can create service configs manually, or use the create_model_config MCP tool inside Claude Code:
You: create a model config named "coder" for /path/to/model.gguf with 8192 context
This generates the appropriate launchd plist (macOS) or systemd unit file (Linux) in your services directory.
Automated Setup (macOS)
If you prefer a one-shot setup on macOS, clone this repo and run:
git clone https://github.com/oussama-kh/mcp-llama-swap.git ~/mcp-llama-swap
cd ~/mcp-llama-swap
chmod +x setup.sh
./setup.sh
The script creates a virtual environment, installs dependencies, configures the LiteLLM launchd service, and prints the exact config to add.
Configuration Reference
config.json fields:
| Field | Default | Description |
|---|---|---|
services_dir | ~/.llama-plists (macOS) / ~/.llama-services (Linux) | Directory containing model service configs |
plists_dir | — | macOS alias for services_dir (backwards compatible) |
units_dir | — | Linux alias for services_dir |
health_url | http://localhost:8000/health | llama-server health endpoint |
health_timeout | 30 | Seconds to wait for health check after loading |
models | {} | Alias-to-filename map. Empty = directory mode |
platform | auto | Service manager: auto, launchctl, or systemd |
launchctl_mode | legacy | macOS only: legacy (load/unload) or modern (bootstrap/bootout) |
Override config path via the LLAMA_SWAP_CONFIG environment variable.
Platform Details
macOS (launchctl)
Models are managed as launchd services via plist files. Two launchctl modes are available:
- Legacy (default): Uses
launchctl load/unload/list. Works on all macOS versions. - Modern: Uses
launchctl bootstrap/bootout/print. The officially supported API on newer macOS. Enable with"launchctl_mode": "modern"in config.
Linux (systemd)
Models are managed as systemd user services. Unit files in services_dir are symlinked to ~/.config/systemd/user/ and managed via systemctl --user start/stop.
Troubleshooting
LiteLLM not translating correctly: Check /tmp/litellm.stderr.log. Verify llama-server is running: curl http://localhost:8000/health.
Model swap times out: Increase health_timeout in config.json. Large models may need 30+ seconds to load weights into memory.
Claude Code cannot find the MCP server: Verify the LLAMA_SWAP_CONFIG path is absolute. Test directly: python -m mcp_llama_swap.
Mapped model not found: The service filename in models must match an actual file in your services directory.
systemd service won't start: Check journalctl --user -u llama-server-<name> for errors. Ensure llama-server is in your PATH.
launchctl modern mode issues: If bootstrap/bootout commands fail, fall back to "launchctl_mode": "legacy" in config.
Development
# Install with test dependencies
pip install -e ".[test]"
# Run tests
pytest -v
Use Case
This project enables a two-phase AI coding workflow entirely on local hardware:
- Planning phase: Load a reasoning model (e.g., Qwen3.5-35B-A3B with thinking). Discuss architecture, define interfaces, decompose requirements.
- Implementation phase: Swap to a coding model (e.g., Qwen3-Coder-30B). Execute the plan file by file with full conversation context from the planning phase.
No cloud APIs. No data leaving your machine. No context loss between phases.
License
Apache-2.0
関連サーバー
Scout Monitoring MCP
スポンサーPut performance and error data directly in the hands of your AI assistant.
Alpha Vantage MCP Server
スポンサーAccess financial market data: realtime & historical stock, ETF, options, forex, crypto, commodities, fundamentals, technical indicators, & more
Remote MCP Server (Authless)
A remote MCP server deployable on Cloudflare Workers without authentication.
ReAPI OpenAPI
Serves multiple OpenAPI specifications to enable LLM-powered IDE integrations.
MCP Server with Ollama Integration
An MCP server that integrates with Ollama to provide tools for file operations, calculations, and text processing. Requires a running Ollama instance.
OpenRouter MCP Client for Cursor
An MCP client for Cursor that uses OpenRouter.ai to access multiple AI models. Requires an OpenRouter API key.
JavaScript Sandbox
Provides a secure JavaScript execution environment for running code snippets.
LetzAI
An MCP server for image generation using the LetzAI API.
AC to Automation Converter
An AI-powered system that converts Acceptance Criteria (AC) from QA specifications into automated browser testing workflows.
xctools
🍎 MCP server for Xcode's xctrace, xcrun, xcodebuild.
MCP Server
Automate data science stages using your own CSV data files.
RefactorMCP
Automated refactoring tools for C# code transformation using Roslyn.