mcp-llama-swap
Hot-swap llama.cpp models inside a running Claude Code session. No context loss. One command.
Plan with a reasoning model. Implement with a coding model. Same session, same context, zero manual overhead.
Supports macOS (launchctl) and Linux (systemd).
Why
Running local LLMs usually means choosing between a strong reasoning model and a fast coding model; most machines can't hold both in memory at once. Manually swapping models kills your conversation context and flow.
mcp-llama-swap solves this by giving Claude Code a tool to swap the model behind llama-server via your system's service manager (launchctl on macOS, systemd on Linux), while preserving the full conversation history client-side.
Quick Start
Install
```sh
# Option A: Run directly with uvx (no install needed)
uvx mcp-llama-swap

# Option B: Install from PyPI
pip install mcp-llama-swap
```
Configure Claude Code
Add to ~/.claude.json:
```json
{
  "mcpServers": {
    "llama-swap": {
      "command": "uvx",
      "args": ["mcp-llama-swap"],
      "env": {
        "LLAMA_SWAP_CONFIG": "/path/to/config.json"
      }
    }
  }
}
```
Configure Models
Create config.json (macOS):
```json
{
  "plists_dir": "~/.llama-plists",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "qwen35-thinking.plist",
    "coder": "qwen3-coder.plist",
    "fast": "glm-flash.plist"
  }
}
```
Or on Linux:
```json
{
  "services_dir": "~/.llama-services",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "llama-server-planner.service",
    "coder": "llama-server-coder.service"
  }
}
```
Use
Inside Claude Code:
```
You: list models
You: swap to planner
You: <discuss architecture, define interfaces>
You: swap to coder and implement the plan
```
That's it. Context is preserved across swaps.
You can also generate new model configs directly:
You: create a model config named "reasoning" for /models/qwen3-30b.gguf with 8192 context
How It Works
```
Claude Code CLI
    |
    |  Anthropic Messages API
    v
LiteLLM Proxy (:4000)     <-- translates Anthropic -> OpenAI format
    |
    |  OpenAI Chat Completions API
    v
llama-server (:8000)      <-- model weights swapped via service manager
    ^
    |
mcp-llama-swap            <-- this project (launchctl or systemd)
```
Claude Code speaks Anthropic format. LiteLLM translates to OpenAI format for llama-server. This MCP server manages which model service is loaded via launchctl (macOS) or systemd (Linux).
Conversation context survives swaps because Claude Code holds the full message history client-side and re-sends it with every request.
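The swap itself hinges on the health check: the old service is stopped, the new one is started, and the server then polls `health_url` until it responds or `health_timeout` expires. A minimal sketch of that polling step (hypothetical code, not the project's actual implementation; the probe is injected so the service-manager and HTTP details stay out of the loop):

```python
import time
from typing import Callable


def wait_for_health(probe: Callable[[], bool],
                    timeout: float = 30.0,
                    interval: float = 0.5) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    In a real server, `probe` would be an HTTP GET against health_url
    (e.g. http://localhost:8000/health) that returns True on HTTP 200.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Large models can take tens of seconds to load weights into memory, which is why `health_timeout` is configurable rather than fixed.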
Model Configuration
Mapped Mode (recommended)
Define aliases for your models. Only mapped models are available. Other service configs in the directory are ignored.
macOS:
```json
{
  "plists_dir": "~/.llama-plists",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "qwen35-35b-a3b-thinking.plist",
    "coder": "qwen3-coder.plist",
    "fast": "glm-4-7-flash.plist"
  }
}
```
Linux:
```json
{
  "services_dir": "~/.llama-services",
  "health_url": "http://localhost:8000/health",
  "health_timeout": 30,
  "models": {
    "planner": "llama-server-planner.service",
    "coder": "llama-server-coder.service"
  }
}
```
Swap using your aliases: "swap to coder", "swap to planner".
Directory Mode
Set "models": {} to auto-discover all service configs. Filenames (without extension) become the aliases.
macOS:
```json
{
  "plists_dir": "~/.llama-plists",
  "models": {}
}
```
Linux:
```json
{
  "services_dir": "~/.llama-services",
  "models": {}
}
```
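In directory mode the alias set falls out of the filenames. A sketch of that discovery step (hypothetical; assumes the stem of each `.plist` or `.service` file becomes the alias, as described above):

```python
from pathlib import Path

SERVICE_EXTS = {".plist", ".service"}


def discover_aliases(services_dir: str) -> dict[str, str]:
    """Map each service config's filename stem to its filename.

    Files with other extensions in the directory are ignored.
    """
    root = Path(services_dir).expanduser()
    return {p.stem: p.name
            for p in sorted(root.iterdir())
            if p.suffix in SERVICE_EXTS}
```

For example, `qwen3-coder.plist` would become the alias `qwen3-coder`.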
MCP Tools
| Tool | Description |
|---|---|
| `list_models` | Lists all configured models with load status and current mode |
| `get_current_model` | Returns the alias of the currently loaded model |
| `swap_model` | Unloads the current model, loads the specified one, waits for the health check |
| `create_model_config` | Generates a new launchd plist (macOS) or systemd unit (Linux) for a model |
MCP Resources
| Resource | Description |
|---|---|
| `llama-swap://config` | Current configuration as JSON |
| `llama-swap://status` | Current model status, health, and platform info |
MCP Prompts
| Prompt | Description |
|---|---|
| `swap-workflow` | Guided plan-then-implement workflow template |
Full Setup Guide
Prerequisites
- macOS with launchctl, or Linux with systemd
- llama-server (llama.cpp) installed
- Model configurations as service files (launchd plists or systemd units)
- Python 3.10+
- Claude Code CLI pointed at a LiteLLM proxy
1. Install mcp-llama-swap
```sh
pip install mcp-llama-swap
```
2. Install and start LiteLLM proxy
```sh
pip install litellm
```
Create litellm_config.yaml:
```yaml
model_list:
  - model_name: "*"
    litellm_params:
      model: "openai/*"
      api_base: "http://localhost:8000/v1"
      api_key: "sk-none"

litellm_settings:
  drop_params: true
  request_timeout: 300
```
Start it:
```sh
litellm --config litellm_config.yaml --port 4000
```
On macOS, you can use the included ai.litellm.proxy.plist.template to run it as a persistent launchd service (see setup.sh).
3. Point Claude Code at LiteLLM
Add to ~/.zshrc (macOS) or ~/.bashrc (Linux):
```sh
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="sk-none"
export ANTHROPIC_MODEL="local"
```
4. Add MCP server to Claude Code
Add to ~/.claude.json:
```json
{
  "mcpServers": {
    "llama-swap": {
      "command": "uvx",
      "args": ["mcp-llama-swap"],
      "env": {
        "LLAMA_SWAP_CONFIG": "/absolute/path/to/config.json"
      }
    }
  }
}
```
5. Create your config.json
Copy config.example.json (macOS) or config.example.linux.json (Linux) and edit with your model aliases and service filenames.
6. Create model service configs
You can create service configs manually, or use the create_model_config MCP tool inside Claude Code:
You: create a model config named "coder" for /path/to/model.gguf with 8192 context
This generates the appropriate launchd plist (macOS) or systemd unit file (Linux) in your services directory.
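For orientation, a generated systemd unit might look roughly like the following. This is illustrative only: the exact template, install path, and llama-server flags are assumptions, not the tool's verbatim output.

```ini
# llama-server-coder.service (hypothetical generated output)
[Unit]
Description=llama-server (coder)
After=network.target

[Service]
# -m selects the model file, -c the context size, --port the listen port
ExecStart=/usr/local/bin/llama-server -m /path/to/model.gguf -c 8192 --port 8000
Restart=on-failure

[Install]
WantedBy=default.target
```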
Automated Setup (macOS)
If you prefer a one-shot setup on macOS, clone this repo and run:
```sh
git clone https://github.com/oussama-kh/mcp-llama-swap.git ~/mcp-llama-swap
cd ~/mcp-llama-swap
chmod +x setup.sh
./setup.sh
```
The script creates a virtual environment, installs dependencies, configures the LiteLLM launchd service, and prints the exact config to add.
Configuration Reference
config.json fields:
| Field | Default | Description |
|---|---|---|
| `services_dir` | `~/.llama-plists` (macOS) / `~/.llama-services` (Linux) | Directory containing model service configs |
| `plists_dir` | — | macOS alias for `services_dir` (backwards compatible) |
| `units_dir` | — | Linux alias for `services_dir` |
| `health_url` | `http://localhost:8000/health` | llama-server health endpoint |
| `health_timeout` | `30` | Seconds to wait for the health check after loading |
| `models` | `{}` | Alias-to-filename map; empty means directory mode |
| `platform` | `auto` | Service manager: `auto`, `launchctl`, or `systemd` |
| `launchctl_mode` | `legacy` | macOS only: `legacy` (load/unload) or `modern` (bootstrap/bootout) |
Override config path via the LLAMA_SWAP_CONFIG environment variable.
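Putting the table together, config loading with these defaults and field aliases can be sketched as follows (hypothetical; the real server's field handling may differ):

```python
import json
import os
from pathlib import Path

# Defaults as documented in the configuration reference table
DEFAULTS = {
    "health_url": "http://localhost:8000/health",
    "health_timeout": 30,
    "models": {},            # empty mapping => directory mode
    "platform": "auto",
    "launchctl_mode": "legacy",
}


def load_config(default_path: str = "config.json") -> dict:
    """Read config.json (or $LLAMA_SWAP_CONFIG) and apply documented defaults."""
    path = os.environ.get("LLAMA_SWAP_CONFIG", default_path)
    cfg = {**DEFAULTS, **json.loads(Path(path).read_text())}
    # plists_dir (macOS) and units_dir (Linux) are accepted aliases
    if "services_dir" not in cfg:
        alias = cfg.get("plists_dir") or cfg.get("units_dir")
        if alias:
            cfg["services_dir"] = alias
    return cfg
```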
Platform Details
macOS (launchctl)
Models are managed as launchd services via plist files. Two launchctl modes are available:
- **Legacy** (default): uses `launchctl load`/`unload`/`list`. Works on all macOS versions.
- **Modern**: uses `launchctl bootstrap`/`bootout`/`print`, the officially supported API on newer macOS versions. Enable with `"launchctl_mode": "modern"` in config.
Linux (systemd)
Models are managed as systemd user services. Unit files in services_dir are symlinked to ~/.config/systemd/user/ and managed via systemctl --user start/stop.
Troubleshooting
LiteLLM not translating correctly: Check /tmp/litellm.stderr.log. Verify llama-server is running: curl http://localhost:8000/health.
Model swap times out: Increase health_timeout in config.json. Large models may need 30+ seconds to load weights into memory.
Claude Code cannot find the MCP server: Verify the LLAMA_SWAP_CONFIG path is absolute. Test directly: python -m mcp_llama_swap.
Mapped model not found: The service filename in models must match an actual file in your services directory.
systemd service won't start: Check journalctl --user -u llama-server-<name> for errors. Ensure llama-server is in your PATH.
launchctl modern mode issues: If bootstrap/bootout commands fail, fall back to "launchctl_mode": "legacy" in config.
Development
```sh
# Install with test dependencies
pip install -e ".[test]"

# Run tests
pytest -v
```
Use Case
This project enables a two-phase AI coding workflow entirely on local hardware:
- Planning phase: Load a reasoning model (e.g., Qwen3.5-35B-A3B with thinking). Discuss architecture, define interfaces, decompose requirements.
- Implementation phase: Swap to a coding model (e.g., Qwen3-Coder-30B). Execute the plan file by file with full conversation context from the planning phase.
No cloud APIs. No data leaving your machine. No context loss between phases.
License
Apache-2.0