AgentHub

Self-hosted MCP platform with multiple tools, multi-agent orchestration, web UI, and ML capabilities

MCP Multi-Server Architecture

A Model Context Protocol (MCP) implementation with distributed multi-server architecture, Agent-to-Agent (A2A) protocol support, and ML-powered recommendations.


Prerequisites

  • Python 3.10+
  • 16GB+ RAM recommended
  • One of:
    • Ollama installed, or
    • A GGUF model file

1. Quick Start

Get the client running in 3 steps:

Install Dependencies

# Clone repository
git clone <repo-url>
cd mcp_a2a

# Create virtual environment
python -m venv .venv

# Activate (Linux/macOS)
source .venv/bin/activate

# Activate (Windows PowerShell)
.venv\Scripts\Activate.ps1

# Install requirements
pip install -r requirements.txt

Choose LLM Backend

Option A: Ollama (recommended starting point)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve

# Download a model (use 14B+ for best results)
ollama pull qwen2.5:14b

Option B: GGUF (local model files)

# Download a GGUF model (example)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# Register the model
# (After starting client, use :gguf add command)

Start the Client

python client.py

Access web UI at: http://localhost:9000

That's it! The client auto-discovers all MCP servers and tools.


2. Using MCP Servers with Other Clients

Use these MCP servers with Claude Desktop, Cline, or any MCP-compatible client.

Example Configuration

Add to your MCP client config (e.g., claude_desktop_config.json):

{
    "mcpServers": {
        "code_review": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/code_review/server.py"]
        },
        "location": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/location/server.py"]
        },
        "plex": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/plex/server.py"]
        },
        "rag": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/rag/server.py"]
        },
        "system_tools": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/system_tools/server.py"]
        },
        "text_tools": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/text_tools/server.py"]
        },
        "todo": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/todo/server.py"]
        },
        "knowledge_base": {
            "command": "/path/to/mcp_a2a/.venv/bin/python",
            "args": ["/path/to/mcp_a2a/servers/knowledge_base/server.py"]
        }
    }
}

Windows paths:

"command": "C:\\path\\to\\mcp_a2a\\.venv\\Scripts\\python.exe"

Available servers:

  • code_review - Code analysis (5 tools)
  • location - Weather, time, location (3 tools)
  • plex - Media library + ML recommendations (17 tools)
  • rag - Vector search (4 tools)
  • system_tools - System info (4 tools)
  • text_tools - Text processing (7 tools)
  • todo - Task management (6 tools)
  • knowledge_base - Notes management (10 tools)

3. Client Configuration

Environment Variables

Create .env in project root:

# === LLM Backend ===
MAX_MESSAGE_HISTORY=30          # Chat history limit (default: 20)
LLM_TEMPERATURE=0.3             # Model temperature 0 to 1 (default: 0.3)

# === GGUF Configuration (if using GGUF backend) ===
GGUF_GPU_LAYERS=-1              # -1 = all GPU, 0 = CPU only, N = N layers on GPU
GGUF_CONTEXT_SIZE=4096          # Context window size
GGUF_BATCH_SIZE=512             # Batch size for processing

# === API Keys (optional services) ===
PLEX_URL=http://localhost:32400  # Plex server URL
PLEX_TOKEN=your_token_here       # Get from Plex account settings
WEATHER_TOKEN=your_token_here    # OpenWeatherMap API key
LANGSEARCH_TOKEN=your_token_here # LangSearch API key (https://langsearch.com)

# === A2A Protocol (optional distributed mode) ===
A2A_ENDPOINTS=http://localhost:8010  # Comma-separated endpoints
A2A_EXPOSED_TOOLS=                   # Tool categories to expose (empty = all)

# === Performance Tuning (optional) ===
CONCURRENT_LIMIT=3              # Parallel ingestion jobs (default: 1)
EMBEDDING_BATCH_SIZE=50         # Embeddings per batch (default: 20)
DB_FLUSH_BATCH_SIZE=50          # DB inserts per batch (default: 30)

# === Tool Control (optional) ===
DISABLED_TOOLS=knowledge_base:*,todo:*  # Disable specific tools/categories

Configuration Details

LLM Backend:

  • ollama: Uses Ollama server (requires ollama serve running)
  • gguf: Uses local GGUF model files (GPU recommended)

GGUF GPU Layers:

  • -1: All layers on GPU (fastest; requires the model to fit in VRAM)
  • 0: CPU only (slow, but works with any model size)
  • 20: 20 layers on GPU (a balance for large models on limited VRAM)

Performance Tuning:

  • EMBEDDING_BATCH_SIZE=50 + DB_FLUSH_BATCH_SIZE=50 = ~6x faster RAG ingestion
  • With 12GB of VRAM, you can increase both to 100 for even faster processing
  • CONCURRENT_LIMIT=2 enables parallel media ingestion

Disabled Tools:

  • Format: category:tool_name or category:*
  • Example: DISABLED_TOOLS=todo:delete_all_todo_items,system:*
  • Disabled tools are hidden from the :tools list and return an error if called (see the matching sketch below)
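
The DISABLED_TOOLS format is simple to reason about. Below is a minimal sketch of how such patterns could be matched; the function name and parsing details are illustrative assumptions, not the project's actual code:

def is_disabled(category: str, tool_name: str, disabled_spec: str) -> bool:
    """Return True if category:tool_name matches an entry in DISABLED_TOOLS."""
    for entry in disabled_spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        cat, _, name = entry.partition(":")
        if cat == category and name in ("*", tool_name):
            return True
    return False

# is_disabled("todo", "delete_all_todo_items", "todo:delete_all_todo_items,system:*")  -> True
# is_disabled("system", "get_system_info", "todo:delete_all_todo_items,system:*")      -> True (wildcard; tool name here is hypothetical)
# is_disabled("rag", "rag_search_tool", "todo:delete_all_todo_items,system:*")         -> False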

Available Commands

These work in both CLI and web UI:

:commands              - List all available commands
:clear history         - Clear all chat history
:clear session <id>    - Clear session
:sessions              - List all sessions
:stop                  - Stop current operation
:stats                 - Show performance metrics
:tools                 - List available tools (hides disabled)
:tools --all           - Show all tools including disabled
:tool <name>           - Get tool description
:model                 - List all available models
:model <name>          - Switch to a model (auto-detects backend)
:models                - List models (legacy)
:sync                  - Sync to model in last_model.txt
:gguf add <path>       - Register a GGUF model
:gguf remove <alias>   - Remove a GGUF model
:gguf list             - List registered GGUF models
:a2a on                - Enable agent-to-agent mode
:a2a off               - Disable agent-to-agent mode
:a2a status            - Check A2A system status
:env                   - Show environment configuration

API Setup

Weather (OpenWeatherMap):

  1. Sign up at https://openweathermap.org/api
  2. Get API key from account settings
  3. Add to .env: WEATHER_TOKEN=your_key

LangSearch (web search):

  1. Sign up at https://langsearch.com
  2. Get API key from dashboard
  3. Add to .env: LANGSEARCH_TOKEN=your_key

Plex Media Server:

  1. Open Plex web interface
  2. Settings → Network → Show Advanced
  3. Copy server URL (e.g., http://192.168.1.100:32400)
  4. Get token: Settings → Account → Show XML → Copy authToken
  5. Add to .env:
   PLEX_URL=http://your_server_ip:32400
   PLEX_TOKEN=your_token

4. Adding Tools (Developer Guide)

Step 1: Create Tool Server

# Create server directory
mkdir servers/my_tool

# Create server file
touch servers/my_tool/server.py

Step 2: Implement Tool

# servers/my_tool/server.py
# FastMCP (from the official MCP Python SDK) builds the tool schema
# from the function signature and docstring.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my_tool-server")

@mcp.tool()
def my_function(arg1: str, arg2: int) -> str:
    """
    Short description of what this tool does.

    Detailed explanation of behavior, use cases, etc.

    Args:
        arg1: Description of arg1
        arg2: Description of arg2

    Returns:
        Description of return value
    """
    return f"Processed {arg1} with {arg2}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default

Step 3: Create Skill Documentation (Optional)

# Create skills directory
mkdir -p servers/my_tool/skills

# Create skill file
touch servers/my_tool/skills/my_feature.md

Example content for my_feature.md:

# My Feature Skill

This skill enables X functionality.

## When to Use
- Use case 1
- Use case 2

## Examples
User: "Do something"
Assistant: [calls my_function with appropriate args]

Step 4: Update Intent Patterns (Optional)

If your tool needs specific routing, update client/langgraph.py:

INTENT_PATTERNS = {
    # ... existing patterns ...
    "my_tool": {
        "pattern": r'\bmy keyword\b|\bmy phrase\b',
        "tools": ["my_function"],
        "priority": 3
    }
}

Step 5: Add External MCP Servers (Optional)

To connect external or third-party MCP servers without writing a local server, create client/external_servers.json. The client auto-discovers this file on startup — no code changes needed.

SSE transport (remote HTTP-based servers):

{
    "external_servers": {
        "deepwiki": {
            "transport": "sse",
            "url": "https://mcp.deepwiki.com/mcp",
            "enabled": true,
            "notes": "DeepWiki — reads wiki content from GitHub repos"
        }
    }
}

Stdio transport (local process servers, e.g. IDE integrations):

{
    "external_servers": {
        "pycharm": {
            "transport": "stdio",
            "command": "/usr/lib/jvm/jdk-17.0.12-oracle-x64/bin/java",
            "args": [
                "-classpath",
                "/path/to/pycharm/plugins/mcpserver/lib/mcpserver-frontend.jar",
                "com.intellij.mcpserver.stdio.McpStdioRunnerKt"
            ],
            "env": {
                "IJ_MCP_SERVER_PORT": "64342"
            },
            "enabled": true,
            "notes": "PyCharm MCP integration — requires PyCharm to be running"
        }
    }
}

Field reference:

  • transport (required) - "sse" or "stdio"
  • url (SSE only) - Full URL to the SSE endpoint
  • command (stdio only) - Path to the executable
  • args (stdio only) - Command-line arguments as an array
  • env (optional) - Environment variables passed to the process
  • cwd (optional) - Working directory (defaults to project root)
  • enabled (optional) - false skips the server without removing it (default: true)
  • notes (optional) - Human-readable description, ignored by the client

WSL2 note: If running the client in WSL2 and connecting to a Windows-hosted stdio process, set IJ_MCP_SERVER_HOST in env to the Windows host IP (find it with cat /etc/resolv.conf | grep nameserver). Stdio servers that define IJ_MCP_SERVER_PORT are port-checked before registration; if the port is unreachable, the server is skipped cleanly rather than crashing the client (a sketch of such a check follows).
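
A minimal sketch of such a reachability check, assuming a plain TCP connect is enough to decide whether to register the server (the client's actual check may differ):

import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: decide whether to register the PyCharm stdio server
host = "127.0.0.1"  # under WSL2, use the Windows host IP from /etc/resolv.conf instead
if not port_reachable(host, 64342):
    print("IJ_MCP_SERVER_PORT unreachable - skipping this external server")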

Step 6: Test & Deploy

# Restart client (auto-discovers new server)
python client.py

# Test in CLI or web UI
> test my new tool

5. Distributed Mode (A2A Protocol)

Run tools on remote servers and expose them via HTTP.

Setup A2A Server

Terminal 1 - Start A2A server:

python a2a_server.py

Server starts on http://localhost:8010

Terminal 2 - Start client:

python client.py

Client auto-connects to A2A endpoints in .env

Control Exposed Tools

Use A2A_EXPOSED_TOOLS to control which categories are publicly accessible:

# Expose specific categories (comma-separated)
A2A_EXPOSED_TOOLS=plex,location,text_tools

# Expose everything (default)
A2A_EXPOSED_TOOLS=

# Available categories:
# plex, location, text_tools, system_tools, code_review,
# rag, todo, knowledge_base

Security:

  • Empty = all 8 servers exposed (56 tools)
  • Specified = only listed categories exposed
  • Exclude todo, knowledge_base, rag to protect personal data

Multi-Endpoint Support

Connect to multiple A2A servers:

# In .env
A2A_ENDPOINTS=http://localhost:8010,http://gpu-server:8020

Client aggregates tools from all successful connections.
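
A minimal sketch of that aggregation using the requests package, assuming each endpoint answers GET /tool-categories with a JSON body (the exact response shape is an assumption):

import os
import requests

endpoints = [e.strip() for e in os.getenv("A2A_ENDPOINTS", "").split(",") if e.strip()]
aggregated = {}

for endpoint in endpoints:
    try:
        resp = requests.get(f"{endpoint}/tool-categories", timeout=5)
        resp.raise_for_status()
        aggregated[endpoint] = resp.json()
    except requests.RequestException as err:
        # Unreachable endpoints are skipped; tools from the remaining ones stay available
        print(f"Skipping {endpoint}: {err}")

print(aggregated)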

Check Available Tools

Via HTTP:

curl http://localhost:8010/tool-categories

Via Client:

> :a2a status

6. Testing

Running Tests

The project includes a comprehensive test suite with 44+ tests covering unit, integration, and end-to-end scenarios.

Run all tests:

pytest

Run specific test categories:

pytest -m unit              # Fast unit tests only
pytest -m integration       # Integration tests
pytest -m e2e              # End-to-end tests

Run with coverage:

pytest -c tests/pytest.coverage.ini

Run specific test file:

pytest tests/unit/test_session_manager.py
pytest tests/integration/test_websocket_flow.py

Test Reports

After running tests, reports are automatically generated in tests/results/:

  • junit.xml - Test results for CI/CD (Jenkins, GitHub Actions)
  • coverage.xml - Coverage data for Codecov/Coveralls
  • test-report.html - Interactive HTML test report
  • coverage-report.html - Coverage overview with per-file metrics

View HTML reports:

# Test results
open tests/results/test-report.html

# Coverage report
open tests/results/coverage-report.html

Test Structure

tests/
├── conftest.py              # Shared fixtures & config
├── pytest.ini              # Test configuration
├── unit/                   # Fast, isolated tests
│   ├── test_session_manager.py
│   ├── test_models.py
│   ├── test_context_tracker.py
│   ├── test_intent_patterns.py
│   └── test_code_review_tools.py
├── integration/            # Multiple component tests
│   ├── test_websocket_flow.py
│   └── test_langgraph_agent.py
├── e2e/                   # Full system tests
│   └── test_full_conversation.py
└── results/               # Generated reports
    ├── junit.xml
    ├── coverage.xml
    ├── test-report.html
    ├── coverage-report.html
    └── generate_html.py

Writing Tests

Tests use pytest with async support and comprehensive fixtures:

import pytest
from unittest.mock import MagicMock

@pytest.mark.unit
def test_create_session(session_manager):
    """Test session creation"""
    session_id = session_manager.create_session("Test Session")
    
    assert session_id is not None
    assert session_id > 0

@pytest.mark.integration
@pytest.mark.asyncio
async def test_full_workflow(session_manager, mock_llm):
    """Test complete conversation workflow"""
    # Test implementation
    pass

Available fixtures:

  • session_manager - Temporary database session manager
  • mock_llm - Mocked LLM for testing
  • mock_websocket - Mocked WebSocket connection
  • temp_dir - Temporary directory for test files
  • sample_python_file - Sample code for testing (used in the example below)
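
An illustrative test combining two of these fixtures. The exact objects the fixtures return depend on conftest.py; this sketch assumes temp_dir is a pathlib.Path and sample_python_file is a file path:

import pytest

@pytest.mark.unit
def test_report_written_to_temp_dir(temp_dir, sample_python_file):
    """Hypothetical example: write a small report about the sample file."""
    report = temp_dir / "report.txt"
    report.write_text(f"reviewed: {sample_python_file}")

    assert report.exists()
    assert "reviewed" in report.read_text()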

Test Dependencies

# Install test dependencies
pip install pytest pytest-asyncio pytest-cov pytest-timeout pytest-xdist

CI/CD Integration

GitHub Actions:

- name: Run tests
  run: pytest

- name: Upload coverage
  uses: codecov/codecov-action@v3
  with:
    files: tests/results/coverage.xml

GitLab CI:

test:
  script:
    - pytest
  artifacts:
    reports:
      junit: tests/results/junit.xml
      coverage_report:
        coverage_format: cobertura
        path: tests/results/coverage.xml

Coverage Goals

  • Current coverage: ~90% of core client code
  • Per-module coverage:
    • session_manager.py (90%+)
    • context_tracker.py (62%+)
    • models.py (40%+)
    • query_patterns.py (66%+)

Running Tests from PyCharm

  1. Right-click tests/ folder → Run 'pytest in tests'
  2. Or use the green play button next to test functions
  3. Reports auto-generate in tests/results/

Troubleshooting Tests

Import errors:

  • Ensure you're in the project root: cd /path/to/mcp_a2a
  • Activate virtual environment: source .venv/bin/activate
  • Install dependencies: pip install -r requirements.txt

Async test failures:

  • Install pytest-asyncio: pip install pytest-asyncio
  • Tests are auto-marked with @pytest.mark.asyncio

Coverage not working:

  • Install pytest-cov: pip install pytest-cov
  • Use coverage config: pytest -c tests/pytest.coverage.ini

7. Architecture

Multi-Server Design

8 specialized MCP servers communicate via stdio:

servers/
├── code_review/       5 tools  - Code analysis
├── knowledge_base/   10 tools  - Notes management
├── location/          3 tools  - Weather, time, location
├── plex/             17 tools  - Media + ML recommendations
├── rag/               4 tools  - Vector search
├── system_tools/      4 tools  - System info
├── text_tools/        7 tools  - Text processing
└── todo/              6 tools  - Task management

Total: 56 local tools
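
A minimal sketch of how the servers/ layout can be enumerated for discovery; the client's actual discovery logic may differ:

from pathlib import Path

# Each server lives at servers/<name>/server.py and speaks MCP over stdio
servers_dir = Path("servers")
discovered = {
    entry.name: entry / "server.py"
    for entry in sorted(servers_dir.iterdir())
    if entry.is_dir() and (entry / "server.py").exists()
}
print(discovered)  # e.g. {'code_review': PosixPath('servers/code_review/server.py'), ...}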

Directory Structure

mcp_a2a/
├── servers/           # MCP servers (stdio)
│   ├── plex/
│   │   ├── server.py
│   │   ├── ml_recommender.py
│   │   └── skills/
│   └── ...
├── a2a_server.py     # A2A HTTP server
├── client.py         # AI agent client
├── client/
│   ├── ui/
│   │   ├── index.html      # Web UI
│   │   └── dashboard.html  # Dashboard UI
│   ├── langgraph.py  # Agent execution
│   ├── websocket.py  # WebSocket server
│   └── ...
└── tools/            # Tool implementations

8. Example Prompts & Troubleshooting

Example Prompts

Weather:

> What's the weather in Vancouver?

Plex ML Recommendations:

> What should I watch tonight?
> Recommend unwatched SciFi movies
> Show recommender stats

Code Analysis:

> Analyze the code in /path/to/project
> Review this Python file for bugs

Task Management:

> Add "deploy feature" to my todos
> List my todos

Web Search (via LangSearch):

> Who won the 2024 NBA championship?
> Latest AI developments

RAG (Retrieval-Augmented Generation):

Automatic ingestion from web research:

> Write a report about quantum computing using 
  https://en.wikipedia.org/wiki/Quantum_computing and
  https://en.wikipedia.org/wiki/Quantum_algorithm as sources

✅ Fetches both Wikipedia pages
✅ Automatically stores content in RAG (16 chunks, ~2.5s)
✅ Generates report using the content
💾 Content available for future searches

Retrieving stored content:

> Use the rag_search_tool to search for "quantum entanglement"
> What do you have in the RAG about algorithm complexity?
> Search the rag for information about superposition

Checking RAG status:

> What's in the rag database?
> Show rag stats

How RAG works:

  • When you research topics using URLs, content is automatically chunked (350 tokens max), embedded using bge-large, and stored in SQLite
  • URLs are deduplicated - the same page won't be stored twice
  • Semantic search finds the most relevant content even if exact keywords don't match
  • Search returns the top 5 results with similarity scores and source URLs (see the retrieval sketch below)
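
A minimal sketch of the retrieval step, assuming chunk embeddings are stored alongside their text and source URL and ranked by cosine similarity (the project's actual schema and scoring may differ):

import numpy as np

def top_k(query_vec: np.ndarray, stored: list, k: int = 5):
    """Rank stored chunks by cosine similarity to the query embedding.

    stored is a list of (source_url, chunk_text, embedding) tuples.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = [(cosine(query_vec, emb), url, text) for url, text, emb in stored]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]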

Troubleshooting

Ollama models not appearing:

# Make sure Ollama is running
ollama serve

# Check models are downloaded
ollama list

# Restart client
python client.py

GGUF model won't load:

  • Check model size vs VRAM (use models <7GB for 12GB VRAM)
  • Reduce GPU layers: export GGUF_GPU_LAYERS=20
  • Increase timeout: export GGUF_LOAD_TIMEOUT=300
  • Use CPU only: export GGUF_GPU_LAYERS=0

Web UI won't load:

# Check ports are available: 8765, 8766, 9000
netstat -an | grep LISTEN

# Try localhost directly
http://localhost:9000

A2A server not connecting:

# Verify server is running
curl http://localhost:8010/.well-known/agent-card.json

# Check A2A_ENDPOINTS in .env

LangSearch not working:

  • Verify LANGSEARCH_TOKEN in .env
  • Check API key at https://langsearch.com
  • System falls back to LLM if unavailable

RAG search returns wrong results:

  • RAG uses semantic similarity - it returns the closest matches even if they're not exact
  • If searching for content that doesn't exist, it returns the most similar available content
  • Check what's in the database: > show rag stats
  • Content is only stored after researching URLs or manually adding via rag_add_tool

RAG ingestion is slow:

  • Current performance: ~2.5 seconds for 16 chunks (10,000 characters)
  • Embeddings: ~0.5-2s (concurrent processing with 5 workers)
  • Database insert: ~0.02s (binary embeddings with batch inserts)
  • If slower, check Ollama is running: ollama list

Conversation history not working ("what was my last prompt" fails):

  • Cause: Smaller models (roughly 7B parameters and below) often refuse to answer questions about conversation history, even when the data is present in their context
  • Solution 1 (Recommended): Switch to a larger model with better instruction following:
  ollama pull qwen2.5:14b
  :model qwen2.5:14b
  • Solution 2: Use models known for good instruction following:
    • qwen2.5:14b or qwen2.5:32b (excellent, 80-95% success rate)
    • llama3.1:8b (good, ~70% success rate)
    • mistral-nemo (good, ~70% success rate)
    • Avoid: qwen2.5:3b, qwen2.5:7b (~10-30% success rate)
  • Why this happens: Smaller models prioritize their safety training ("don't claim knowledge you don't have") over system instructions, causing them to deny access to conversation history even when it's available in their context window
  • Expected behavior with larger models:
  You: "what's the weather?"
  Bot: "It's sunny, 22°C"
  You: "what was my last prompt?"
  Bot: "Your last prompt was: what's the weather?"

Tools not appearing:

# Check tool is enabled
:tools --all

# Check DISABLED_TOOLS in .env

# Restart client
python client.py

License

MIT License

Related Servers