Dataproc MCP Server

An MCP server for managing Google Cloud Dataproc operations and big data workflows, with seamless integration for VS Code.

A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).

🚀 Quick Start

Recommended: Roo (VS Code) Integration

Add this to your Roo MCP settings:

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

With Custom Config File

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}

Alternative: Global Installation

# Install globally
npm install -g @dipseth/dataproc-mcp-server

# Start the server
dataproc-mcp-server

# Or run directly
npx @dipseth/dataproc-mcp-server@latest

5-Minute Setup

  1. Install the package:

    npm install -g @dipseth/dataproc-mcp-server@latest
    
  2. Run the setup:

    dataproc-mcp --setup
    
  3. Configure authentication:

    # Edit the generated config file
    nano config/server.json
    
  4. Start the server:

    dataproc-mcp
    

🌐 Claude.ai Web App Compatibility

✅ PRODUCTION-READY: Full Claude.ai Integration with HTTPS Tunneling & OAuth

The Dataproc MCP Server now provides complete Claude.ai web app compatibility with a working solution that includes all 22 MCP tools!

🚀 Working Solution (Tested & Verified)

Terminal 1 - Start MCP Server:

DATAPROC_CONFIG_PATH=config/github-oauth-server.json npm start -- --http --oauth --port 8080

Terminal 2 - Start Cloudflare Tunnel:

cloudflared tunnel --url https://localhost:8443 --origin-server-name localhost --no-tls-verify

Result: Claude.ai can see and use all tools successfully! 🎉

Key Features:

  • Complete Tool Access - All 22 MCP tools available in Claude.ai
  • HTTPS Tunneling - Cloudflare tunnel for secure external access
  • OAuth Authentication - GitHub OAuth for secure authentication
  • Trusted Certificates - No browser warnings or connection issues
  • WebSocket Support - Full WebSocket compatibility with Claude.ai
  • Production Ready - Tested and verified working solution

Quick Setup:

  1. Setup GitHub OAuth (5 minutes)
  2. Generate SSL certificates: npm run ssl:generate
  3. Start services (2 terminals as shown above)
  4. Connect Claude.ai to your tunnel URL

📖 Complete Guide: See docs/claude-ai-integration.md for detailed setup instructions, troubleshooting, and advanced features.

📖 Certificate Setup: See docs/trusted-certificates.md for SSL certificate configuration.

✨ Features

🎯 Core Capabilities

  • 22 Production-Ready MCP Tools - Complete Dataproc management suite
  • 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
  • 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
  • 🔄 Generic Type Conversion System - Automatic, type-safe data transformations
  • 60-80% Parameter Reduction - Intelligent default injection
  • Multi-Environment Support - Dev/staging/production configurations
  • Service Account Impersonation - Enterprise authentication
  • Real-time Job Monitoring - Comprehensive status tracking
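The parameter-reduction idea can be sketched as a simple merge of user arguments over environment defaults. This is a hypothetical illustration, not the server's actual implementation:

```typescript
// Hypothetical sketch of intelligent default injection -- not the server's
// actual implementation. User-supplied arguments are merged over
// environment-level defaults, so most tool calls only need the fields
// that differ from the configured environment.
function injectDefaults(
  userArgs: Record<string, unknown>,
  envDefaults: Record<string, unknown>
): Record<string, unknown> {
  // User-provided values always win; defaults fill the gaps.
  return { ...envDefaults, ...userArgs };
}

const defaults = { projectId: "my-project", region: "us-central1" };
const args = injectDefaults({ clusterName: "analytics-1" }, defaults);
// args → { projectId: "my-project", region: "us-central1", clusterName: "analytics-1" }
```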

🚀 Response Optimization

  • 96.2% Token Reduction - list_clusters: 7,651 → 292 tokens
  • Automatic Qdrant Storage - Full data preserved and searchable
  • Resource URI Access - dataproc://responses/clusters/list/abc123
  • Graceful Fallback - Works without Qdrant, falls back to full responses
  • 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
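The optimize-with-fallback behavior described above can be sketched as follows. Names and shapes here are illustrative, not the package's real API:

```typescript
// Hypothetical sketch of the optimize-with-fallback pattern: return a
// compact summary plus a resource URI when the vector store is available,
// or the full response when it is not.
interface OptimizedResponse {
  summary: string;
  resourceUri?: string;
  fullData?: unknown;
}

function optimizeResponse(
  fullData: { clusters: { name: string }[] },
  store: { save(data: unknown): string } | null
): OptimizedResponse {
  const summary =
    `${fullData.clusters.length} cluster(s): ` +
    fullData.clusters.map((c) => c.name).join(", ");
  if (store) {
    // Full payload preserved and searchable; only the summary is returned.
    const id = store.save(fullData);
    return { summary, resourceUri: `dataproc://responses/clusters/list/${id}` };
  }
  // Graceful fallback: no store configured, return the full response.
  return { summary, fullData };
}
```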

🔄 Generic Type Conversion System

  • 75% Code Reduction - Eliminates manual conversion logic across services
  • Type-Safe Transformations - Automatic field detection and mapping
  • Intelligent Compression - Field-level compression with configurable thresholds
  • 0.50ms Conversion Times - Lightning-fast processing with 100% compression ratios
  • Zero-Configuration - Works automatically with existing TypeScript types
  • Backward Compatible - Seamless integration with existing functionality
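The field-level compression idea can be sketched as threshold-based conversion. This is a hypothetical simplification; the real converter's logic is more sophisticated:

```typescript
// Hypothetical sketch of field-level conversion with threshold-based
// compression: string fields longer than a threshold are truncated and
// recorded, everything else passes through unchanged.
interface ConvertedPayload {
  fields: Record<string, unknown>;
  compressedFields: string[];
}

function convertPayload(
  source: Record<string, unknown>,
  threshold = 64
): ConvertedPayload {
  const fields: Record<string, unknown> = {};
  const compressedFields: string[] = [];
  for (const [key, value] of Object.entries(source)) {
    if (typeof value === "string" && value.length > threshold) {
      fields[key] = value.slice(0, threshold); // stand-in for real compression
      compressedFields.push(key);
    } else {
      fields[key] = value;
    }
  }
  return { fields, compressedFields };
}
```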

Enterprise Security

  • Input Validation - Zod schemas for all 22 tools
  • Rate Limiting - Configurable abuse prevention
  • Credential Management - Secure handling and rotation
  • Audit Logging - Comprehensive security event tracking
  • Threat Detection - Injection attack prevention

📊 Quality Assurance

  • 90%+ Test Coverage - Comprehensive test suite
  • Performance Monitoring - Configurable thresholds
  • Multi-Environment Testing - Cross-platform validation
  • Automated Quality Gates - CI/CD integration
  • Security Scanning - Vulnerability management

🚀 Developer Experience

  • 5-Minute Setup - Quick start guide
  • Interactive Documentation - HTML docs with examples
  • Comprehensive Examples - Multi-environment configs
  • Troubleshooting Guides - Common issues and solutions
  • IDE Integration - TypeScript support

🛠️ Complete MCP Tools Suite (22 Tools)

🔄 Enhanced with Generic Type Conversion: All tools now benefit from automatic, type-safe data transformations with intelligent compression and field mapping.

🚀 Cluster Management (8 Tools)

| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| `start_dataproc_cluster` | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
| `create_cluster_from_yaml` | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
| `create_cluster_from_profile` | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
| `list_clusters` | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
| `list_tracked_clusters` | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
| `get_cluster` | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
| `delete_cluster` | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
| `get_zeppelin_url` | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |
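On the wire, invoking one of these tools is a standard MCP `tools/call` JSON-RPC request. The argument name below is illustrative; with smart defaults enabled, fields like the project ID and region are injected automatically:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "start_dataproc_cluster",
    "arguments": {
      "clusterName": "analytics-dev-1"
    }
  }
}
```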

💼 Job Management (7 Tools)

| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| `submit_hive_query` | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
| `submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, local file staging |
| `cancel_dataproc_job` | Cancel running or pending jobs | ✅ Job ID only needed | Emergency cancellation, cost control |
| `get_job_status` | Get job execution status | ✅ Job ID only needed | Real-time monitoring |
| `get_job_results` | Get job outputs and results | ✅ Auto-pagination | Result formatting |
| `get_query_status` | Get Hive query status | ✅ Minimal params | Query tracking |
| `get_query_results` | Get Hive query results | ✅ Smart pagination | Enhanced async support |

📋 Configuration & Profiles (3 Tools)

| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| `list_profiles` | List available cluster profiles | ✅ Category filtering | 8 production profiles |
| `get_profile` | Get detailed profile configuration | ✅ Profile ID only | Template access |
| `query_cluster_data` | Query stored cluster data | ✅ Natural language | Semantic search |

📊 Analytics & Insights (4 Tools)

| Tool | Description | Smart Defaults | Key Features |
| --- | --- | --- | --- |
| `check_active_jobs` | Quick status of all active jobs | ✅ No params needed | Multi-project view |
| `get_cluster_insights` | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
| `get_job_analytics` | Job performance analytics | ✅ Success rates | Error patterns, metrics |
| `query_knowledge` | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |
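A natural-language knowledge-base query uses the same `tools/call` shape. The `query` argument name here is an assumption for illustration:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "query_knowledge",
    "arguments": {
      "query": "which jobs failed in the last day and why?"
    }
  }
}
```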

🎯 Key Capabilities

  • 🧠 Semantic Search: Natural language queries with Qdrant integration
  • ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
  • 📊 Response Optimization: 96% token reduction with full data preservation
  • 🔄 Async Support: Non-blocking job submission and monitoring
  • 🏷️ Profile System: 8 production-ready cluster templates
  • 📈 Analytics: Comprehensive insights and performance tracking

📋 Configuration

Project-Based Configuration

The server supports a project-based configuration format:

# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration
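For reference, the `cluster_config` block generally mirrors the shape of the Dataproc `ClusterConfig` API. A minimal sketch (field casing and values are illustrative; check the server's profile schema for the exact format):

```yaml
cluster_config:
  master_config:
    num_instances: 1
    machine_type_uri: n1-standard-4
  worker_config:
    num_instances: 2
    machine_type_uri: n1-standard-4
  software_config:
    image_version: 2.2-debian12
```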

Authentication Methods

  1. Service Account Impersonation (Recommended)
  2. Direct Service Account Key
  3. Application Default Credentials
  4. Hybrid Authentication with fallbacks

📚 Documentation

🔧 MCP Client Integration

Claude Desktop

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dataproc/mcp-server"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

Roo (VS Code)

{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": ["@dataproc/mcp-server"],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│  Dataproc MCP    │────│  Google Cloud   │
│  (Claude/Roo)   │    │     Server       │    │    Dataproc     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │
                       ┌──────┴──────┐
                       │   Features  │
                       ├─────────────┤
                       │ • Security  │
                       │ • Profiles  │
                       │ • Validation│
                       │ • Monitoring│
                       │ • Generic   │
                       │   Converter │
                       └─────────────┘

🔄 Generic Type Conversion System Architecture

┌─────────────────┐    ┌────────────────────┐    ┌─────────────────┐
│  Source Types   │────│ Generic Converter  │────│ Qdrant Payloads │
│ • ClusterData   │    │      System        │    │ • Compressed    │
│ • QueryResults  │    │                    │    │ • Type-Safe     │
│ • JobData       │    │ ┌────────────────┐ │    │ • Optimized     │
└─────────────────┘    │ │ Field Analyzer │ │    └─────────────────┘
                       │ │ Transformation │ │
                       │ │     Engine     │ │
                       │ │  Compression   │ │
                       │ │    Service     │ │
                       │ └────────────────┘ │
                       └────────────────────┘

🚦 Performance

Response Time Achievements

  • Schema Validation: ~2ms (target: <5ms) ✅
  • Parameter Injection: ~1ms (target: <2ms) ✅
  • Generic Type Conversion: ~0.50ms (target: <2ms) ✅
  • Credential Validation: ~25ms (target: <50ms) ✅
  • MCP Tool Call: ~50ms (target: <100ms) ✅

Throughput Achievements

  • Schema Validation: ~2000 ops/sec ✅
  • Parameter Injection: ~5000 ops/sec ✅
  • Generic Type Conversion: ~2000 ops/sec ✅
  • Credential Validation: ~200 ops/sec ✅
  • MCP Tool Call: ~100 ops/sec ✅

Compression Achievements

  • Field-Level Compression: Up to 100% compression ratios ✅
  • Memory Optimization: 30-60% reduction in memory usage ✅
  • Type Safety: Zero runtime type errors with automatic validation ✅

🧪 Testing

# Run all tests
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance

# Run with coverage
npm run test:coverage

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Start development server
npm run dev

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🏆 Acknowledgments


Made with ❤️ for the MCP and Google Cloud communities
