Crawl4AI MCP Server
An MCP server for advanced web crawling, content extraction, and AI-powered analysis using the crawl4ai library.
Crawl-MCP: Unofficial MCP Server for crawl4ai
Important: This is an unofficial MCP server implementation for the crawl4ai library.
It is not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
Key Features
- Google Search Integration: 7 optimized search genres using official Google search operators
- Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
- Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- AI-Powered Summarization: Token reduction of up to 88.5% while preserving essential information
- YouTube Integration: Extract video transcripts and summaries without API keys
- Production Ready: 17 specialized tools with comprehensive error handling
Quick Start
Prerequisites (Required First)
- Python 3.11 or later (FastMCP requires Python 3.11+)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (Manual Required):
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps
Other Linux/macOS:
sudo bash scripts/prepare_for_uvx_playwright.sh
Windows (as Administrator):
scripts/prepare_for_uvx_playwright.ps1
Installation
UVX (Recommended - Easiest):
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
Docker (Production-Ready):
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
Docker Features:
- Multi-Browser Support: Chromium, Firefox, and WebKit headless browsers
- Google Chrome: Additional Chrome Stable for compatibility
- Optimized Performance: Pre-configured browser flags for Docker
- Security: Non-root user execution
- Complete Dependencies: All required libraries included
Claude Desktop Setup
UVX Installation:
Add to your claude_desktop_config.json:
{
"mcpServers": {
"crawl-mcp": {
"transport": "stdio",
"command": "uvx",
"args": [
"--from",
"git+https://github.com/walksoda/crawl-mcp",
"crawl-mcp"
],
"env": {
"CRAWL4AI_LANG": "en"
}
}
}
}
Docker HTTP Mode:
{
"mcpServers": {
"crawl-mcp": {
"transport": "http",
"baseUrl": "http://localhost:8000"
}
}
}
For Japanese interface:
"env": {
"CRAWL4AI_LANG": "ja"
}
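The English and Japanese variants above differ only in the CRAWL4AI_LANG value, so the entry can be generated instead of edited by hand. The following is a minimal Python sketch, assuming the config file is plain JSON shaped like the examples above. The helper names build_server_entry and merge_into_config are hypothetical and are not part of crawl-mcp.

```python
import json


def build_server_entry(lang: str = "en") -> dict:
    """Return the UVX-based server entry shown above for the given language."""
    return {
        "transport": "stdio",
        "command": "uvx",
        "args": ["--from", "git+https://github.com/walksoda/crawl-mcp", "crawl-mcp"],
        "env": {"CRAWL4AI_LANG": lang},
    }


def merge_into_config(config: dict, lang: str = "en") -> dict:
    """Return a copy of config with the crawl-mcp entry added or replaced.

    Other servers already listed under "mcpServers" are left untouched.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["crawl-mcp"] = build_server_entry(lang)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Print a config with the Japanese interface enabled.
    print(json.dumps(merge_into_config({}, lang="ja"), indent=2))
```

The printed JSON can be pasted into claude_desktop_config.json. Working on a copy keeps any existing server entries intact.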
Documentation
| Topic | Description |
|---|---|
| Installation Guide | Complete installation instructions for all platforms |
| API Reference | Full tool documentation and usage examples |
| Configuration Examples | Platform-specific setup configurations |
| HTTP Integration | HTTP API access and integration methods |
| Advanced Usage | Power user techniques and workflows |
| Development Guide | Contributing and development setup |
Language-Specific Documentation
Tool Overview
Web Crawling
- crawl_url - Single-page crawling with JavaScript support
- deep_crawl_site - Multi-page site mapping and exploration
- crawl_url_with_fallback - Robust crawling with retry strategies
- batch_crawl - Process multiple URLs simultaneously
AI-Powered Analysis
- intelligent_extract - Semantic content extraction with custom instructions
- auto_summarize - LLM-based summarization for large content
- extract_entities - Pattern-based entity extraction (emails, phones, URLs, etc.)
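To illustrate what "pattern-based" entity extraction means in general, here is a small Python sketch built on regular expressions. It is not the extract_entities implementation. The function name find_entities and the three patterns are assumptions chosen for illustration, and the patterns are deliberately simple.

```python
import re

# Illustrative patterns only; production-grade matching needs stricter rules.
PATTERNS = {
    "emails": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # The lookbehind drops trailing sentence punctuation from a matched URL.
    "urls": re.compile(r"https?://[^\s<>]+(?<![.,;:!?)])"),
    "phones": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_entities(text: str) -> dict[str, list[str]]:
    """Return unique matches per entity type, in order of first appearance."""
    found = {}
    for name, pattern in PATTERNS.items():
        # dict.fromkeys removes duplicates while preserving order.
        found[name] = list(dict.fromkeys(pattern.findall(text)))
    return found


if __name__ == "__main__":
    sample = "Contact sales@example.com or visit https://example.com/pricing."
    print(find_entities(sample))
```

Regex matching of this kind is fast and needs no LLM call, which is why it suits well-structured entities such as emails and URLs. Free-form content is better served by instruction-driven tools such as intelligent_extract.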
Media Processing
- process_file - Convert PDFs, Office docs, and ZIP archives to markdown
- extract_youtube_transcript - Multi-language transcript extraction
- batch_extract_youtube_transcripts - Process multiple videos
Search Integration
- search_google - Genre-filtered Google search with metadata
- search_and_crawl - Combined search and content extraction
- batch_search_google - Multiple search queries with analysis
Common Use Cases
Content Research:
search_and_crawl → intelligent_extract → structured analysis
Documentation Mining:
deep_crawl_site → batch processing → comprehensive extraction
Media Analysis:
extract_youtube_transcript → auto_summarize → insight generation
Competitive Intelligence:
batch_crawl → extract_entities → comparative analysis
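Each workflow above feeds one tool's output into the next. The Python sketch below shows that chaining pattern with an injected call_tool function standing in for whichever MCP client you use. The helper run_chain is hypothetical, and every argument name ("query", "content", "instruction") is an assumption that this README does not document, so check the API Reference for the real parameters.

```python
from typing import Any, Callable

# Stand-in type for an MCP client call: (tool_name, arguments) -> result.
ToolCaller = Callable[[str, dict[str, Any]], Any]
Step = tuple[str, Callable[[Any], dict[str, Any]]]


def run_chain(call_tool: ToolCaller, steps: list[Step], seed: Any) -> Any:
    """Run tools in order; each step builds its arguments from the previous result."""
    result = seed
    for tool_name, make_args in steps:
        result = call_tool(tool_name, make_args(result))
    return result


# "Content Research" workflow. All argument names here are hypothetical.
content_research: list[Step] = [
    ("search_and_crawl", lambda query: {"query": query}),
    ("intelligent_extract",
     lambda pages: {"content": pages, "instruction": "List the key claims"}),
]


if __name__ == "__main__":
    # Fake caller that only records what would be sent, so the sketch runs offline.
    def fake_call(name: str, args: dict[str, Any]) -> str:
        print(f"{name} <- {sorted(args)}")
        return f"<{name} output>"

    print(run_chain(fake_call, content_research, "web crawling libraries"))
```

Injecting call_tool keeps the chain independent of any particular client library and makes it easy to dry-run a workflow before pointing it at a live server.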
Quick Troubleshooting
Installation Issues:
- Run system diagnostics with the get_system_diagnostics tool
- Re-run the setup scripts with proper privileges
- Try the development installation method
Performance Issues:
- Use wait_for_js: true for JavaScript-heavy sites
- Increase the timeout for slow-loading pages
- Enable auto_summarize for large content
Configuration Issues:
- Check the JSON syntax in claude_desktop_config.json
- Verify that file paths are absolute
- Restart Claude Desktop after configuration changes
Project Structure
- Original Library: crawl4ai by unclecode
- MCP Wrapper: This repository (walksoda)
- Implementation: Unofficial third-party integration
License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
Contributing
See our Development Guide for contribution guidelines and development setup instructions.
Related Projects
- crawl4ai - The underlying web crawling library
- Model Context Protocol - The standard this server implements
- Claude Desktop - Primary client for MCP servers