Crawl-MCP: Unofficial MCP Server for crawl4ai
⚠️ Important: This is an unofficial MCP server implementation for the excellent crawl4ai library.
Not affiliated with the original crawl4ai project.
A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library with advanced AI capabilities. Extract and analyze content from any source: web pages, PDFs, Office documents, YouTube videos, and more. Features intelligent summarization to dramatically reduce token usage while preserving key information.
Key Features
- Google Search Integration: 7 optimized search genres with official Google search operators
- Advanced Web Crawling: JavaScript support, deep site mapping, entity extraction
- Universal Content Extraction: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- AI-Powered Summarization: Smart token reduction (up to 88.5%) while preserving essential information
- YouTube Integration: Extract video transcripts and summaries without API keys
- Production Ready: 19 specialized tools with comprehensive error handling
Quick Start
Prerequisites (Required First)
- Python 3.11 or higher (FastMCP requires Python 3.11+)
Install system dependencies for Playwright:
Ubuntu 24.04 LTS (Manual Setup Required):
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
sudo playwright install-deps
Other Linux/macOS:
sudo bash scripts/prepare_for_uvx_playwright.sh
Windows (as Administrator):
scripts/prepare_for_uvx_playwright.ps1
Installation
UVX (Recommended - Easiest):
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
Docker (Production-Ready):
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
Docker Features:
- Multi-Browser Support: Chromium, Firefox, WebKit headless browsers
- Google Chrome: Additional Chrome Stable for compatibility
- Optimized Performance: Pre-configured browser flags for Docker
- Security: Non-root user execution
- Complete Dependencies: All required libraries included
Claude Desktop Setup
UVX Installation:
Add to your claude_desktop_config.json:
{
"mcpServers": {
"crawl-mcp": {
"transport": "stdio",
"command": "uvx",
"args": [
"--from",
"git+https://github.com/walksoda/crawl-mcp",
"crawl-mcp"
],
"env": {
"CRAWL4AI_LANG": "en"
}
}
}
}
Docker HTTP Mode:
{
"mcpServers": {
"crawl-mcp": {
"transport": "http",
"baseUrl": "http://localhost:8000"
}
}
}
For Japanese interface:
"env": {
"CRAWL4AI_LANG": "ja"
}
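For reference, this is the complete UVX server entry from above with the Japanese setting applied:
{
  "mcpServers": {
    "crawl-mcp": {
      "transport": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "git+https://github.com/walksoda/crawl-mcp",
        "crawl-mcp"
      ],
      "env": {
        "CRAWL4AI_LANG": "ja"
      }
    }
  }
}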
Documentation
| Topic | Description |
|---|---|
| Installation Guide | Complete installation instructions for all platforms |
| API Reference | Full tool documentation and usage examples |
| Configuration Examples | Platform-specific setup configurations |
| HTTP Integration | HTTP API access and integration methods |
| Advanced Usage | Power user techniques and workflows |
| Development Guide | Contributing and development setup |
Language-Specific Documentation
Tool Overview
Web Crawling (3)
- crawl_url - Extract web page content with JavaScript support
- deep_crawl_site - Crawl multiple pages from a site with configurable depth
- crawl_url_with_fallback - Crawl with fallback strategies for anti-bot sites
Data Extraction (3)
- intelligent_extract - Extract specific data from web pages using LLM
- extract_entities - Extract entities (emails, phones, etc.) from web pages
- extract_structured_data - Extract structured data using CSS selectors or LLM
YouTube (4)
- extract_youtube_transcript - Extract YouTube transcripts with timestamps
- batch_extract_youtube_transcripts - Extract transcripts from multiple YouTube videos (max 3)
- get_youtube_video_info - Get YouTube video metadata and transcript availability
- extract_youtube_comments - Extract YouTube video comments with pagination
Search (4)
- search_google - Search Google with genre filtering
- batch_search_google - Perform multiple Google searches (max 3)
- search_and_crawl - Search Google and crawl top results
- get_search_genres - Get available search genres
File Processing (3)
- process_file - Convert PDF, Word, Excel, PowerPoint, ZIP to markdown
- get_supported_file_formats - Get supported file formats and capabilities
- enhanced_process_large_content - Process large content with chunking and BM25 filtering
Batch Operations (2)
- batch_crawl - Crawl multiple URLs with fallback (max 3 URLs)
- multi_url_crawl - Multi-URL crawl with pattern-based config (max 5 URL patterns)
Persist Large Results to Disk (token-saver)
All information-gathering tools accept an optional output_path parameter that writes the full fetched content straight to disk and returns a slim metadata-only response. This lets an LLM fetch huge pages, long YouTube transcripts, or whole batches without blowing its context budget; read from the saved file only when needed.
How it works:
- Single-file tools (e.g. crawl_url, extract_youtube_transcript) write one .md (or .json for JSON-kind tools). Pass an absolute file path; the extension is auto-added if omitted. An existing regular file at that path is rejected unless overwrite=true.
- Batch tools (batch_crawl, multi_url_crawl, deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) expect an absolute directory path and write one .md per URL plus index.json. Any non-existent path is treated as a directory and created, including names containing dots such as /tmp/run.v1. If the path already exists as a regular file, the call is rejected. batch_crawl / multi_url_crawl keep their list return shape and embed an output_file key on each success item.
- Request-dict tools (search_google, batch_search_google, search_and_crawl, batch_extract_youtube_transcripts) read the persistence keys directly from their request dict.
- Common parameters: output_path (absolute; None or "" skips persistence), include_content_in_response (default false; when true, content is included in the response too, still subject to any content_limit / content_offset / max_content_per_page slicing), overwrite (default false).
- Writes are atomic per file (temp file + os.replace); parent directories are auto-created; the full unsliced payload is persisted before any slicing or tool-internal truncation, so the on-disk copy is always complete even when the response is sliced.
- Batch dict tools (deep_crawl_site, search_and_crawl, batch_extract_youtube_transcripts) skip per-item persistence for items that report success=false; these still appear in index.json with file: null so callers can reason about the attempt list.
Markdown single-file example:
{
"tool": "crawl_url",
"arguments": {
"url": "https://example.com/long-article",
"output_path": "/tmp/crawl_out/article.md"
}
}
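A variant of the same call using the documented overwrite and include_content_in_response parameters, for re-running against an existing file while also returning the content in the response:
{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/long-article",
    "output_path": "/tmp/crawl_out/article.md",
    "overwrite": true,
    "include_content_in_response": true
  }
}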
JSON structured extraction (extension auto-added):
{
"tool": "extract_structured_data",
"arguments": {
"url": "https://example.com/products",
"extraction_type": "css",
"css_selectors": {"price": ".price", "name": "h1"},
"output_path": "/tmp/crawl_out/products"
}
}
Batch directory mode:
{
"tool": "batch_crawl",
"arguments": {
"urls": ["https://a.example", "https://b.example"],
"output_path": "/tmp/crawl_out/batch_run1"
}
}
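For illustration only, the index.json for that run might look roughly like this. The one .md per URL, the index.json itself, and file: null for failed items come from the description above; the surrounding key names and file naming are assumptions, not the documented schema:
{
  "items": [
    {"url": "https://a.example", "success": true, "file": "/tmp/crawl_out/batch_run1/a.example.md"},
    {"url": "https://b.example", "success": false, "file": null}
  ]
}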
Each persisted markdown file begins with a YAML frontmatter block containing url, title, fetched_at, and source_tool so the artifact is self-describing.
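For example, a persisted file might begin like this (the field names are the documented ones; the values and timestamp format are placeholders):
---
url: https://example.com/long-article
title: Long Article
fetched_at: 2025-01-01T12:34:56Z
source_tool: crawl_url
---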
Common Use Cases
Content Research (see the sketch after this list):
search_and_crawl → extract_structured_data → analysis
Documentation Mining:
deep_crawl_site → batch processing → extraction
Media Analysis:
extract_youtube_transcript → summarization workflow
Site Mapping:
batch_crawl → multi_url_crawl → comprehensive data
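As a sketch of the Content Research flow, a first call might look like the following; the query argument name is an assumption (check the API Reference), while output_path as a directory follows the persistence rules above. Its crawled results can then feed extract_structured_data as shown earlier:
{
  "tool": "search_and_crawl",
  "arguments": {
    "query": "open source web crawlers comparison",
    "output_path": "/tmp/crawl_out/research_run"
  }
}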
Quick Troubleshooting
Installation Issues:
- Re-run setup scripts with proper privileges
- Try development installation method
- Check browser dependencies are installed
Performance Issues:
- Use wait_for_js: true for JavaScript-heavy sites (see the sketch below)
- Increase the timeout for slow-loading pages
- Use extract_structured_data for targeted extraction
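A sketch combining those tips for a JavaScript-heavy page; wait_for_js is the option named above, while the timeout parameter name and units are assumptions:
{
  "tool": "crawl_url",
  "arguments": {
    "url": "https://example.com/spa-dashboard",
    "wait_for_js": true,
    "timeout": 60
  }
}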
Configuration Issues:
- Check JSON syntax in claude_desktop_config.json
- Verify file paths are absolute
- Restart Claude Desktop after configuration changes
Project Structure
- Original Library: crawl4ai by unclecode
- MCP Wrapper: This repository (walksoda)
- Implementation: Unofficial third-party integration
License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original crawl4ai license for the underlying functionality.
Contributing
See our Development Guide for contribution guidelines and development setup instructions.
Related Projects
- crawl4ai - The underlying web crawling library
- Model Context Protocol - The standard this server implements
- Claude Desktop - Primary client for MCP servers