A comprehensive Model Context Protocol (MCP) server that wraps the powerful crawl4ai library, providing advanced web crawling, content extraction, and AI-powered analysis through the standardized MCP interface.
This feature set enables comprehensive handling of JavaScript-heavy websites:

- `wait_for_selector` support for waiting on specific elements

**Recommended settings for JavaScript-heavy sites:**

```json
{
  "wait_for_js": true,
  "simulate_user": true,
  "timeout": 30-60,
  "generate_markdown": true
}
```
**Linux/macOS:**

```bash
./setup.sh
```

**Windows:**

```bat
setup_windows.bat
```

**Manual setup:**

```bash
python3 -m venv venv
source venv/bin/activate      # Linux/macOS
# or
venv\Scripts\activate.bat     # Windows
pip install -r requirements.txt
```

**Playwright system dependencies (Linux):**

```bash
sudo apt-get update
sudo apt-get install libnss3 libnspr4 libasound2 libatk-bridge2.0-0 libdrm2 libgtk-3-0 libgbm1
```
**STDIO transport (default):**

```bash
python -m crawl4ai_mcp.server
```

**HTTP transport:**

```bash
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
```
You can register this MCP server with the Claude Code CLI. The following methods are available.

**Option 1: Project configuration file.** Create a `.mcp.json` file in your project directory:

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/home/user/prj/crawl/venv/bin/python",
      "args": ["-m", "crawl4ai_mcp.server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "DEBUG"
      }
    }
  }
}
```
Then run `claude mcp` or start Claude Code from the project directory.

**Option 2: The `claude mcp add` command:**

```bash
# Register the MCP server with the claude command
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
  --cwd /path/to/your/crawl4ai-mcp-project

# With environment variables
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
  --cwd /path/to/your/crawl4ai-mcp-project \
  -e FASTMCP_LOG_LEVEL=DEBUG

# With project scope (shared with team)
claude mcp add crawl4ai "/path/to/your/venv/bin/python -m crawl4ai_mcp.server" \
  --cwd /path/to/your/crawl4ai-mcp-project \
  --scope project
```
**Option 3: HTTP transport registration:**

```bash
# First start the HTTP server
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000

# Then register the HTTP endpoint
claude mcp add crawl4ai-http --transport http --url http://127.0.0.1:8000/mcp

# Or with Pure StreamableHTTP (recommended)
./scripts/start_pure_http_server.sh
claude mcp add crawl4ai-pure-http --transport http --url http://127.0.0.1:8000/mcp
```
**Managing the registration:**

```bash
# List registered MCP servers
claude mcp list

# Test the connection
claude mcp test crawl4ai

# Remove if needed
claude mcp remove crawl4ai

# Add with environment variables for LLM functionality
claude mcp add crawl4ai "python -m crawl4ai_mcp.server" \
  --cwd /path/to/your/crawl4ai-mcp-project \
  -e OPENAI_API_KEY=your_openai_key \
  -e ANTHROPIC_API_KEY=your_anthropic_key
```
1. **Start the server** by running the startup script:

   ```bash
   ./scripts/start_pure_http_server.sh
   ```

2. **Apply the configuration** using one of these methods: copy `configs/claude_desktop_config_pure_http.json` to Claude Desktop's config directory, or add the following to your existing config:

   ```json
   {
     "mcpServers": {
       "crawl4ai-pure-http": {
         "url": "http://127.0.0.1:8000/mcp"
       }
     }
   }
   ```

3. **Restart Claude Desktop** to apply the settings.

4. **Start using the tools**: the crawl4ai tools are now available in chat.
Copy the configuration:

```bash
cp configs/claude_desktop_config.json ~/.config/claude-desktop/claude_desktop_config.json
```

Restart Claude Desktop to enable the crawl4ai tools.

**Configuration file locations:**

- **Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
- **macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Linux:** `~/.config/claude-desktop/claude_desktop_config.json`
This MCP server supports multiple HTTP protocols, allowing you to choose the optimal implementation for your use case.
**Pure StreamableHTTP (recommended):** a plain JSON HTTP protocol without Server-Sent Events (SSE).
**Server startup:**

```bash
# Method 1: Using the startup script
./scripts/start_pure_http_server.sh

# Method 2: Direct startup
python examples/simple_pure_http_server.py --host 127.0.0.1 --port 8000

# Method 3: Background startup
nohup python examples/simple_pure_http_server.py --port 8000 > server.log 2>&1 &
```
**Client configuration (Claude Desktop):**

```json
{
  "mcpServers": {
    "crawl4ai-pure-http": {
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}
```

In short: start the server with `./scripts/start_pure_http_server.sh` and apply `configs/claude_desktop_config_pure_http.json`.
**Verification:**

```bash
# Health check
curl http://127.0.0.1:8000/health

# Complete test
python examples/pure_http_test.py
```
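For scripted checks, the health endpoint can also be queried from Python. This is a minimal sketch using the `requests` library, assuming the Pure StreamableHTTP server is running on port 8000 as described above:

```python
import requests

# Query the documented health endpoint of the Pure StreamableHTTP server
resp = requests.get("http://127.0.0.1:8000/health", timeout=10)
print(resp.status_code, resp.text)
```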
**Legacy HTTP (SSE):** the traditional FastMCP StreamableHTTP protocol (with SSE).
**Server startup:**

```bash
# Method 1: Command line
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8001

# Method 2: Environment variables
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8001
python -m crawl4ai_mcp.server
```
**Client configuration (Claude Desktop):**

```json
{
  "mcpServers": {
    "crawl4ai-legacy-http": {
      "url": "http://127.0.0.1:8001/mcp"
    }
  }
}
```
| Feature | Pure StreamableHTTP | Legacy HTTP (SSE) | STDIO |
|---|---|---|---|
| Response Format | Plain JSON | Server-Sent Events | Binary |
| Configuration Complexity | Low (URL only) | Low (URL only) | High (process management) |
| Debug Ease | High (curl compatible) | Medium (SSE parser needed) | Low |
| Independence | High | High | Low |
| Performance | High | Medium | High |
**Pure StreamableHTTP usage (curl):**

```bash
# Initialize
SESSION_ID=$(curl -s -X POST http://127.0.0.1:8000/mcp/initialize \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"init","method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}' \
  -D- | grep -i mcp-session-id | cut -d' ' -f2 | tr -d '\r')

# Execute tool
curl -X POST http://127.0.0.1:8000/mcp \
  -H "Content-Type: application/json" \
  -H "mcp-session-id: $SESSION_ID" \
  -d '{"jsonrpc":"2.0","id":"crawl","method":"tools/call","params":{"name":"crawl_url","arguments":{"url":"https://example.com"}}}'
```
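The same two-step flow can be scripted from Python. This is a minimal sketch (not part of this repository), assuming the Pure StreamableHTTP server is running on port 8000 and returns the session ID in the `mcp-session-id` response header as shown above:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Step 1: initialize a session and read the session ID from the response header
init_payload = {
    "jsonrpc": "2.0",
    "id": "init",
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "test", "version": "1.0.0"},
    },
}
resp = requests.post(f"{BASE}/mcp/initialize", json=init_payload)
resp.raise_for_status()
session_id = resp.headers["mcp-session-id"]

# Step 2: call a tool within the established session
call_payload = {
    "jsonrpc": "2.0",
    "id": "crawl",
    "method": "tools/call",
    "params": {"name": "crawl_url", "arguments": {"url": "https://example.com"}},
}
result = requests.post(
    f"{BASE}/mcp",
    json=call_payload,
    headers={"mcp-session-id": session_id},
)
print(result.json())
```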
**Legacy HTTP usage (curl):**

```bash
curl -X POST "http://127.0.0.1:8001/tools/crawl_url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generate_markdown": true}'
```
**Method 1: Command Line**

```bash
python -m crawl4ai_mcp.server --transport http --host 127.0.0.1 --port 8000
```

**Method 2: Environment Variables**

```bash
export MCP_TRANSPORT=http
export MCP_HOST=127.0.0.1
export MCP_PORT=8000
python -m crawl4ai_mcp.server
```

**Method 3: Docker (if available)**

```bash
docker run -p 8000:8000 crawl4ai-mcp --transport http --port 8000
```
Once running, the HTTP API provides:

- Base URL: `http://127.0.0.1:8000`
- API documentation: `http://127.0.0.1:8000/docs`
- Tool endpoints: `http://127.0.0.1:8000/tools/{tool_name}`
- Resource endpoints: `http://127.0.0.1:8000/resources/{resource_uri}`
All MCP tools (crawl_url, intelligent_extract, process_file, etc.) are accessible via HTTP POST requests with JSON payloads matching the tool parameters.
```bash
curl -X POST "http://127.0.0.1:8000/tools/crawl_url" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "generate_markdown": true}'
```
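The same call can be made from Python. This is a minimal sketch with the `requests` library, assuming the HTTP API server from this section is running locally:

```python
import requests

# POST the tool arguments as JSON, mirroring the curl example above
response = requests.post(
    "http://127.0.0.1:8000/tools/crawl_url",
    json={"url": "https://example.com", "generate_markdown": True},
    timeout=120,  # crawling can take a while
)
response.raise_for_status()
print(response.json())
```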
For detailed HTTP API documentation, examples, and integration guides, see the HTTP API Guide.
| Use Case | Recommended Tool | Key Features |
|---|---|---|
| Single webpage | `crawl_url` | Basic crawling, JS support |
| Multiple pages (up to 5) | `deep_crawl_site` | Site mapping, link following |
| Search + crawling | `search_and_crawl` | Google search + auto-crawl |
| Difficult sites | `crawl_url_with_fallback` | Multiple retry strategies |
| Extract specific data | `intelligent_extract` | AI-powered extraction |
| Find patterns | `extract_entities` | Emails, phones, URLs, etc. |
| Structured data | `extract_structured_data` | CSS/XPath/LLM schemas |
| File processing | `process_file` | PDF, Office, ZIP conversion |
| YouTube content | `extract_youtube_transcript` | Subtitle extraction |
**Timeout settings:** for multi-page crawls, budget roughly pages × base_timeout; a generous value is recommended.

**For JavaScript-heavy sites:**

- Set `wait_for_js: true`
- Use `simulate_user: true` for better compatibility
- Use `wait_for_selector` for specific elements

**For AI features:**

- Check `get_llm_config_info` to verify the LLM configuration
- Enable `auto_summarize: true` for large content
- Use `intelligent_extract` for semantic understanding

**`crawl_url`**
Advanced web crawling with deep crawling support, intelligent filtering, and automatic summarization for large content.
**Key Parameters:**

- `url`: Target URL to crawl
- `max_depth`: Maximum crawling depth (None for single page)
- `crawl_strategy`: Strategy type ('bfs', 'dfs', 'best_first')
- `content_filter`: Filter type ('bm25', 'pruning', 'llm')
- `chunk_content`: Enable content chunking for large documents
- `execute_js`: Custom JavaScript code execution
- `user_agent`: Custom user agent string
- `headers`: Custom HTTP headers
- `cookies`: Authentication cookies
- `auto_summarize`: Automatically summarize large content using an LLM
- `max_content_tokens`: Maximum tokens before triggering auto-summarization (default: 15000)
- `summary_length`: Summary length setting ('short', 'medium', 'long')
- `llm_provider`: LLM provider for summarization (auto-detected if not specified)
- `llm_model`: Specific LLM model for summarization (auto-detected if not specified)
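To illustrate how these parameters combine, here is a hedged example of a `crawl_url` request with deep crawling and auto-summarization enabled, sent through the HTTP tool endpoint from the HTTP API section above. The parameter names are taken from this document; the values are only suggestions:

```python
import requests

# Illustrative crawl_url arguments; adjust the values for your target site.
arguments = {
    "url": "https://docs.example.com",
    "max_depth": 1,                  # crawl the page plus directly linked pages
    "crawl_strategy": "bfs",
    "content_filter": "pruning",
    "auto_summarize": True,          # summarize when content exceeds max_content_tokens
    "max_content_tokens": 15000,
    "summary_length": "medium",
    "generate_markdown": True,
}

response = requests.post(
    "http://127.0.0.1:8000/tools/crawl_url",
    json=arguments,
    timeout=300,
)
print(response.json())
```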
**`deep_crawl_site`**

Dedicated tool for comprehensive site mapping and recursive crawling.
**Parameters:**

- `url`: Starting URL
- `max_depth`: Maximum crawling depth (recommended: 1-3)
- `max_pages`: Maximum number of pages to crawl
- `crawl_strategy`: Crawling strategy ('bfs', 'dfs', 'best_first')
- `url_pattern`: URL filter pattern (e.g., 'docs', 'blog')
- `score_threshold`: Minimum relevance score (0.0-1.0)

**`intelligent_extract`**
AI-powered content extraction with advanced filtering and analysis.
**Parameters:**

- `url`: Target URL
- `extraction_goal`: Description of the extraction target
- `content_filter`: Filter type for content quality
- `use_llm`: Enable LLM-based intelligent extraction
- `llm_provider`: LLM provider (openai, claude, etc.)
- `custom_instructions`: Detailed extraction instructions

**`extract_entities`**
High-speed entity extraction using regex patterns.
**Built-in Entity Types:**

- `emails`: Email addresses
- `phones`: Phone numbers
- `urls`: URLs and links
- `dates`: Date formats
- `ips`: IP addresses
- `social_media`: Social media handles (@username, #hashtag)
- `prices`: Price information
- `credit_cards`: Credit card numbers
- `coordinates`: Geographic coordinates

**`extract_structured_data`**
Traditional structured data extraction using CSS/XPath selectors or LLM schemas.
**`batch_crawl`**
Parallel processing of multiple URLs with unified reporting.
**`crawl_url_with_fallback`**
Robust crawling with multiple fallback strategies for maximum reliability.
**`process_file`**

File Processing: Convert various file formats to Markdown using Microsoft MarkItDown.
**Parameters:**

- `url`: File URL (PDF, Office, ZIP, etc.)
- `max_size_mb`: Maximum file size limit (default: 100MB)
- `extract_all_from_zip`: Extract all files from ZIP archives
- `include_metadata`: Include file metadata in the response

**Supported Formats:** PDF, Office documents, ZIP archives, and more (see `get_supported_file_formats`).
**`get_supported_file_formats`**

Format Information: Get a comprehensive list of supported file formats and their capabilities.
**`extract_youtube_transcript`**

YouTube Processing: Extract transcripts from YouTube videos with language preferences and translation using youtube-transcript-api v1.1.0+. Stable and reliable - no authentication required!
**Parameters:**

- `url`: YouTube video URL
- `languages`: Preferred languages in order of preference (default: ["ja", "en"])
- `translate_to`: Target language for translation (optional)
- `include_timestamps`: Include timestamps in the transcript
- `preserve_formatting`: Preserve original formatting
- `include_metadata`: Include video metadata

**`batch_extract_youtube_transcripts`**

Batch YouTube Processing: Extract transcripts from multiple YouTube videos in parallel. Enhanced performance with controlled concurrency for stable batch processing.
**Parameters:**

- `urls`: List of YouTube video URLs
- `languages`: Preferred languages list
- `translate_to`: Target language for translation (optional)
- `include_timestamps`: Include timestamps in the transcript
- `max_concurrent`: Maximum concurrent requests (1-5, default: 3)

**`get_youtube_video_info`**

YouTube Info: Get available transcript information for a YouTube video without extracting the full transcript.
**Parameters:**

- `video_url`: YouTube video URL

**Returns:** available transcript languages and related metadata for the video.
**`search_google`**

Google Search: Perform a Google search with genre filtering and metadata extraction.
**Parameters:**

- `query`: Search query string
- `num_results`: Number of results to return (1-100, default: 10)
- `language`: Search language (default: "en")
- `region`: Search region (default: "us")
- `search_genre`: Content genre filter (optional)
- `safe_search`: Safe search enabled (always True for security)

**Features:** genre filtering and metadata (title/snippet) extraction for each result.
**`batch_search_google`**

Batch Google Search: Perform multiple Google searches with comprehensive analysis.
**Parameters:**

- `queries`: List of search queries
- `num_results_per_query`: Results per query (1-100, default: 10)
- `max_concurrent`: Maximum concurrent searches (1-5, default: 3)
- `language`: Search language (default: "en")
- `region`: Search region (default: "us")
- `search_genre`: Content genre filter (optional)

**Returns:** per-query search results with comprehensive cross-query analysis.
**`search_and_crawl`**

Integrated Search+Crawl: Perform a Google search and automatically crawl the top results.
**Parameters:**

- `search_query`: Google search query
- `num_search_results`: Number of search results (1-20, default: 5)
- `crawl_top_results`: Number of top results to crawl (1-10, default: 3)
- `extract_media`: Extract media from crawled pages
- `generate_markdown`: Generate markdown content
- `search_genre`: Content genre filter (optional)

**Returns:** search results together with the full crawled content of the top results.
**`get_search_genres`**

Search Genres: Get a comprehensive list of available search genres and their descriptions.
**Returns:** a list of available search genres and their descriptions.

**Resources:**

- `uri://crawl4ai/config`: Default crawler configuration options
- `uri://crawl4ai/examples`: Usage examples and sample requests

**Prompts:**

- `crawl_website_prompt`: Guided website crawling workflows
- `analyze_crawl_results_prompt`: Crawl result analysis
- `batch_crawl_setup_prompt`: Batch crawling setup
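Because the server follows the standard MCP protocol, any MCP-compliant client can enumerate these tools, resources, and prompts. Below is a minimal sketch using the official MCP Python SDK (the `mcp` package, assumed to be installed separately; it is not one of this project's listed dependencies) over the default STDIO transport:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the server as a subprocess over STDIO (the default transport)
    params = StdioServerParameters(command="python", args=["-m", "crawl4ai_mcp.server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            tools = await session.list_tools()
            print("tools:", [tool.name for tool in tools.tools])

            resources = await session.list_resources()
            print("resources:", [str(res.uri) for res in resources.resources])

            prompts = await session.list_prompts()
            print("prompts:", [prompt.name for prompt in prompts.prompts])


asyncio.run(main())
```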
**Usage Examples**

Basic Google search (`search_google`):

```json
{
"query": "python machine learning tutorial",
"num_results": 10,
"language": "en",
"region": "us"
}
```
Genre-filtered search:

```json
{
"query": "machine learning research",
"num_results": 15,
"search_genre": "academic",
"language": "en"
}
```
Batch search (`batch_search_google`):

```json
{
"queries": [
"python programming tutorial",
"web development guide",
"data science introduction"
],
"num_results_per_query": 5,
"max_concurrent": 3,
"search_genre": "education"
}
```
Integrated search and crawl (`search_and_crawl`):

```json
{
"search_query": "python official documentation",
"num_search_results": 10,
"crawl_top_results": 5,
"extract_media": false,
"generate_markdown": true,
"search_genre": "documentation"
}
```
Deep crawling example:

```json
{
"url": "https://docs.example.com",
"max_depth": 2,
"max_pages": 20,
"crawl_strategy": "bfs"
}
```
AI-powered extraction (`intelligent_extract`):

```json
{
"url": "https://news.example.com",
"extraction_goal": "article summary and key points",
"content_filter": "llm",
"use_llm": true,
"custom_instructions": "Extract main article content, summarize key points, and identify important quotes"
}
```
PDF processing (`process_file`):

```json
{
"url": "https://example.com/document.pdf",
"max_size_mb": 50,
"include_metadata": true
}
```
Office document processing:

```json
{
"url": "https://example.com/report.docx",
"max_size_mb": 25,
"include_metadata": true
}
```
ZIP archive processing:

```json
{
"url": "https://example.com/documents.zip",
"max_size_mb": 100,
"extract_all_from_zip": true,
"include_metadata": true
}
```
The `crawl_url` tool automatically detects file formats and routes them to the appropriate processing:
```json
{
"url": "https://example.com/mixed-content.pdf",
"generate_markdown": true
}
```

Stable youtube-transcript-api v1.1.0+ integration - no setup required!
Transcript extraction (`extract_youtube_transcript`):

```json
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"languages": ["ja", "en"],
"include_timestamps": true,
"include_metadata": true
}
```
Transcript with translation:

```json
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"languages": ["en"],
"translate_to": "ja",
"include_timestamps": false
}
```
Batch transcript extraction (`batch_extract_youtube_transcripts`):

```json
{
"urls": [
"https://www.youtube.com/watch?v=VIDEO_ID1",
"https://www.youtube.com/watch?v=VIDEO_ID2",
"https://youtu.be/VIDEO_ID3"
],
"languages": ["ja", "en"],
"max_concurrent": 3
}
```
The `crawl_url` tool automatically detects YouTube URLs and extracts transcripts:
```json
{
"url": "https://www.youtube.com/watch?v=VIDEO_ID",
"generate_markdown": true
}
```
Video transcript info (`get_youtube_video_info`):

```json
{
"video_url": "https://www.youtube.com/watch?v=VIDEO_ID"
}
```
Entity extraction (`extract_entities`):

```json
{
"url": "https://company.com/contact",
"entity_types": ["emails", "phones", "social_media"],
"include_context": true,
"deduplicate": true
}
```
Authenticated crawling:

```json
{
"url": "https://private.example.com",
"auth_token": "Bearer your-token",
"cookies": {"session_id": "abc123"},
"headers": {"X-API-Key": "your-key"}
}
```
```
crawl4ai_mcp/
├── __init__.py                  # Package initialization
├── server.py                    # Main MCP server (1,184+ lines)
├── strategies.py                # Additional extraction strategies
└── suppress_output.py           # Output suppression utilities
config/
├── claude_desktop_config_windows.json  # Claude Desktop config (Windows)
├── claude_desktop_config_script.json   # Script-based config
└── claude_desktop_config.json          # Basic config
docs/
├── README_ja.md                 # Japanese documentation
├── setup_instructions_ja.md     # Detailed setup guide
└── troubleshooting_ja.md        # Troubleshooting guide
scripts/
├── setup.sh                     # Linux/macOS setup
├── setup_windows.bat            # Windows setup
└── run_server.sh                # Server startup script
```
**ModuleNotFoundError:**

```bash
pip install -r requirements.txt
```

**Playwright browser errors:**

```bash
sudo apt-get install libnss3 libnspr4 libasound2
```

**JSON parsing errors:** covered in the troubleshooting guide.

For detailed troubleshooting, see `docs/troubleshooting_ja.md`.
**Common Workflows:**

1. Competitive Analysis: `search_and_crawl` → `intelligent_extract`
2. Site Auditing: `crawl_url` → `extract_entities`
3. Content Research: `search_google` → `batch_crawl`
4. Deep Analysis: `deep_crawl_site` → structured extraction

**Tips:**

- Use `crawl_url_with_fallback` for difficult sites
- Use `simulate_user: true` and `wait_for_js` for dynamic content
- Use `deep_crawl_site` with URL patterns

**Dependencies:**

- `crawl4ai>=0.3.0` - Advanced web crawling library
- `fastmcp>=0.1.0` - MCP server framework
- `pydantic>=2.0.0` - Data validation and serialization
- `markitdown>=0.0.1a2` - File processing and conversion (Microsoft)
- `googlesearch-python>=1.3.0` - Google search functionality
- `aiohttp>=3.8.0` - Asynchronous HTTP client for metadata extraction
- `beautifulsoup4>=4.12.0` - HTML parsing for title/snippet extraction
- `youtube-transcript-api>=1.1.0` - Stable YouTube transcript extraction
- `asyncio` - Asynchronous programming support
- `typing-extensions` - Extended type hints

**YouTube Features Status:** YouTube transcript extraction is stable with youtube-transcript-api v1.1.0+ and requires no authentication or additional setup.
MIT License
This project implements the Model Context Protocol specification. It is compatible with any MCP-compliant client and built with the FastMCP framework for easy extension and modification.
One-click installation for Claude Desktop users: this MCP server is available as a DXT (Desktop Extensions) package, located at `dxt-packages/crawl4ai-dxt-correct/`. Simply drag and drop the `.dxt` file into Claude Desktop for instant setup.

For detailed feature documentation in Japanese, see `README_ja.md`.