Web Scraper Service (MCP Stdin/Stdout & HTTP)
A Python-based MCP server for robust, headless web scraping. It extracts the main text content from web pages and outputs it as Markdown, text, or HTML for seamless integration with AI tools and automation.
Key Features
- Headless browser scraping (Playwright, BeautifulSoup, Markdownify)
- Outputs Markdown, text, or HTML
- Designed for MCP (Model Context Protocol) stdio/JSON-RPC integration
- Dual transport: stdio (default) and Streamable HTTP for shared service mode
- Persistent browser pool: Chromium stays alive across requests for fast scraping
- Smart DOM wait: MutationObserver-based content stabilization instead of fixed sleep
- Dockerized, with pre-built images
- Configurable via environment variables
- Robust error handling (timeouts, HTTP errors, Cloudflare, etc.)
- Per-domain rate limiting
- Easy integration with AI tools and IDEs (Cursor, Claude Desktop, Continue, JetBrains, Zed, etc.)
- One-click install for Cursor, interactive installer for Claude
Quick Start
Run with Docker (stdio mode — one container per client)
docker run -i --rm ghcr.io/justazul/web-scrapper-stdio
Run as Shared HTTP Service (one container, multiple clients)
docker run -d --name web-scraper \
-e MCP_TRANSPORT=streamable-http \
-e MCP_HTTP_PORT=8080 \
-e BROWSER_POOL_SIZE=3 \
-p 8080:8080 \
--shm-size=3gb \
ghcr.io/justazul/web-scrapper-stdio
Or with Docker Compose:
docker compose --profile service up -d
One-Click Installation (Cursor IDE)
The project repository provides a one-click install badge for Cursor; the equivalent manual configuration is shown in the Cursor IDE section below.
Transport Modes
stdio (default)
Each MCP client spawns its own container via docker run -i. Simple, zero configuration, works with any MCP client.
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": ["run", "-i", "--rm", "ghcr.io/justazul/web-scrapper-stdio"]
}
}
}
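Under the hood, this configuration simply launches the container and exchanges newline-delimited JSON-RPC messages over stdin/stdout. As an illustration (not the project's own client code), here is a minimal sketch of the tools/call request a client would write; the message shape follows the MCP spec, and the example URL is a placeholder:

```python
import json

# Build an MCP tools/call request for the scrape_web tool.
# The JSON-RPC framing is standard; the arguments mirror the
# tool parameters documented in the Usage section.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_web",
        "arguments": {
            "url": "https://example.com",
            "output_format": "markdown",
        },
    },
}

# A stdio client writes one JSON message per line to the container's stdin.
line = json.dumps(request)
print(line)
```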
Streamable HTTP (shared service)
Run one persistent container that serves multiple MCP clients over HTTP. Saves resources when running multiple AI tool instances (e.g., multiple Claude Code sessions).
Start the service:
docker run -d --name web-scraper \
-e MCP_TRANSPORT=streamable-http \
-e MCP_HTTP_PORT=8080 \
-p 8080:8080 \
--shm-size=3gb \
ghcr.io/justazul/web-scrapper-stdio
Connect from your MCP client:
{
"mcpServers": {
"web-scrapper": {
"url": "http://localhost:8080/mcp"
}
}
}
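Clients using the HTTP transport POST the same JSON-RPC messages to the /mcp endpoint. A stdlib-only sketch of assembling such a request follows; nothing is sent here, and it assumes the service started above is listening on localhost:8080:

```python
import json
import urllib.request

# JSON-RPC payload for a scrape_web tool call (same shape as stdio mode).
payload = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "scrape_web", "arguments": {"url": "https://example.com"}},
}).encode()

# The Accept header advertises both JSON and SSE responses,
# as the Streamable HTTP transport in the MCP spec expects.
req = urllib.request.Request(
    "http://localhost:8080/mcp",
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it once the service is running.
```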
Integration with AI Tools & IDEs
This service supports integration with a wide range of AI tools and IDEs that implement the Model Context Protocol (MCP). Below are ready-to-use configuration examples for the most popular environments. Replace the image/tag as needed for custom builds.
Cursor IDE
Add to your .cursor/mcp.json (project-level) or ~/.cursor/mcp.json (global):
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"ghcr.io/justazul/web-scrapper-stdio"
]
}
}
}
Claude Desktop
Add to your Claude Desktop MCP config (typically claude_desktop_config.json):
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"ghcr.io/justazul/web-scrapper-stdio"
]
}
}
}
Claude Code
Add to your .mcp.json or global ~/.claude.json:
stdio mode (one container per session):
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": ["run", "-i", "--rm", "ghcr.io/justazul/web-scrapper-stdio"]
}
}
}
HTTP mode (shared service — start the service first):
{
"mcpServers": {
"web-scrapper": {
"url": "http://localhost:8080/mcp"
}
}
}
Continue (VSCode/JetBrains Plugin)
Add to your continue.config.json or via the Continue plugin MCP settings:
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"ghcr.io/justazul/web-scrapper-stdio"
]
}
}
}
IntelliJ IDEA (JetBrains AI Assistant)
Go to Settings > Tools > AI Assistant > Model Context Protocol (MCP) and add a new server. Use:
{
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"ghcr.io/justazul/web-scrapper-stdio"
]
}
Zed Editor
Add to your Zed MCP config (see Zed docs for the exact path):
{
"mcpServers": {
"web-scrapper-stdio": {
"command": "docker",
"args": [
"run",
"-i",
"--rm",
"ghcr.io/justazul/web-scrapper-stdio"
]
}
}
}
Usage
MCP Server (Tool/Prompt)
This web scraper is exposed as an MCP (Model Context Protocol) tool, so AI models and other automation can call it directly.
Tool: scrape_web
Parameters:
- url (string, required): The URL to scrape
- max_length (integer, optional): Maximum length of returned content (default: unlimited)
- timeout_seconds (integer, optional): Timeout in seconds for the page load (default: 30)
- user_agent (string, optional): Custom User-Agent string passed directly to the browser (defaults to a random agent)
- wait_for_network_idle (boolean, optional): Wait for network activity to settle before scraping (default: true)
- custom_elements_to_remove (list of strings, optional): Additional HTML elements (CSS selectors) to remove before extraction
- grace_period_seconds (float, optional): Time to wait for JS rendering after navigation. Uses MutationObserver for smart detection. Set to 0 to skip entirely. (default: 0.5)
- output_format (string, optional): markdown, text, or html (default: markdown)
- click_selector (string, optional): If provided, click the element matching this selector after navigation and before extraction
Returns:
- Markdown formatted content extracted from the webpage, as a string
- Errors are reported as strings starting with [ERROR] ...
Example: Using click_selector and custom_elements_to_remove
{
"url": "http://uitestingplayground.com/clientdelay",
"click_selector": "#ajaxButton",
"grace_period_seconds": 10,
"custom_elements_to_remove": [".ads-banner", "#popup"],
"output_format": "markdown"
}
Prompt: scrape
Parameters:
- url (string, required): The URL to scrape
- output_format (string, optional): markdown, text, or html (default: markdown)
Returns:
- Content extracted from the webpage in the chosen format
Note:
- Markdown is returned by default, but text or HTML can be requested via output_format.
- The scraper does not check robots.txt and will attempt to fetch any URL provided.
- No REST API or CLI tool is included; this is a pure MCP stdio/JSON-RPC tool.
- The scraper always extracts the full <body> content of web pages, applying only essential noise removal (removing script, style, nav, footer, aside, header, and similar non-content tags).
- The scraper detects and handles Cloudflare challenge screens, returning a specific error string.
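To illustrate the noise-removal step, here is a minimal stdlib-only sketch; the real implementation uses BeautifulSoup and Markdownify, so treat this as an approximation of the idea rather than the actual code:

```python
from html.parser import HTMLParser

# Tags dropped before extraction, per the note above.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "header"}

class BodyTextExtractor(HTMLParser):
    """Collect text from the page body, skipping noise tags entirely."""
    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0  # >0 while inside a noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth_in_noise += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth_in_noise:
            self.depth_in_noise -= 1

    def handle_data(self, data):
        if self.depth_in_noise == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<body><nav>Menu</nav><p>Main content</p><script>x()</script></body>"
parser = BodyTextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # → Main content
```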
Configuration
You can override most configuration options using environment variables:
Core Settings
- DEFAULT_TIMEOUT_SECONDS: Timeout for page loads and navigation (default: 30)
- DEFAULT_MIN_CONTENT_LENGTH: Minimum content length for extracted text (default: 100)
- DEFAULT_MIN_CONTENT_LENGTH_SEARCH_APP: Minimum content length for search.app domains (default: 30)
- DEFAULT_MIN_SECONDS_BETWEEN_REQUESTS: Minimum delay between requests to the same domain (default: 2)
- DEFAULT_GRACE_PERIOD_SECONDS: Default grace period for JS rendering (default: 0.5)
- DEBUG_LOGS_ENABLED: Set to true to enable debug-level logs (default: false)
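These settings follow the common pattern of typed environment overrides with fallback defaults. A sketch of how they could be read (the helper names env_float and env_bool are illustrative, not the project's actual code):

```python
import os

def env_float(name, default):
    """Read a numeric setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default

def env_bool(name, default):
    """Read a boolean setting; only the string 'true' counts as true."""
    raw = os.environ.get(name)
    return raw.lower() == "true" if raw is not None else default

# Defaults mirror the documented values above.
DEFAULT_TIMEOUT_SECONDS = env_float("DEFAULT_TIMEOUT_SECONDS", 30)
DEFAULT_GRACE_PERIOD_SECONDS = env_float("DEFAULT_GRACE_PERIOD_SECONDS", 0.5)
DEBUG_LOGS_ENABLED = env_bool("DEBUG_LOGS_ENABLED", False)
```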
Browser Pool
- BROWSER_POOL_ENABLED: Enable persistent browser pool (default: true). Set to false for per-request browser launch (original behavior).
- BROWSER_POOL_SIZE: Number of Chromium instances to keep alive (default: 2). Each instance uses ~100-200MB RAM.
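The pool pattern can be sketched with a simple blocking queue: N browser handles are created once and checked in and out per request. The real pool launches Chromium via Playwright; here that is replaced with a placeholder factory:

```python
import queue

class BrowserPool:
    """Keep a fixed number of browser handles alive and reuse them."""
    def __init__(self, size, factory):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())  # launch once, reuse many times

    def acquire(self):
        return self._pool.get()  # blocks when all handles are in use

    def release(self, browser):
        self._pool.put(browser)

# Placeholder factory; the real pool would launch Chromium here.
pool = BrowserPool(size=2, factory=lambda: object())
browser = pool.acquire()
# ... scrape with the browser ...
pool.release(browser)
```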
Transport
- MCP_TRANSPORT: Transport mode, stdio or streamable-http (default: stdio)
- MCP_HTTP_PORT: HTTP server port when using streamable-http transport (default: 8080)
- MCP_HTTP_HOST: HTTP server bind address (default: 0.0.0.0)
Cloudflare Bypass
- CAPTCHA_API_KEY: API key for the captcha solver service. When set, Cloudflare Turnstile challenges are solved automatically. When empty (default), CF-protected pages return an error.
- CAPTCHA_PROVIDER: Captcha solver provider: 2captcha, capsolver, or capmonster (default: 2captcha)
- CAPTCHA_BASE_URL: Custom solver API endpoint (default: uses the provider's official URL)
- CAPTCHA_TIMEOUT: Timeout in seconds for captcha solving (default: 120)
Test Settings
- DEFAULT_TEST_REQUEST_TIMEOUT: Timeout for test requests (default: 10)
- DEFAULT_TEST_NO_DELAY_THRESHOLD: Threshold for skipping artificial delays in tests (default: 0.5)
Error Handling & Limitations
- The scraper detects and returns errors for navigation failures, timeouts, HTTP errors (including 404), and Cloudflare anti-bot challenges.
- Rate limiting is enforced per domain (default: 2 seconds between requests).
- Cloudflare bypass: Uses Patchright (CDP-level anti-detection) for passive evasion. Most CF-protected sites are scraped without triggering a challenge. When a Turnstile challenge is triggered and CAPTCHA_API_KEY is set, it is solved automatically via a third-party API.
- Limitations:
- No REST API or CLI tool (MCP stdio/JSON-RPC only)
- No support for non-HTML content (PDF, images, etc.)
- No authentication or session management for protected pages
- Not intended for scraping at scale or violating site terms
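The per-domain rate limiting mentioned above can be sketched as a map from domain to last-request timestamp; this is an illustrative sketch, not the project's actual implementation:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""
    def __init__(self, min_interval=2.0):  # default mirrors the documented 2s
        self.min_interval = min_interval
        self.last_request = defaultdict(float)  # domain -> monotonic timestamp

    def wait(self, domain):
        elapsed = time.monotonic() - self.last_request[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[domain] = time.monotonic()

limiter = DomainRateLimiter(min_interval=0.1)
limiter.wait("example.com")   # first request to a domain: no delay
limiter.wait("other.org")     # different domain: no delay
limiter.wait("example.com")   # same domain: sleeps until 0.1s has passed
```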
Development & Testing
Running Tests (Docker Compose)
All tests must be run using Docker Compose. Do not run tests outside Docker.
- All tests: docker compose up --build --abort-on-container-exit test
- MCP server tests only: docker compose up --build --abort-on-container-exit test_mcp
- Scraper tests only: docker compose up --build --abort-on-container-exit test_scrapper
Running Benchmarks
docker compose run --rm benchmark
Results are stored in benchmarks/RESULTS.md.
Contributing
Contributions are welcome! Please open issues or pull requests for bug fixes, features, or improvements. If you plan to make significant changes, open an issue first to discuss your proposal.
License
This project is licensed under the MIT License.