Open Crawler MCP Server
A web crawler and text extractor with robots.txt compliance, rate limiting, and page size protection.
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Checks robots.txt automatically before crawling a page
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
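The rate-limiting feature above can be pictured with a small sketch. This is not the server's actual implementation; `MinIntervalLimiter` and its parameters are hypothetical names, and the only detail taken from this README is the 1 second minimum gap between requests.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforces a minimum gap between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 1.0 mirrors the README's stated limit
        self._clock = clock               # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None                 # time of the previous request, None before the first

    def wait(self):
        """Sleep until min_interval has elapsed since the last request; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is considered to have started.
        self._last = now + delay
        return delay
```

A caller would invoke `limiter.wait()` immediately before each outgoing request, so bursts are spread out to at most one request per interval.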
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
- url (required): Target URL to crawl
- selector (optional): CSS selector for specific content extraction
- format (optional): Output format, one of text, markdown, xml, or json (default: text)
- text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)
Output Formats:
- text: Clean, plain text content with whitespace normalized
- markdown: Well-formatted Markdown with headings, links, images, and lists preserved
- xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- json: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
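The example payloads above can also be assembled programmatically. The sketch below wraps the `crawl_page` arguments in an MCP `tools/call` JSON-RPC request and rejects formats this README does not list. `build_crawl_request` is a hypothetical helper, not part of this package, and how the request is transported to the server depends on your MCP client.

```python
import json

# Formats listed in this README for crawl_page.
ALLOWED_FORMATS = ("text", "markdown", "xml", "json")


def build_crawl_request(url, fmt="text", selector=None, request_id=1):
    """Hypothetical helper: build a JSON-RPC 2.0 tools/call request for crawl_page."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    arguments = {"url": url, "format": fmt}
    if selector is not None:
        # Only include the optional selector when the caller supplies one.
        arguments["selector"] = selector
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "crawl_page", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_crawl_request("https://example.com", fmt="markdown", selector="article")
    print(json.dumps(request, indent=2))
```

Validating the format on the client side is a design choice in this sketch; the server presumably performs its own validation as well.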
check_robots
Checks whether a URL may be crawled according to the site's robots.txt file.
Parameters:
- url (required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
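To illustrate what a robots.txt check involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how this server implements `check_robots`; the `is_allowed` helper, the `*` user agent, and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules invented for this example.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Hypothetical helper: return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/page"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report"))
```

In a real crawl the robots.txt content would first be fetched from the target site's root, which this sketch leaves out.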
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
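The size limit in the last item can be sketched as a capped read over a stream of chunks. This is an illustration rather than the server's code: `read_capped` and `PageTooLargeError` are hypothetical names, and reading "10MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: the README's 10MB limit is interpreted as 10 MiB.
MAX_PAGE_BYTES = 10 * 1024 * 1024


class PageTooLargeError(Exception):
    """Hypothetical error raised when a page exceeds the size limit."""


def read_capped(chunks, limit=MAX_PAGE_BYTES):
    """Join byte chunks, aborting as soon as the running total exceeds limit."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            # Stop early so an oversized page is never fully held in memory.
            raise PageTooLargeError(f"page exceeds {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Checking the running total while streaming, instead of trusting a Content-Length header, also covers responses that omit or misreport their size.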
License
MIT