Open Crawler MCP Server
A Model Context Protocol (MCP) server that crawls web pages and extracts their content in multiple output formats, with built-in robots.txt compliance, rate limiting, and page size protection.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
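Where this snippet lives depends on the client; Claude Desktop, for example, reads it from its claude_desktop_config.json settings file.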
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
- url (required): Target URL to crawl
- selector (optional): CSS selector for targeted content extraction
- format (optional): Output format - text, markdown, xml, or json (default: text)
- text_only (optional): Legacy flag for text-only extraction (deprecated; use format instead)
Output Formats:
- text: Clean, plain text content with normalized whitespace
- markdown: Well-formatted Markdown with headings, links, images, and lists preserved
- xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- json: Structured JSON object containing categorized content elements
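The exact shape of the json output is not documented here. Based on the categories above, a result along these lines is plausible (the per-item fields such as level and href are assumptions, not the server's confirmed schema):
{
  "headings": [{ "level": 1, "text": "Example Domain" }],
  "paragraphs": ["This domain is for use in illustrative examples."],
  "links": [{ "text": "More information...", "href": "https://www.iana.org/domains/example" }],
  "images": [],
  "lists": []
}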
Examples:
Basic text extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}
Markdown extraction with CSS selector:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}
Structured JSON extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
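Structured XML extraction (the xml value is documented above; the call mirrors the other examples):
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "xml"
  }
}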
check_robots
Checks whether a URL may be crawled according to the site's robots.txt file.
Parameters:
- url (required): URL to check for crawling permission
Example:
{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
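The response format for check_robots is not specified in this document; a minimal illustrative result could look like the following, where the allowed and url fields are assumptions rather than the server's confirmed schema:
{
  "allowed": true,
  "url": "https://example.com/page"
}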
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
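The individual error codes are not listed here. Since MCP servers communicate over JSON-RPC, a failure such as an oversized page might surface as an error object along these lines (the code and message values are illustrative assumptions):
{
  "error": {
    "code": -32001,
    "message": "Page size exceeds the 10MB limit"
  }
}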
License
MIT