Open Crawler MCP Server
A web crawler and text extractor with robots.txt compliance, rate limiting, and page size protection.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url
(required): Target URL to crawlselector
(optional): CSS selector for specific content extractionformat
(optional): Output format -text
,markdown
,xml
, orjson
(default:text
)text_only
(optional): Legacy parameter for text-only extraction (deprecated, useformat
instead)
Output Formats:
text
: Clean, plain text content with whitespace normalizedmarkdown
: Well-formatted Markdown with headings, links, images, and lists preservedxml
: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson
: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url
(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
Related Servers
Scrapezy
Turn websites into datasets with Scrapezy
Xiaohongshu Search & Comment
An automated tool to search notes, retrieve content, and post comments on Xiaohongshu (RedBook) using Playwright.
Crawl4AI MCP Server
An MCP server for advanced web crawling, content extraction, and AI-powered analysis using the crawl4ai library.
Financial Data MCP Server
Provides real-time financial market data from Yahoo Finance.
Bilibili
Interact with the Bilibili video website, enabling actions like searching for videos, retrieving video information, and accessing user data.
Yahoo Finance
Provides comprehensive financial data from Yahoo Finance, including historical prices, company info, financial statements, and market news.
Weibo
Scrape Weibo user information, feeds, and perform searches.
Puppeteer Vision
Scrape webpages and convert them to markdown using Puppeteer. Features AI-driven interaction capabilities.
Feed
A server for fetching and parsing RSS, Atom, and JSON feeds.
Firecrawl MCP
Adds powerful web scraping and search capabilities to LLM clients like Cursor and Claude.