Open Crawler MCP Server
A web crawler and text extractor with robots.txt compliance, rate limiting, and page size protection.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
関連サーバー
Bright Data
スポンサーDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
MCP FetchPage
Intelligent web page fetching with automatic cookie support and CSS selector extraction.
MCP Deep Web Research Server
An advanced web research server with intelligent search queuing, enhanced content extraction, and deep research capabilities.
Instagram Downloader
A server to download videos and media from Instagram.
JinaAI Reader
Extracts web content using the Jina.ai Reader API.
Selenix MCP
MCP server that bridges Claude Desktop (or any other local app supporting MCP) with Selenix for browser automation and testing. Enables creating, running, debugging, and managing browser automation tests through natural language.
LinkedIn MCP
Scrape LinkedIn profiles and companies, get recommended jobs, and perform job searches.
SteadyFetch
Reliable web fetching for AI agents with retry, circuit breaker, caching, and anti-bot bypass
webcheck-mcp
Website health checker MCP server - SEO audit, accessibility scan, broken link detection, performance analysis, and page comparison.
comet-mcp
Connect Claude Code to Perplexity Comet browser for agentic web browsing, deep research, and real-time task monitoring
Web Scout
An MCP server for web search and content extraction using DuckDuckGo.