Open Crawler MCP Server
A web crawler and content extractor that supports multiple output formats like text, markdown, and JSON.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
相關伺服器
Bright Data
贊助Discover, extract, and interact with the web - one interface powering automated access across the public internet.
YouTube
Fetch YouTube subtitles
WebforAI Text Extractor
Extracts plain text from web pages using WebforAI.
HasData
HasData APIs - Google SERP, Amazon, Zillow, Indeed, Maps, and more
Horse Racing News
Fetches horse racing news from the thoroughbreddailynews.com RSS feed.
ShopGraph
Structured product data from the open web — where platform APIs don't reach. Schema.org + AI extraction. Pay per call via Stripe MPP.
Browser MCP
A fast, lightweight MCP server that empowers LLMs with browser automation via Puppeteer’s structured accessibility data, featuring optional vision mode for complex visual understanding and flexible, cross-platform configuration.
B2Proxy
1GB Free Trial, World's Leading Proxy Service Platform, Efficient Data Collection
Icypeas
Icypeas MCP allows agents to source leads in a gigantic +575M lead database, and enrich these leads with emails
WebSearch
A web search and content extraction tool using the Firecrawl API for advanced web scraping, searching, and content analysis.
neo-vision
Spatial DOM maps for AI agent browser navigation with anti-bot stealth and human-like behavioral simulation