Open Crawler MCP Server
A web crawler and content extractor that supports multiple output formats like text, markdown, and JSON.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
Servidores relacionados
Bright Data
patrocinadorDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
Jina Reader
Fetch the content of a remote URL as Markdown with Jina Reader.
Cloudflare Playwright
Control a browser for web automation tasks like navigation, typing, clicking, and taking screenshots using Playwright on Cloudflare Workers.
Read Website Fast
Fast, token-efficient web content extraction that converts websites to clean Markdown. Features Mozilla Readability, smart caching, polite crawling with robots.txt support, and concurrent fetching with minimal dependencies.
MCP YouTube Extract
Extracts information from YouTube videos and channels using the YouTube Data API.
MCP Server Collector
Discovers and collects MCP servers from the internet.
Scrapfly
Scrapfly MCP Server gives AI agents a simple, unified way to scrape live web data with built-in anti-bot handling.
Instagram Downloader
A server to download videos and media from Instagram.
MCP Browser Use Secure
A secure MCP server for browser automation with enhanced security features like multi-layered protection and session isolation.
Rapidproxy
Over 70M+ premium IPs via Rapidproxy - Enjoy easy data extraction, avoiding CAPTCHAs, IP blocks with 220+ locations targeting, non-expiring traffic.
CarDeals-MCP
A Model Context Protocol (MCP) service that indexes and queries car-deal contexts - fast, flexible search for vehicle listings and marketplace data.