Open Crawler MCP Server
A web crawler and text extractor with robots.txt compliance, rate limiting, and page size protection.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
Máy chủ liên quan
Bright Data
nhà tài trợDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
Nefino
Access the Nefino renewable energy news API.
rippr
YouTube transcript extraction for AI agents. Clean text, timestamps, or structured JSON from any video. No API keys required.
CodingBaby Browser
A Node.js server that enables AI assistants to control the Chrome browser via WebSocket. Requires the CodingBaby Chrome Extension.
Horse Racing News
Fetches horse racing news from the thoroughbreddailynews.com RSS feed.
open-sales-stack
Collection of B2B sales intelligence MCP servers. Includes website analysis, tech stack detection, hiring signals, review aggregation, ad tracking, social profiles, financial reporting and more for AI-powered prospecting
APIMesh
18 x402-payable web analysis APIs for AI agents — pay per call with USDC on Base, no API keys needed
Cloudflare Playwright
Control a browser for web automation tasks using Playwright on Cloudflare Workers.
Leporello
Remote MCP for Opera & Classical Music Event Schedules
1001Proxy - Proxy MCP Server for AI Agents
Use Claude, OpenAI Cursor, and any MCP-compatible AI agent to buy and manage proxies using natural language. No custom integrations needed - simply connect your client to the server and start chatting.
B2Proxy
1GB Free Trial, World's Leading Proxy Service Platform, Efficient Data Collection