Open Crawler MCP Server
A web crawler and content extractor that supports multiple output formats like text, markdown, and JSON.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
関連サーバー
Bright Data
スポンサーDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
Puppeteer
A server for browser automation using Puppeteer, enabling web scraping, screenshots, and JavaScript execution.
Socialcrawl MCP
Single API key to access 21 + real time social media data
Crew Risk
A crawler compliance risk assessment system via a simple API.
ScrAPI MCP Server
A server for scraping web pages using the ScrAPI API.
MCP Browser Use Secure
A secure MCP server for browser automation with enhanced security features like multi-layered protection and session isolation.
YouTube Transcript
Fetches transcripts for YouTube videos.
Intercept
Give your AI the ability to read the web. Fetches URLs as clean markdown with 9 fallback strategies. Handles tweets, YouTube, arXiv, PDFs, and regular pages.
yt-dlp-mcp
Download video and audio from various platforms like YouTube, Facebook, and TikTok using yt-dlp.
Scrappa
Remote MCP server for Claude, Cursor, VS Code, and Windsurf with access to 80+ web scraping and data extraction APIs.
Anysite
Turn any website into an API