Open Crawler MCP Server
A web crawler and content extractor that supports multiple output formats like text, markdown, and JSON.
A Model Context Protocol (MCP) server that crawls web pages and extracts their content in multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
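To make the rate-limiting feature concrete, here is a minimal Python sketch of a minimum-interval limiter. It illustrates the general idea of keeping at least 1 second between requests; it is not the server's actual implementation, and the class name `MinIntervalLimiter` and its parameters are assumptions made for this example.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive calls (illustrative only)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the last call.

        Returns the number of seconds slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record the moment the request is actually allowed to proceed.
        self._last = now + delay
        return delay
```

A caller would invoke `wait()` right before each outgoing request, so a burst of calls is spread out to at most one per second.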
MCP Client Configuration
Add this server to your MCP client configuration:
{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
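If the client configuration already lists other servers, the entry has to be merged in rather than pasted over the file. The Python sketch below shows one way to do that on an in-memory config; the helper name `add_open_crawler` and the `other-server` entry are hypothetical, and where the config file lives depends on your MCP client.

```python
import json


def add_open_crawler(config):
    """Return a copy of an MCP client config with the open-crawler entry added."""
    merged = dict(config)
    # Copy the existing server map so other entries are preserved.
    servers = dict(merged.get("mcpServers", {}))
    servers["open-crawler"] = {
        "command": "npx",
        "args": ["@elchika-inc/open-crawler-mcp-server"],
    }
    merged["mcpServers"] = servers
    return merged


# "other-server" is a made-up entry standing in for whatever is already configured.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_open_crawler(existing), indent=2))
```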
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
- url (required): Target URL to crawl
- selector (optional): CSS selector for specific content extraction
- format (optional): Output format, one of text, markdown, xml, or json (default: text)
- text_only (optional): Legacy parameter for text-only extraction (deprecated; use format instead)
Output Formats:
- text: Clean, plain text content with whitespace normalized
- markdown: Well-formatted Markdown with headings, links, images, and lists preserved
- xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- json: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}
Markdown extraction with CSS selector:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}
Structured JSON extraction:
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
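Most MCP clients build the request for you, but it can help to see how the name and arguments shown above fit into the JSON-RPC 2.0 `tools/call` message that MCP uses. The following Python sketch builds such a message and validates the format value on the client side; the function name `build_crawl_request` and the `request_id` parameter are assumptions made for this example, not part of the server.

```python
import json

# The four output formats documented for crawl_page.
ALLOWED_FORMATS = {"text", "markdown", "xml", "json"}


def build_crawl_request(url, fmt="text", selector=None, request_id=1):
    """Build a JSON-RPC 2.0 tools/call message for the crawl_page tool."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    arguments = {"url": url, "format": fmt}
    if selector is not None:
        arguments["selector"] = selector  # optional CSS selector
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "crawl_page", "arguments": arguments},
    }


request = build_crawl_request("https://example.com", fmt="markdown", selector="article")
print(json.dumps(request, indent=2))
```

Rejecting an unknown format before sending avoids a round trip to the server for a request that cannot succeed.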
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
- url (required): URL to check for crawling permission
Example:
{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
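For readers unfamiliar with robots.txt rules, the Python sketch below shows the kind of decision a robots check makes, using the standard library's `urllib.robotparser`. It is only an illustration of the concept: how this server parses robots.txt and which user agent it sends are not documented here, so the helper `is_allowed`, the `*` user agent, and the sample rules are all assumptions.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration: everything under /private/ is off limits.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE, "https://example.com/page"))          # True
print(is_allowed(SAMPLE, "https://example.com/private/data"))  # False
```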
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
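On the client side, a failed tool call in MCP is normally reported through the `isError` flag on the tool result, with the details carried as text content. The Python sketch below handles a result under that assumption; the `CrawlError` class, the `extract_text` helper, and the sample error message are hypothetical, and the exact wording of this server's error messages is not specified here.

```python
class CrawlError(Exception):
    """Raised when a tool call reports failure (hypothetical helper)."""


def extract_text(result):
    """Join the text items of an MCP tool result, raising if it is flagged as an error."""
    parts = [
        item.get("text", "")
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]
    text = "\n".join(parts)
    if result.get("isError"):
        raise CrawlError(text or "tool call failed")
    return text


# Sample results with made-up contents.
ok = {"content": [{"type": "text", "text": "Example Domain"}]}
failed = {"isError": True, "content": [{"type": "text", "text": "Blocked by robots.txt"}]}

print(extract_text(ok))
try:
    extract_text(failed)
except CrawlError as exc:
    print(f"crawl failed: {exc}")
```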
License
MIT