Open Crawler MCP Server
A web crawler and content extractor that supports multiple output formats like text, markdown, and JSON.
Open Crawler MCP Server
A Model Context Protocol (MCP) server for web crawling and content extraction from web pages with multiple output formats.
Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (1 second minimum between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
MCP Client Configuration
Add this server to your MCP client configuration:
{
"mcpServers": {
"open-crawler": {
"command": "npx",
"args": ["@elchika-inc/open-crawler-mcp-server"]
}
}
}
Available Tools
crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
Parameters:
url(required): Target URL to crawlselector(optional): CSS selector for specific content extractionformat(optional): Output format -text,markdown,xml, orjson(default:text)text_only(optional): Legacy parameter for text-only extraction (deprecated, useformatinstead)
Output Formats:
text: Clean, plain text content with whitespace normalizedmarkdown: Well-formatted Markdown with headings, links, images, and lists preservedxml: Structured XML with separate sections for headings, paragraphs, links, images, and listsjson: Structured JSON object containing categorized content elements
Examples:
Basic text extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "text"
}
}
Markdown extraction with CSS selector:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article",
"format": "markdown"
}
}
Structured JSON extraction:
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "json"
}
}
check_robots
Validates if a URL is allowed to be crawled according to the site's robots.txt file.
Parameters:
url(required): URL to check for crawling permission
Example:
{
"name": "check_robots",
"arguments": {
"url": "https://example.com/page"
}
}
Error Handling
Common error scenarios:
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
License
MIT
Serveurs connexes
Bright Data
sponsorDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
Career Site Jobs
A MCP server to retrieve up-to-date jobs from company career sites.
NBA Player Stats
Provides comprehensive NBA player statistics from basketball-reference.com, including career stats, season comparisons, and advanced metrics.
Crawl4AI RAG
Integrates web crawling and Retrieval-Augmented Generation (RAG) into AI agents and coding assistants.
YouTube MCP Server
Extract metadata and captions from YouTube videos and convert them to markdown.
YouTube Video Summarizer MCP
Fetch and summarize YouTube videos by extracting titles, descriptions, and transcripts.
DeepResearch MCP
A powerful research assistant for conducting iterative web searches, analysis, and report generation.
Lightpanda Go MCP server
A Go-based MCP server for interacting with the Lightpanda Browser using the Chrome DevTools Protocol (CDP).
Conduit
Headless browser with SHA-256 hash-chained audit trails and Ed25519-signed proof bundles. MCP server for AI agents.
MCP Go Colly Crawler
A web crawling framework that integrates the Model Context Protocol (MCP) with the Colly web scraping library.
Website to Markdown MCP Server
Fetches and converts website content to Markdown with AI-powered cleanup, OpenAPI support, and stealth browsing.