# Open Crawler MCP Server

A Model Context Protocol (MCP) server for web crawling and text extraction, with robots.txt compliance, rate limiting, page size protection, and multiple output formats.
## Features
- Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
- Smart Content Extraction: CSS selector support for targeted content extraction
- Robots.txt Compliance: Automatic robots.txt checking and compliance
- Rate Limiting: Built-in rate limiting (minimum of 1 second between requests)
- Size Protection: Maximum page size limit (10MB) to prevent memory issues; both safeguards are sketched just after this list
- Structured Content: Extract headings, paragraphs, links, images, and lists separately
- Error Handling: Comprehensive error codes for different failure scenarios
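The rate-limiting and size-protection behavior is easy to picture in code. Below is a minimal TypeScript sketch of a fetch wrapper that enforces a 1-second minimum interval and a 10MB cap. It is an illustration under stated assumptions, not this package's actual implementation; `politeFetch` and the constants are hypothetical names.

```typescript
// Hypothetical sketch of the two safeguards described above; not the
// actual implementation of @elchika-inc/open-crawler-mcp-server.

const MIN_INTERVAL_MS = 1_000;           // 1 second minimum between requests
const MAX_PAGE_BYTES = 10 * 1024 * 1024; // 10MB page size limit

let lastRequestAt = 0;

async function politeFetch(url: string): Promise<string> {
  // Rate limiting: wait until at least MIN_INTERVAL_MS has passed
  // since the previous request.
  const wait = lastRequestAt + MIN_INTERVAL_MS - Date.now();
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  lastRequestAt = Date.now();

  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);

  // Size protection: reject pages that declare or exceed the cap.
  const declared = Number(res.headers.get("content-length") ?? "0");
  if (declared > MAX_PAGE_BYTES) throw new Error("Page exceeds 10MB limit");

  const body = await res.arrayBuffer();
  if (body.byteLength > MAX_PAGE_BYTES) throw new Error("Page exceeds 10MB limit");
  return new TextDecoder().decode(body);
}
```

Checking `Content-Length` before reading the body skips downloads that are already known to be oversized; the post-download check covers servers that omit the header.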
## MCP Client Configuration
Add this server to your MCP client configuration:
```json
{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
```
## Available Tools

### crawl_page
Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.
**Parameters:**

- `url` (required): Target URL to crawl
- `selector` (optional): CSS selector for targeted content extraction
- `format` (optional): Output format, one of `text`, `markdown`, `xml`, or `json` (default: `text`)
- `text_only` (optional): Legacy flag for text-only extraction (deprecated; use `format` instead)
**Output Formats:**

- `text`: Clean plain text with whitespace normalized
- `markdown`: Well-formatted Markdown with headings, links, images, and lists preserved
- `xml`: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- `json`: Structured JSON object containing categorized content elements
**Examples:**
Basic text extraction:
```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}
```
Markdown extraction with CSS selector:
```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}
```
Structured JSON extraction:
```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
```
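For programmatic use outside an MCP client app, a call to `crawl_page` through the official TypeScript SDK might look like the following sketch. The import paths and `callTool` signature match recent versions of `@modelcontextprotocol/sdk`, but treat them as version-dependent assumptions.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio, mirroring the client configuration above.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@elchika-inc/open-crawler-mcp-server"],
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Same arguments as the Markdown extraction example above.
const result = await client.callTool({
  name: "crawl_page",
  arguments: {
    url: "https://example.com",
    selector: "article",
    format: "markdown",
  },
});

console.log(result.content);
await client.close();
```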
### check_robots

Checks whether a URL may be crawled according to the site's robots.txt file.

**Parameters:**

- `url` (required): URL to check for crawling permission
**Example:**

```json
{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
```
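Conceptually, the check fetches the site's `/robots.txt` and tests the URL's path against its `Disallow` rules. The TypeScript sketch below is deliberately simplified and is not the server's parser: it ignores user-agent groups, `Allow` precedence, and wildcards, all of which a complete implementation handles.

```typescript
// Simplified robots.txt check for illustration only; a real parser
// matches user-agent groups and honors Allow rules and wildcards.
async function isAllowedByRobots(pageUrl: string): Promise<boolean> {
  const url = new URL(pageUrl);
  const res = await fetch(new URL("/robots.txt", url.origin));
  if (!res.ok) return true; // no robots.txt: treat crawling as allowed

  const disallowed = (await res.text())
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((rule) => rule.length > 0);

  return !disallowed.some((rule) => url.pathname.startsWith(rule));
}
```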
## Error Handling

Common error scenarios (a client-side handling sketch follows the list):
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
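How these surface to a caller depends on the client, but with the MCP TypeScript SDK a tool failure typically arrives either as a thrown protocol error or as a result whose `isError` flag is set. A hedged sketch, reusing the `client` from the `crawl_page` example above; the server's specific error codes and messages are its own and are not reproduced here.

```typescript
try {
  const result = await client.callTool({
    name: "crawl_page",
    arguments: { url: "https://example.com/huge-page", format: "text" },
  });
  if (result.isError) {
    // Robots.txt denial, oversized pages, timeouts, etc. can be
    // reported as tool-level errors with a descriptive payload.
    console.error("crawl_page failed:", result.content);
  } else {
    console.log(result.content);
  }
} catch (err) {
  // Transport- or protocol-level failures land here.
  console.error("MCP request failed:", err);
}
```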
## License
MIT