Open Crawler MCP Server

A web crawler and content extractor that supports multiple output formats: text, Markdown, XML, and JSON.

A Model Context Protocol (MCP) server that crawls web pages and extracts their content in multiple output formats.

Features

Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
Smart Content Extraction: CSS selector support for targeted content extraction
Robots.txt Compliance: Checks robots.txt automatically before crawling a page
Rate Limiting: Enforces a minimum of 1 second between requests
Size Protection: Maximum page size limit (10MB) to prevent memory issues
Structured Content: Extract headings, paragraphs, links, images, and lists separately
Error Handling: Comprehensive error codes for different failure scenarios
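
The rate-limiting feature above (at least 1 second between requests) can be illustrated with a small minimum-interval limiter. This is only a sketch of the general technique, not the server's actual implementation: the class name `MinIntervalLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without real delays.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 1.0 mirrors the documented limit
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        """Sleep until the interval has elapsed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            # Remaining part of the interval since the previous request, never negative.
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment this request is allowed to proceed.
        self._last = now + delay
        return delay
```

A caller would invoke `limiter.wait()` immediately before each outgoing request; the first call returns without sleeping.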

MCP Client Configuration

Add this server to your MCP client configuration:

{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}

Available Tools

crawl_page

Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.

Parameters:

url (required): Target URL to crawl
selector (optional): CSS selector for specific content extraction
format (optional): Output format - text, markdown, xml, or json (default: text)
text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)

Output Formats:

text: Clean, plain text content with whitespace normalized
markdown: Well-formatted Markdown with headings, links, images, and lists preserved
xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
json: Structured JSON object containing categorized content elements

Examples:

Basic text extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}

Markdown extraction with CSS selector:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}

Structured JSON extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
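
MCP clients normally send these tool calls for you. For readers scripting against the server directly, the sketch below shows how the example payloads above might be wrapped in a JSON-RPC 2.0 `tools/call` request, which is the message shape MCP uses for tool invocation. The function name `build_tool_call` and its parameter names are hypothetical; only `url`, `selector`, and `format` come from the parameter list above.

```python
import json
from typing import Optional

# Formats listed under "Output Formats" above.
ALLOWED_FORMATS = {"text", "markdown", "xml", "json"}


def build_tool_call(request_id: int, url: str, fmt: str = "text",
                    selector: Optional[str] = None) -> str:
    """Hypothetical helper: serialize a crawl_page call as a JSON-RPC 2.0 request."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    arguments = {"url": url, "format": fmt}
    if selector is not None:
        # The selector is optional, so it is included only when provided.
        arguments["selector"] = selector
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "crawl_page", "arguments": arguments},
    }
    return json.dumps(request)


if __name__ == "__main__":
    print(build_tool_call(1, "https://example.com", "markdown", "article"))
```

Rejecting unknown formats on the client side is a design choice of this sketch; the server may report such errors itself.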

check_robots

Validates if a URL is allowed to be crawled according to the site's robots.txt file.

Parameters:

url (required): URL to check for crawling permission

Example:

{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
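
To make the idea behind check_robots concrete, here is a minimal sketch of a robots.txt permission check using Python's standard `urllib.robotparser`. It illustrates the general technique only and is not how this server is implemented: the helper name `is_allowed`, the wildcard user agent, and the sample rules are all assumptions. The robots.txt text is passed in as a string, so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Sample rules invented for illustration; a real check would fetch the site's robots.txt.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/page"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```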

Error Handling

Common error scenarios:

Network connection issues
Invalid HTML or missing content
Robots.txt restrictions
Request timeouts or rate limits
Content size too large (>10MB)
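
The size-limit scenario above can be sketched as a guard that stops reading once a response exceeds the cap. This is an illustrative sketch, not the server's code: `PageTooLargeError` and `read_limited` are hypothetical names, and the sketch assumes "10MB" means 10 * 1024 * 1024 bytes.

```python
from typing import Iterable

# Assumption: "10MB" is interpreted as 10 MiB.
MAX_PAGE_BYTES = 10 * 1024 * 1024


class PageTooLargeError(Exception):
    """Hypothetical error raised when a page exceeds the size limit."""


def read_limited(chunks: Iterable[bytes], limit: int = MAX_PAGE_BYTES) -> bytes:
    """Join body chunks, aborting as soon as the running total exceeds limit."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            # Stop early so an oversized page is never held fully in memory.
            raise PageTooLargeError(f"page exceeds {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Checking the running total per chunk, instead of the size of the full body after download, is what keeps memory use bounded.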

License

MIT
