# Open Crawler MCP Server

A Model Context Protocol (MCP) server for web crawling and text extraction, with robots.txt compliance, rate limiting, page size protection, and multiple output formats.
## Features
- **Multiple Output Formats**: Extract content as text, Markdown, structured XML, or JSON
- **Smart Content Extraction**: CSS selector support for targeted content extraction
- **Robots.txt Compliance**: Automatic robots.txt checking and compliance
- **Rate Limiting**: Built-in rate limiting (1 second minimum between requests); see the sketch after this list
- **Size Protection**: Maximum page size limit (10MB) to prevent memory issues
- **Structured Content**: Extract headings, paragraphs, links, images, and lists separately
- **Error Handling**: Comprehensive error codes for different failure scenarios
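The rate limiting and size protection are internal to the server, but the general pattern is simple: gate each fetch on a minimum interval, and cap how many bytes are buffered. The sketch below is illustrative only, with assumed names; it is not the server's actual implementation.

```typescript
// Illustrative sketch only; not the server's actual code.
const MIN_INTERVAL_MS = 1_000;        // 1 second minimum between requests
const MAX_BYTES = 10 * 1024 * 1024;   // 10MB page size limit

let lastRequestAt = 0;

async function politeFetch(url: string): Promise<string> {
  // Rate limiting: wait until MIN_INTERVAL_MS has elapsed since the last request.
  const wait = lastRequestAt + MIN_INTERVAL_MS - Date.now();
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  lastRequestAt = Date.now();

  const res = await fetch(url);
  // Size protection: reject pages that declare or deliver more than MAX_BYTES.
  const declared = Number(res.headers.get("content-length") ?? 0);
  if (declared > MAX_BYTES) throw new Error("Page exceeds 10MB limit");
  const body = await res.text();
  if (body.length > MAX_BYTES) throw new Error("Page exceeds 10MB limit");
  return body;
}
```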
## MCP Client Configuration
Add this server to your MCP client configuration:
```json
{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
```
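With this configuration, an MCP client launches the server over stdio on demand. To drive the server programmatically instead, a minimal sketch using the MCP TypeScript SDK might look like the following; the client name and version are placeholders.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server the same way the client configuration above does.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@elchika-inc/open-crawler-mcp-server"],
});

const client = new Client(
  { name: "example-client", version: "1.0.0" },
  { capabilities: {} }
);
await client.connect(transport);

// List the tools the server exposes (crawl_page and check_robots).
const { tools } = await client.listTools();
console.log(tools.map((tool) => tool.name));
```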
## Available Tools

### crawl_page

Extracts content from a web page in any of several formats, with automatic robots.txt compliance checking.
**Parameters:**

- `url` (required): Target URL to crawl
- `selector` (optional): CSS selector for specific content extraction
- `format` (optional): Output format: `text`, `markdown`, `xml`, or `json` (default: `text`)
- `text_only` (optional): Legacy parameter for text-only extraction (deprecated; use `format` instead)
**Output Formats:**

- `text`: Clean, plain text content with whitespace normalized
- `markdown`: Well-formatted Markdown with headings, links, images, and lists preserved
- `xml`: Structured XML with separate sections for headings, paragraphs, links, images, and lists
- `json`: Structured JSON object containing categorized content elements (illustrated below)
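As a rough illustration of the categorized structure the `json` format describes, a result might resemble the following. The exact field names and nesting are an assumption here, not taken from the server's documentation:

```json
{
  "headings": [{ "level": 1, "text": "Example Domain" }],
  "paragraphs": ["This domain is for use in illustrative examples in documents."],
  "links": [{ "text": "More information...", "href": "https://www.iana.org/domains/example" }],
  "images": [],
  "lists": []
}
```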
**Examples:**

Basic text extraction:

```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}
```

Markdown extraction with a CSS selector:

```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}
```

Structured JSON extraction:

```json
{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
```
### check_robots

Checks whether a URL is allowed to be crawled according to the site's robots.txt file.

**Parameters:**

- `url` (required): URL to check for crawling permission

**Example:**

```json
{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
```
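Although `crawl_page` checks robots.txt automatically, `check_robots` lets a client validate a URL up front. A sketch of that two-step flow, reusing the SDK client from the configuration section; treating `isError` on the result as the failure signal is an assumption:

```typescript
// Assumes `client` is already connected, as in the earlier SDK sketch.
const robots = await client.callTool({
  name: "check_robots",
  arguments: { url: "https://example.com/page" },
});

// Only crawl if the robots.txt check did not report a failure.
if (!robots.isError) {
  const page = await client.callTool({
    name: "crawl_page",
    arguments: { url: "https://example.com/page", format: "markdown" },
  });
  console.log(page.content);
}
```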
## Error Handling

Common error scenarios (a client-side handling sketch follows this list):
- Network connection issues
- Invalid HTML or missing content
- Robots.txt restrictions
- Request timeouts or rate limits
- Content size too large (>10MB)
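The server's specific error codes are not enumerated here. At the client level, failures generally surface either as a thrown protocol error or as a tool result flagged with `isError`; the sketch below handles both cases under that assumption.

```typescript
// Assumes `client` is already connected, as in the earlier SDK sketch.
try {
  const result = await client.callTool({
    name: "crawl_page",
    arguments: { url: "https://example.com/very-large-page" },
  });
  if (result.isError) {
    // Tool-level failure, e.g. a robots.txt restriction or a page over 10MB.
    console.error("Crawl failed:", result.content);
  }
} catch (err) {
  // Transport- or protocol-level failure, e.g. a network error or timeout.
  console.error("Request error:", err);
}
```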
## License
MIT