Open Crawler MCP Server

A Model Context Protocol (MCP) server for web crawling and content extraction. It returns page content as text, markdown, XML, or JSON.

Features

Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
Smart Content Extraction: CSS selector support for targeted content extraction
Robots.txt Compliance: Automatic robots.txt checking and compliance
Rate Limiting: Built-in rate limiting (1 second minimum between requests)
Size Protection: Maximum page size limit (10MB) to prevent memory issues
Structured Content: Extract headings, paragraphs, links, images, and lists separately
Error Handling: Comprehensive error codes for different failure scenarios
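
The rate-limiting feature above (at least 1 second between requests) can be illustrated with a small sketch. This is not the server's actual implementation; the class name MinIntervalLimiter and its injectable clock and sleep parameters are assumptions made here so the behavior is easy to follow and test.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive requests (illustrative only)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None        # time at which the previous request was released

    def wait(self):
        """Block until min_interval has elapsed since the last request.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling wait() before each outgoing request keeps consecutive requests at least min_interval seconds apart; the first call never sleeps.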

MCP Client Configuration

Add this server to your MCP client configuration:

{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
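
If the client configuration already lists other servers, the entry above has to be merged rather than pasted over the file. A minimal sketch of that merge follows; the function name add_open_crawler and the "other-server" entry are hypothetical, and where the configuration file lives depends on the MCP client.

```python
import json

# The entry from the configuration above, unchanged.
OPEN_CRAWLER_ENTRY = {
    "command": "npx",
    "args": ["@elchika-inc/open-crawler-mcp-server"],
}


def add_open_crawler(config):
    """Return a copy of config with the open-crawler entry merged in.

    Existing entries under "mcpServers" are kept as they are.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["open-crawler"] = dict(OPEN_CRAWLER_ENTRY)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # "other-server" is a made-up entry standing in for an existing configuration.
    existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
    print(json.dumps(add_open_crawler(existing), indent=2))
```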

Available Tools

crawl_page

Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.

Parameters:

url (required): Target URL to crawl
selector (optional): CSS selector for specific content extraction
format (optional): Output format - text, markdown, xml, or json (default: text)
text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)
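
A client-side helper can check these parameters before a call is sent. The sketch below is an assumption-laden illustration: the function name build_crawl_arguments is hypothetical, and mapping text_only=True to format "text" is a guess at the intended migration path, not documented server behavior.

```python
ALLOWED_FORMATS = ("text", "markdown", "xml", "json")


def build_crawl_arguments(url, selector=None, fmt=None, text_only=None):
    """Build the "arguments" object for a crawl_page call.

    fmt corresponds to the "format" parameter; it is renamed here only to
    avoid shadowing Python's built-in format().
    """
    if not url:
        raise ValueError("url is required")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")

    arguments = {"url": url}
    if selector:
        arguments["selector"] = selector
    # Assumption: the deprecated text_only flag is equivalent to format "text".
    if fmt is None and text_only:
        fmt = "text"
    if fmt is not None:
        arguments["format"] = fmt
    return arguments
```

Leaving fmt unset omits "format" from the arguments, so the server's documented default (text) applies.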

Output Formats:

text: Clean, plain text content with whitespace normalized
markdown: Well-formatted Markdown with headings, links, images, and lists preserved
xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
json: Structured JSON object containing categorized content elements
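
To make "categorized content elements" concrete, here is a rough sketch of sorting a page into headings, paragraphs, links, images, and lists using Python's standard html.parser. The key names and the shape of each entry are assumptions for illustration; the server's real JSON schema is not specified in this document and may differ.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructuredExtractor(HTMLParser):
    """Collect page content into separate categories (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.content = {"headings": [], "paragraphs": [], "links": [], "images": [], "lists": []}
        self._tag = None   # heading, p, or li element whose text is being collected
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADING_TAGS or tag in ("p", "li"):
            self._tag, self._buf = tag, []
        elif tag in ("ul", "ol"):
            self.content["lists"].append([])   # items are appended as each li closes
        elif tag == "a" and attrs.get("href"):
            self.content["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.content["images"].append({"src": attrs["src"], "alt": attrs.get("alt") or ""})

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())   # normalize whitespace
        if tag in HEADING_TAGS:
            self.content["headings"].append({"level": int(tag[1]), "text": text})
        elif tag == "p":
            self.content["paragraphs"].append(text)
        elif tag == "li" and self.content["lists"]:
            self.content["lists"][-1].append(text)
        self._tag = None


def extract_structured(html):
    """Parse an HTML string and return the categorized content dictionary."""
    parser = StructuredExtractor()
    parser.feed(html)
    parser.close()
    return parser.content
```

The sketch does not handle nested lists or block elements nested inside paragraphs; it is meant only to show the idea of separating element types.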

Examples:

Basic text extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}

Markdown extraction with CSS selector:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}

Structured JSON extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
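
MCP clients normally send these payloads for you. For anyone talking to the server directly, MCP carries tool calls as JSON-RPC 2.0 requests with the method "tools/call", and the name/arguments object from the examples above becomes the request's params. The helper name tools_call_request below is hypothetical.

```python
import json


def tools_call_request(name, arguments, request_id=1):
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 "tools/call" request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    request = tools_call_request(
        "crawl_page",
        {"url": "https://example.com", "selector": "article", "format": "markdown"},
    )
    print(json.dumps(request, indent=2))
```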

check_robots

Checks whether a URL may be crawled according to the site's robots.txt file.

Parameters:

url (required): URL to check for crawling permission

Example:

{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
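
The kind of check this tool performs can be sketched with Python's standard urllib.robotparser. This shows the general technique, not the server's own code: the helper name is_allowed and the sample robots.txt rules are made up, and which user agent the server matches against is not stated here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules for illustration.
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/page"))
    print(is_allowed(rules, "https://example.com/private/data"))
```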

Error Handling

Common error scenarios:

Network connection issues
Invalid HTML or missing content
Robots.txt restrictions
Request timeouts or rate limits
Content size too large (>10MB)
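
The size limit in the last item can be enforced while a response is still being read, so an oversized page never sits fully in memory. The sketch below is a generic illustration under stated assumptions: read_limited and PageTooLargeError are hypothetical names, and 10MB is taken here as 10 * 1024 * 1024 bytes, which the document does not specify.

```python
import io

MAX_PAGE_BYTES = 10 * 1024 * 1024   # assumption: "10MB" read as 10 MiB


class PageTooLargeError(Exception):
    """Raised when a response body exceeds the configured size limit."""


def read_limited(stream, max_bytes=MAX_PAGE_BYTES, chunk_size=64 * 1024):
    """Read a binary stream in chunks, aborting once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise PageTooLargeError(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    body = read_limited(io.BytesIO(b"<html>small page</html>"))
    print(len(body))
```

Any object with a read(n) method that returns bytes works as the stream, including an HTTP response opened with urllib.request.urlopen.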

License

MIT
