Open Crawler MCP Server

A Model Context Protocol (MCP) server for web crawling and content extraction, supporting multiple output formats including plain text, Markdown, XML, and JSON.

Features

  • Multiple Output Formats: Extract content as text, markdown, structured XML, or JSON
  • Smart Content Extraction: CSS selector support for targeted content extraction
  • Robots.txt Compliance: Automatic robots.txt checking and compliance
  • Rate Limiting: Built-in rate limiting (1 second minimum between requests); see the sketch after this list
  • Size Protection: Maximum page size limit (10MB) to prevent memory issues
  • Structured Content: Extract headings, paragraphs, links, images, and lists separately
  • Error Handling: Comprehensive error codes for different failure scenarios
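
The rate limiting mentioned above can be pictured as a small gate in front of every outgoing request. The TypeScript sketch below illustrates the general technique under the stated one-second minimum; it is an illustration, not the server's actual implementation.

// Illustrative only: enforces a minimum delay between consecutive requests.
class RateLimiter {
  private lastRequest = 0;

  constructor(private readonly minIntervalMs: number = 1000) {}

  // Resolves once at least minIntervalMs has elapsed since the previous request.
  async wait(): Promise<void> {
    const elapsed = Date.now() - this.lastRequest;
    if (elapsed < this.minIntervalMs) {
      await new Promise((resolve) => setTimeout(resolve, this.minIntervalMs - elapsed));
    }
    this.lastRequest = Date.now();
  }
}

// Usage: call limiter.wait() before each fetch.
const limiter = new RateLimiter(1000);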

MCP Client Configuration

Add this server to your MCP client configuration:

{
  "mcpServers": {
    "open-crawler": {
      "command": "npx",
      "args": ["@elchika-inc/open-crawler-mcp-server"]
    }
  }
}
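
Besides editing a client's configuration file, you can also connect to the server programmatically. The sketch below assumes the official TypeScript MCP SDK (@modelcontextprotocol/sdk); class names and call signatures may vary between SDK versions, so treat it as a starting point rather than a verbatim recipe.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio, the same way the configuration above does.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["@elchika-inc/open-crawler-mcp-server"],
});

const client = new Client({ name: "example-client", version: "1.0.0" }, { capabilities: {} });
await client.connect(transport);

// Invoke the crawl_page tool (described in the next section).
const result = await client.callTool({
  name: "crawl_page",
  arguments: { url: "https://example.com", format: "markdown" },
});
console.log(result);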

Available Tools

crawl_page

Extracts content from a web page in multiple formats with automatic robots.txt compliance checking.

Parameters:

  • url (required): Target URL to crawl
  • selector (optional): CSS selector for specific content extraction
  • format (optional): Output format - text, markdown, xml, or json (default: text)
  • text_only (optional): Legacy parameter for text-only extraction (deprecated, use format instead)

Output Formats:

  • text: Clean, plain text content with whitespace normalized
  • markdown: Well-formatted Markdown with headings, links, images, and lists preserved
  • xml: Structured XML with separate sections for headings, paragraphs, links, images, and lists
  • json: Structured JSON object containing categorized content elements

Examples:

Basic text extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "text"
  }
}

Markdown extraction with CSS selector:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "selector": "article",
    "format": "markdown"
  }
}

Structured JSON extraction:

{
  "name": "crawl_page",
  "arguments": {
    "url": "https://example.com",
    "format": "json"
  }
}
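
The exact shape of the JSON result is defined by the server, but based on the categories listed under Output Formats it can be expected to group content by element type. The TypeScript interface below is a hypothetical illustration of such a shape; the actual field names may differ.

// Hypothetical result shape for format: "json"; field names are assumptions, not a documented schema.
interface CrawlPageJsonResult {
  title?: string;
  headings: string[];
  paragraphs: string[];
  links: { text: string; href: string }[];
  images: { src: string; alt?: string }[];
  lists: string[][];
}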

check_robots

Checks whether a URL is allowed to be crawled according to the site's robots.txt file.

Parameters:

  • url (required): URL to check for crawling permission

Example:

{
  "name": "check_robots",
  "arguments": {
    "url": "https://example.com/page"
  }
}
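
Conceptually, this check amounts to fetching the site's robots.txt and matching the requested path against its rules. The simplified TypeScript sketch below illustrates the idea (it ignores Allow directives, wildcards, and crawl-delay); it is not the server's actual implementation.

// Simplified robots.txt check: fetch the file and test Disallow rules for the matching user-agent.
async function isCrawlAllowed(targetUrl: string, userAgent = "*"): Promise<boolean> {
  const url = new URL(targetUrl);
  const response = await fetch(`${url.origin}/robots.txt`);
  if (!response.ok) return true; // a missing robots.txt is conventionally treated as "allow"

  let applies = false;
  for (const raw of (await response.text()).split("\n")) {
    const line = raw.split("#")[0].trim();
    if (!line) continue;
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      applies = value === "*" || userAgent.toLowerCase().includes(value.toLowerCase());
    } else if (applies && field === "disallow" && value) {
      if (url.pathname.startsWith(value)) return false;
    }
  }
  return true;
}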

Error Handling

Common error scenarios:

  • Network connection issues
  • Invalid HTML or missing content
  • Robots.txt restrictions
  • Request timeouts or rate limits
  • Content size too large (>10MB)
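
In practice, timeouts, rate limits, and transient network failures are worth retrying, while robots.txt denials and oversized pages are not. The TypeScript sketch below shows that pattern; the crawl parameter is a hypothetical stand-in for however your MCP client actually invokes crawl_page, and the error-message matching is an assumption rather than documented behavior.

// Hypothetical wrapper: retry transient failures with linear backoff, fail fast on permanent ones.
type CrawlPage = (args: { url: string; format?: string }) => Promise<string>;

async function crawlWithRetry(crawl: CrawlPage, url: string, maxAttempts = 3): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await crawl({ url, format: "markdown" });
    } catch (err) {
      lastError = err;
      const message = err instanceof Error ? err.message : String(err);
      // Robots.txt denials and oversized pages will not succeed on retry.
      if (/robots|too large/i.test(message)) throw err;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, attempt * 1000));
      }
    }
  }
  throw lastError;
}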

License

MIT
