TheCrawler
Web scraper exposing 5 MCP tools — crawl, markdown extraction, search-and-crawl, sitemap parsing, and LLM JSON-schema structured extraction. AGPL-3.0.
TheCrawler — AI-ready web scraper with LLM-powered structured extraction
Scrape any URL and get rich structured data, or extract typed JSON via your own LLM in one call. Open source (AGPL-3.0). $0.005 per page.
What makes this different
- LLM-powered extraction: send a JSON Schema, get parsed typed data back. Endpoint-agnostic — point at OpenAI, your own llama.cpp / vLLM / LM Studio / Ollama. You bring the LLM, no vendor lock-in.
- Adaptive crawling: Cheerio first (fast HTTP+parse), auto-fall-back to Playwright when an SPA shell is detected. Saves real cost on static sites — competitors render JS on every page.
- Structured errors:
errorTypeenum (dns | timeout | rate-limit | blocked-bot | js-required | http-4xx | http-5xx | parse | network | unknown) +errorRetryableboolean. Agents branch programmatically — no regex on error strings. - Anti-bot detection: 200 OK responses with Cloudflare/WAF challenge bodies are flagged as
errorType: 'blocked-bot'instead of returning the challenge HTML. - Out-of-box extractors: JSON-LD, microdata, commerce data (price/SKU/rating), forms with field types, 16 analytics trackers detected (GA4, GTM, Meta Pixel, Hotjar, Segment, Mixpanel, etc.), hreflang, pagination, redirect chain. Both Firecrawl and the standard Apify Web Scraper require user-written code for any of these.
- Heading-aware RAG chunking: markdown chunked at h1-h3 boundaries with overlap and per-chunk SHA. Feed straight to a vector DB.
Two modes
Plain crawl (default)
{
"urls": ["https://example.com"],
"extractMarkdown": true,
"rotateUserAgent": true,
"requestRetries": 3
}
Returns rich PageData per URL: title, description, language, canonical URL, robots directives, full text, boilerplate-stripped markdown, links (with internal/external flag), images (with lazy-load src), meta tags, OG/Twitter Card, JSON-LD, microdata, commerce data, forms, analytics-detected, emails, phones, social links, hreflang, pagination, redirect chain, response headers + timing, plus structured errorType + errorRetryable on failure.
LLM-powered extract mode
{
"urls": ["https://shop.example.com/products/123"],
"extractMode": true,
"extractJsonSchema": {
"type": "object",
"properties": {
"productName": { "type": "string" },
"price": { "type": "number" },
"currency": { "type": "string" },
"inStock": { "type": "boolean" }
},
"required": ["productName"]
},
"llmBaseUrl": "https://api.openai.com/v1/chat/completions",
"llmModel": "gpt-4o-mini"
}
Crawls the URL → cleans to markdown → sends (markdown + schema) to your OpenAI-compatible chat-completions endpoint with response_format: { type: 'json_object' } → returns parsed typed data per URL. Supports natural-language extractPrompt instead of/alongside the schema. The actor charges per page like normal; the LLM call cost is whatever your endpoint charges.
Note: extract mode requires a publicly-reachable LLM endpoint. LAN URLs (e.g.
http://192.168.x.x) are not reachable from Apify infrastructure. Use OpenAI, hosted vLLM, or expose your local server via a tunnel.
Set
THECRAWLER_LLM_API_KEYas an Actor environment variable so the LLM key never lands in run inputs (visible in run history).
Reliability features
| Feature | Default | Why |
|---|---|---|
requestRetries | 3 | Transient failures (5xx, network, timeout) auto-retried |
requestTimeoutSecs | 30 | Cap on per-request time |
rotateUserAgent | true | Cycles through 6 real-browser UA strings |
cacheEnabled | false | Opt-in 5-min in-memory LRU per (URL + extract-flags) |
| Anti-bot challenge detection | always on | Flags Cloudflare/WAF challenge bodies as errorType: 'blocked-bot' |
| Adaptive crawl | opt-in | adaptiveCrawling: true tries Cheerio first, escalates to Playwright on SPA detection |
Search → scrape
Top-N Google results crawled in one call. Optional SerpAPI key for reliable search.
{ "searchQuery": "best CRM 2026", "searchLimit": 10, "extractMarkdown": true }
Sitemap → scrape
Sitemap.xml + sitemap-index files resolved automatically.
{ "sitemapUrl": "https://example.com/sitemap.xml", "maxPages": 50 }
File extraction
PDF and DOCX URLs are auto-detected and parsed. Returns extracted text + (for PDFs) metadata, page count.
Pricing
- Crawl mode: $0.005 per page successfully scraped (failed pages don't charge).
- Extract mode: $0.005 per page now; will become $0.02 per page on/after 2026-05-30 (separate event for the higher LLM-inference compute, gated by Apify's pricing-cooldown rules).
Beyond the Apify Store
The same engine ships as the open-source thecrawler npm package — drop into your own Node project, MCP server, CLI, or REST API server. Self-hosted = $0 per call.
# Library
npm install thecrawler
# CLI
thecrawler crawl https://example.com --markdown
thecrawler extract https://example.com --schema '{...}'
# MCP server (Claude Code, Cursor, Windsurf)
npx -p thecrawler thecrawler-mcp
# REST API server
npx -p thecrawler thecrawler-api --port 3000
GitHub: https://github.com/manchittlab/TheCrawler · License: AGPL-3.0
Serveurs connexes
Bright Data
sponsorDiscover, extract, and interact with the web - one interface powering automated access across the public internet.
WebScraping.AI
Interact with WebScraping.AI for web data extraction and scraping.
1001Proxy - Proxy MCP Server for AI Agents
Use Claude, OpenAI Cursor, and any MCP-compatible AI agent to buy and manage proxies using natural language. No custom integrations needed - simply connect your client to the server and start chatting.
Extract Developer & LLM Docs
Extract documentation for AI agents from any site with llms.txt support. Features MCP server, REST API, batch processing, and multiple export formats.
SubDownload
Public MCP wrapper for SubDownload.Fetch YouTube transcripts, search videos, browse channels and playlists — instant YouTube data for your AI workflow.
Crawl4AI RAG
Integrates web crawling and Retrieval-Augmented Generation (RAG) into AI agents and coding assistants.
Google News Trends MCP
Access Google News and Google Trends data without paid APIs.
youtube-summarize
MCP server that fetches YouTube video transcripts and summarizes them using your LLM client
freesound-mcp
A Model Context Protocol (MCP) server that enables AI applications to search and download audio resources from the Freesound platform via natural language commands.
Claimify
Extracts factual claims from text using the Claimify methodology. Requires an OpenAI API key.
Browser Use MCP Server
An MCP server that allows AI agents to control a web browser using the browser-use library.