TheCrawler

Web scraper exposing 5 MCP tools — crawl, markdown extraction, search-and-crawl, sitemap parsing, and LLM JSON-schema structured extraction. AGPL-3.0.

TheCrawler — AI-ready web scraper with LLM-powered structured extraction

Scrape any URL and get rich structured data, or extract typed JSON via your own LLM in one call. Open source (AGPL-3.0). $0.005 per page.

What makes this different

  • LLM-powered extraction: send a JSON Schema, get parsed typed data back. Endpoint-agnostic — point at OpenAI, your own llama.cpp / vLLM / LM Studio / Ollama. You bring the LLM, no vendor lock-in.
  • Adaptive crawling: Cheerio first (fast HTTP + parse), with automatic fallback to Playwright when an SPA shell is detected. Saves real cost on static sites, where competitors render JS on every page.
  • Structured errors: errorType enum (dns | timeout | rate-limit | blocked-bot | js-required | http-4xx | http-5xx | parse | network | unknown) + errorRetryable boolean. Agents branch programmatically — no regex on error strings.
  • Anti-bot detection: 200 OK responses with Cloudflare/WAF challenge bodies are flagged as errorType: 'blocked-bot' instead of returning the challenge HTML.
  • Out-of-box extractors: JSON-LD, microdata, commerce data (price/SKU/rating), forms with field types, 16 analytics trackers detected (GA4, GTM, Meta Pixel, Hotjar, Segment, Mixpanel, etc.), hreflang, pagination, redirect chain. Both Firecrawl and the standard Apify Web Scraper require user-written code for any of these.
  • Heading-aware RAG chunking: markdown chunked at h1-h3 boundaries with overlap and per-chunk SHA. Feed straight to a vector DB.
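The structured error contract above means an agent can route failures with a plain switch instead of regexing messages. A minimal sketch of such a policy, using the documented errorType enum and errorRetryable flag; the decision rules themselves are illustrative, not part of the actor:

```typescript
// The errorType values below are the enum documented above.
type ErrorType =
  | "dns" | "timeout" | "rate-limit" | "blocked-bot" | "js-required"
  | "http-4xx" | "http-5xx" | "parse" | "network" | "unknown";

interface PageError {
  errorType: ErrorType;
  errorRetryable: boolean;
}

type NextAction = "retry" | "retry-with-browser" | "backoff" | "skip";

// Illustrative routing policy: escalate, slow down, or trust the retryable flag.
function nextAction(err: PageError): NextAction {
  if (err.errorType === "js-required") return "retry-with-browser"; // needs a real browser
  if (err.errorType === "rate-limit") return "backoff";             // slow down before retrying
  return err.errorRetryable ? "retry" : "skip";                     // flag decides the rest
}
```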

Two modes

Plain crawl (default)

{
  "urls": ["https://example.com"],
  "extractMarkdown": true,
  "rotateUserAgent": true,
  "requestRetries": 3
}

Returns rich PageData per URL: title, description, language, canonical URL, robots directives, full text, boilerplate-stripped markdown, links (with internal/external flags), images (with lazy-load src), meta tags, OG/Twitter Card, JSON-LD, microdata, commerce data, forms, detected analytics trackers, emails, phones, social links, hreflang, pagination, redirect chain, and response headers with timing, plus structured errorType and errorRetryable on failure.
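One way to run a plain crawl from your own code is Apify's generic run-sync-get-dataset-items REST route with plain fetch, no SDK. A sketch under assumptions: the ACTOR_ID is a placeholder (use the ID from the Store listing), and only the fields documented above are assumed on each result:

```typescript
const ACTOR_ID = "manchittlab~thecrawler"; // placeholder — take the real ID from the Store

// Only fields documented in the listing above; everything else is omitted.
interface PageResult {
  url: string;
  title?: string;
  markdown?: string;
  errorType?: string;
  errorRetryable?: boolean;
}

// Start a run synchronously and get the dataset items back in one call.
async function crawlPages(urls: string[], token: string): Promise<PageResult[]> {
  const res = await fetch(
    `https://api.apify.com/v2/acts/${ACTOR_ID}/run-sync-get-dataset-items?token=${token}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ urls, extractMarkdown: true, requestRetries: 3 }),
    },
  );
  return (await res.json()) as PageResult[];
}

// Helper: split results using the structured error fields (not part of the actor).
function partition(pages: PageResult[]): { ok: PageResult[]; failed: PageResult[] } {
  return {
    ok: pages.filter((p) => !p.errorType),
    failed: pages.filter((p) => Boolean(p.errorType)),
  };
}
```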

LLM-powered extract mode

{
  "urls": ["https://shop.example.com/products/123"],
  "extractMode": true,
  "extractJsonSchema": {
    "type": "object",
    "properties": {
      "productName": { "type": "string" },
      "price": { "type": "number" },
      "currency": { "type": "string" },
      "inStock": { "type": "boolean" }
    },
    "required": ["productName"]
  },
  "llmBaseUrl": "https://api.openai.com/v1/chat/completions",
  "llmModel": "gpt-4o-mini"
}

Crawls the URL → cleans it to markdown → sends (markdown + schema) to your OpenAI-compatible chat-completions endpoint with response_format: { type: 'json_object' } → returns parsed typed data per URL. A natural-language extractPrompt is supported instead of, or alongside, the schema. The Actor charges per page as usual; the LLM call costs whatever your endpoint charges.
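To make the flow concrete, here is a sketch of what such a schema-guided chat-completions call looks like. This illustrates the pipeline described above, not the actor's actual internals: the prompt wording and helper names are assumptions; response_format: { type: 'json_object' } is the standard OpenAI-compatible JSON mode:

```typescript
interface ExtractRequest {
  model: string;
  response_format: { type: "json_object" };
  messages: { role: "system" | "user"; content: string }[];
}

// Build a JSON-mode request: schema in the system prompt, page markdown as the user message.
function buildExtractRequest(
  markdown: string,
  jsonSchema: object,
  model: string,
): ExtractRequest {
  return {
    model,
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "Extract data from the page content as JSON matching this JSON Schema: " +
          JSON.stringify(jsonSchema),
      },
      { role: "user", content: markdown },
    ],
  };
}

// POST to any OpenAI-compatible chat-completions endpoint and parse the JSON reply.
async function extract(baseUrl: string, apiKey: string, req: ExtractRequest): Promise<unknown> {
  const res = await fetch(baseUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(req),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content);
}
```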

Note: extract mode requires a publicly reachable LLM endpoint. LAN URLs (e.g. http://192.168.x.x) are not reachable from Apify infrastructure. Use OpenAI, hosted vLLM, or expose your local server via a tunnel.

Set THECRAWLER_LLM_API_KEY as an Actor environment variable so the LLM API key never appears in run inputs, which are visible in run history.

Reliability features

| Feature | Default | Why |
| --- | --- | --- |
| requestRetries | 3 | Transient failures (5xx, network, timeout) are auto-retried |
| requestTimeoutSecs | 30 | Cap on per-request time |
| rotateUserAgent | true | Cycles through 6 real-browser UA strings |
| cacheEnabled | false | Opt-in 5-minute in-memory LRU keyed by URL + extract flags |
| Anti-bot challenge detection | always on | Flags Cloudflare/WAF challenge bodies as errorType: 'blocked-bot' |
| Adaptive crawl | opt-in | adaptiveCrawling: true tries Cheerio first, escalates to Playwright on SPA detection |
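These knobs combine in a single run input. A sample input using only the field names listed in the table:

```json
{
  "urls": ["https://app.example.com"],
  "adaptiveCrawling": true,
  "cacheEnabled": true,
  "requestTimeoutSecs": 30,
  "requestRetries": 3,
  "rotateUserAgent": true
}
```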

Search → scrape

Top-N Google results crawled in one call. Optional SerpAPI key for reliable search.

{ "searchQuery": "best CRM 2026", "searchLimit": 10, "extractMarkdown": true }

Sitemap → scrape

Sitemap.xml + sitemap-index files resolved automatically.

{ "sitemapUrl": "https://example.com/sitemap.xml", "maxPages": 50 }

File extraction

PDF and DOCX URLs are auto-detected and parsed. Returns extracted text + (for PDFs) metadata, page count.

Pricing

  • Crawl mode: $0.005 per page successfully scraped (failed pages don't charge).
  • Extract mode: $0.005 per page now; will become $0.02 per page on/after 2026-05-30 (separate event for the higher LLM-inference compute, gated by Apify's pricing-cooldown rules).

Beyond the Apify Store

The same engine ships as the open-source thecrawler npm package — drop into your own Node project, MCP server, CLI, or REST API server. Self-hosted = $0 per call.

# Library
npm install thecrawler

# CLI
thecrawler crawl https://example.com --markdown
thecrawler extract https://example.com --schema '{...}'

# MCP server (Claude Code, Cursor, Windsurf)
npx -p thecrawler thecrawler-mcp

# REST API server
npx -p thecrawler thecrawler-api --port 3000
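For the MCP server, a typical client configuration uses the standard mcpServers JSON that Claude Code, Cursor, and Windsurf read; the "thecrawler" key is an arbitrary server name, and the command is the one shown above:

```json
{
  "mcpServers": {
    "thecrawler": {
      "command": "npx",
      "args": ["-p", "thecrawler", "thecrawler-mcp"]
    }
  }
}
```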

GitHub: https://github.com/manchittlab/TheCrawler · License: AGPL-3.0
