search-scrape

Self-hosted Stealth Scraping & Federated Search for AI Agents. A 100% private, free alternative to Firecrawl, Jina Reader, and Tavily. Featuring Universal Anti-bot Bypass, Semantic Research Memory, and copy-paste setup.

🥷 ShadowCrawl MCP


ShadowCrawl is not just a scraper; it's a Cyborg Intelligence Layer. While other APIs fail against Cloudflare, Akamai, and PerimeterX, ShadowCrawl leverages a unique Human-AI Collaboration model to achieve a near-perfect bypass rate on even the most guarded "Boss Level" sites (LinkedIn, Airbnb, Ticketmaster).

🚀 Why ShadowCrawl?

  • 99.99% Bot Bypass: Featuring the "Non-Robot Search" engine. When automation hits a wall, ShadowCrawl bridges the gap with Human-In-The-Loop (HITL) interaction, allowing you to solve CAPTCHAs and login walls manually while the agent continues its work.
  • Total Sovereignty: 100% Private. Self-hosted via Docker. No API keys, no monthly fees, and no third-party data tracking.
  • Agent-Native (MCP): Deeply integrated with Cursor, Claude Desktop, and IDEs via the Model Context Protocol. Your AI agent now has eyes and hands in the real web.
  • Universal Noise Reduction: Advanced Rust-based filtering that collapses "Skeleton Screens" and repeated blocks, delivering clean, semantic Markdown that cuts LLM token costs (a minimal sketch of the idea follows this list).
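
The noise-reduction idea is simple enough to sketch. Below is a minimal, illustrative Rust function that collapses consecutive duplicate lines (the typical footprint of skeleton screens and repeated list placeholders). This is not ShadowCrawl's actual pipeline, just the core idea:

```rust
/// Collapse consecutive duplicate lines before handing Markdown to an LLM.
/// Illustrative sketch only; the real filter is more sophisticated.
fn collapse_repeats(markdown: &str) -> String {
    let mut out: Vec<&str> = Vec::new();
    for line in markdown.lines() {
        // Keep a line only if it differs from the one we just kept.
        if out.last() != Some(&line) {
            out.push(line);
        }
    }
    out.join("\n")
}
```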

💎 The "Nuclear Option": Non-Robot Search (HITL)

Most scrapers try to "act" like a human and fail. ShadowCrawl uses a human when it matters.

non_robot_search is our flagship tool for high-fidelity rendering. It launches a visible, native Brave Browser instance on your machine; an example tool call follows the list below.

  • Manual Intervention: If a site asks for a Login or a Puzzle, you solve it once; the agent scrapes the rest.
  • Brave Integration: Uses your actual browser profiles (cookies/sessions) to look like a legitimate user, not a headless bot.
  • Stealth Cleanup: Automatically strips automation markers (navigator.webdriver, etc.) before extraction.
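
For reference, an agent invokes the tool through a standard MCP tools/call request. The request shape below follows the MCP spec, but the argument names are illustrative: the actual input schema is defined by the server (and can be overridden via tools_metadata.json, described later).

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "non_robot_search",
    "arguments": {
      "url": "https://example.com/protected-page"
    }
  }
}
```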

💥 Shattering the "Unscrapable" (Anti-Bot Bypass)

Most scraping APIs surrender when facing enterprise-grade shields. ShadowCrawl is the Hammer that breaks through. We successfully bypass and extract data from:

  • Cloudflare 🛡️ (Turnstile / Challenge Pages)
  • DataDome 🤖 (Interstitial & Behavioral blocks)
  • Akamai 🏰 (Advanced Bot Manager)
  • PerimeterX / HUMAN 👤
  • Kasada & Shape Security 🔐

The Secret? The Cyborg Approach (HITL). ShadowCrawl doesn't just "imitate" a human; it bridges your real, native Brave/Chrome session into the agent's workflow. If a human can see it, ShadowCrawl can scrape it.


📂 Verified Evidence (Boss-Level Targets)

We don't just claim to bypass; we provide the receipts. All evidence below was captured using non_robot_search with the Safety Kill Switch enabled (2026-02-14).

| Target Site | Protection | Evidence Size | Data Extracted | Status |
|---|---|---|---|---|
| LinkedIn | Cloudflare + Auth | 413KB (📄 JSON · 📝 Snippet) | 60+ job IDs, listings | ✅ |
| Ticketmaster | Cloudflare Turnstile | 1.1MB (📄 JSON · 📝 Snippet) | Tour dates, venues | ✅ |
| Airbnb | DataDome | 1.8MB (📄 JSON · 📝 Snippet) | 1000+ Tokyo listings | ✅ |
| Upwork | reCAPTCHA | 300KB (📄 JSON · 📝 Snippet) | 160K+ job postings | ✅ |
| Amazon | AWS Shield | 814KB (📄 JSON · 📝 Snippet) | RTX 5070 Ti results | ✅ |
| nowsecure.nl | Cloudflare | 168KB (📄 JSON · 📸 Screenshot) | Manual button tested | ✅ |

📖 Full Documentation: See proof/README.md for verification steps, protection analysis, and quality metrics.


🛠 Features at a Glance

| Feature | Description |
|---|---|
| Search & Discovery | Federated search via SearXNG. Finds what Google hides. |
| Deep Crawling | Recursive, bounded crawling to map entire subdomains. |
| Semantic Memory | (Optional) Qdrant integration for long-term research recall. |
| Proxy Master | Native rotation logic for HTTP/SOCKS5 pools. |
| Hydration Scraper | Specialized logic to extract "hidden" JSON data from React/Next.js sites (see the sketch after this table). |
| Universal Janitor | Automatic removal of popups, cookie banners, and overlays. |
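
The Hydration Scraper row deserves a note: Next.js pages embed their full data payload as JSON in a `<script id="__NEXT_DATA__">` tag, so you can often skip the rendered HTML entirely. A minimal Rust sketch of that idea using the scraper and serde_json crates (illustrative only, not ShadowCrawl's internal code):

```rust
use scraper::{Html, Selector};
use serde_json::Value;

/// Pull the embedded Next.js hydration payload out of a page, if present.
/// Illustrative sketch; real pages may need framework-specific handling.
fn extract_next_data(html: &str) -> Option<Value> {
    let doc = Html::parse_document(html);
    let sel = Selector::parse("script#__NEXT_DATA__").ok()?;
    // The tag's text content is the raw JSON payload.
    let raw: String = doc.select(&sel).next()?.text().collect();
    serde_json::from_str(&raw).ok()
}
```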

๐Ÿ† Comparison

| Feature | Firecrawl / Jina | ShadowCrawl |
|---|---|---|
| Cost | Monthly Subscription | $0 (Self-hosted) |
| Privacy | They see your data | 100% Private |
| LinkedIn/Airbnb | Often Blocked | 99.99% Success (via HITL) |
| JS Rendering | Cloud-only | Native Brave / Browserless |
| Memory | None | Semantic Research History |

📦 Quick Start (Bypass in 60 Seconds)

1. The Docker Way (Full Stack)

Docker is the fastest way to bring up the full stack (SearXNG, proxy manager, etc.).

Important: Docker mode cannot use the HITL/GUI renderer (non_robot_search) because containers cannot reliably access your host's native Brave/Chrome window, keyboard hooks, and OS permissions. Use the Native Rust Way below when you want boss-level bypass.

# Clone and Launch
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
docker compose -f docker-compose-local.yml up -d --build
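
Optionally, confirm the stack came up before wiring it into your IDE. (The shadowcrawl service name matches the MCP config further down; other service names depend on the compose file.)

```bash
# List running services and tail the main container's logs
docker compose -f docker-compose-local.yml ps
docker compose -f docker-compose-local.yml logs --tail=20 shadowcrawl
```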

2. The Native Rust Way (Required for non_robot_search / HITL)

For the 99.99% bypass (HITL), you must run natively (tested on macOS; desktop Linux may work but is less battle-tested).

Build the MCP stdio server with the HITL feature enabled:

cd mcp-server
cargo build --release --bin shadowcrawl-mcp --features non_robot_search

This produces the local MCP binary at:

  • mcp-server/target/release/shadowcrawl-mcp

Prereqs (macOS):

  • Install Brave Browser (recommended) or Google Chrome
  • Grant Accessibility permissions (required for the emergency ESC hold-to-abort kill switch)
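
As an optional sanity check: MCP stdio servers speak JSON-RPC over stdin/stdout, so you can hand the freshly built binary a standard initialize request and expect a JSON response back (exact output depends on the server version):

```bash
# Send a standard MCP initialize request to the binary built above
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.0"}}}' \
  | ./mcp-server/target/release/shadowcrawl-mcp
```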

🧩 MCP Integration (Cursor / Claude / VS Code)

ShadowCrawl can run as an MCP server in two modes:

  • Docker MCP server: great for normal scraping/search tools, but cannot do HITL/GUI (non_robot_search).
  • Local MCP server (shadowcrawl-local): required for HITL tools (a visible Brave/Chrome window).

Option A: Docker MCP server (no non_robot_search)

Add this to your MCP config to use the Dockerized server:

{
  "mcpServers": {
    "shadowcrawl": {
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/YOUR_PATH/shadowcrawl/docker-compose-local.yml",
        "exec",
        "-i",
        "-T",
        "shadowcrawl",
        "shadowcrawl-mcp"
      ]
    }
  }
}
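
To verify the wiring outside your IDE, pipe the same initialize request used in the native smoke test above through docker compose exec. The -T flag matters: MCP stdio needs a plain pipe, not a pseudo-TTY.

```bash
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.0"}}}' \
  | docker compose -f /YOUR_PATH/shadowcrawl/docker-compose-local.yml exec -i -T shadowcrawl shadowcrawl-mcp
```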

Option B: Local MCP server (required for non_robot_search)

If you want to use HITL tools like non_robot_search, configure a local MCP server that launches the native binary.

VS Code MCP config example ("servers" format):

{
  "servers": {
    "shadowcrawl-local": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",

        // Optional (only if you run the full stack locally):
        "SEARXNG_URL=http://localhost:8890",
        "BROWSERLESS_URL=http://localhost:3010",
        "BROWSERLESS_TOKEN=mcp_stealth_session",
        "QDRANT_URL=http://localhost:6344",

        // Network + limits:
        "HTTP_TIMEOUT_SECS=30",
        "HTTP_CONNECT_TIMEOUT_SECS=10",
        "OUTBOUND_LIMIT=32",
        "MAX_CONTENT_CHARS=10000",
        "MAX_LINKS=100",

        // Optional (proxy manager):
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",

        // HITL / non_robot_search quality-of-life:
        // "SHADOWCRAWL_NON_ROBOT_AUTO_ALLOW=1",
        // "SHADOWCRAWL_RENDER_PROFILE_DIR=/YOUR_PROFILE_DIR",
        // "CHROME_EXECUTABLE=/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",

        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}
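
If you use the proxy manager, IP_LIST_PATH above points at a plain-text proxy pool. The exact format is defined by the server, so treat this as a hypothetical ip.txt shape (addresses are documentation-range placeholders):

```text
http://user:pass@203.0.113.10:8080
socks5://203.0.113.11:1080
```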

Notes:

  • The user-facing name in this README is non_robot_search (sometimes people mistype this as "non_human_search").
  • For HITL, prefer Brave + a real profile dir (SHADOWCRAWL_RENDER_PROFILE_DIR) so cookies/sessions persist.
  • If you're running via Docker MCP server, HITL tools will either be unavailable or fail (no host GUI).

🧾 Tool Metadata Overrides (tools_metadata.json)

ShadowCrawl supports an optional tools_metadata.json file that lets you override public tool names, titles, descriptions, and input hints exposed to MCP clients.

Why this exists:

  • Different MCP clients (and different teams) prefer different wording and levels of detail.
  • Clear, specific tool descriptions reduce confusion and help agents choose the right tool.
  • You can align tool wording with your organization's acceptable-use / compliance guidelines without changing Rust code.

How it works:

  • If present, the server loads tools_metadata.json from the repo root (or from SHADOWCRAWL_TOOLS_METADATA_PATH).
  • If missing/invalid, ShadowCrawl falls back to built-in safe defaults.
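
As a hypothetical sketch of what an override file might look like (the exact schema is defined by the server; the keys and field names here are illustrative):

```json
{
  "non_robot_search": {
    "title": "Non-Robot Search (HITL)",
    "description": "Opens a visible Brave/Chrome window so a human can clear CAPTCHAs or logins before extraction continues.",
    "input_hints": {
      "url": "Fully qualified URL to render"
    }
  }
}
```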

☕ Acknowledgments & Support

ShadowCrawl is built with ❤️ by a Solo Developer for the open-source community. If this tool helped you bypass a $500/mo API, consider supporting its growth!

  • Found a bug? Open an Issue.
  • Want a feature? Submit a request!
  • Love the project? Star the repo ⭐ or buy me a coffee to fuel more updates!


License: MIT. Free for personal and commercial use.

