search-scrape
Self-hosted Stealth Scraping & Federated Search for AI Agents. A 100% private, free alternative to Firecrawl, Jina Reader, and Tavily, featuring universal anti-bot bypass, semantic research memory, and a copy-paste setup.
🥷 ShadowCrawl MCP
ShadowCrawl is not just a scraper: it's a Cyborg Intelligence Layer. While other APIs fail against Cloudflare, Akamai, and PerimeterX, ShadowCrawl leverages a unique Human-AI Collaboration model to achieve a near-perfect bypass rate on even the most guarded "Boss Level" sites (LinkedIn, Airbnb, Ticketmaster).
Why ShadowCrawl?
- 99.99% Bot Bypass: Featuring the "Non-Robot Search" engine. When automation hits a wall, ShadowCrawl bridges the gap with Human-In-The-Loop (HITL) interaction, allowing you to solve CAPTCHAs and login walls manually while the agent continues its work.
- Total Sovereignty: 100% Private. Self-hosted via Docker. No API keys, no monthly fees, and no third-party data tracking.
- Agent-Native (MCP): Deeply integrated with Cursor, Claude Desktop, and IDEs via the Model Context Protocol. Your AI agent now has eyes and hands in the real web.
- Universal Noise Reduction: Advanced Rust-based filtering that collapses "Skeleton Screens" and repeated content, delivering clean, semantic Markdown that reduces LLM token costs.
The "Nuclear Option": Non-Robot Search (HITL)
Most scrapers try to "act" like a human and fail. ShadowCrawl uses a human when it matters.
`non_robot_search` is our flagship tool for high-fidelity rendering. It launches a visible, native Brave Browser instance on your machine (see the example request after this list).
- Manual Intervention: If a site asks for a Login or a Puzzle, you solve it once; the agent scrapes the rest.
- Brave Integration: Uses your actual browser profiles (cookies/sessions) to look like a legitimate user, not a headless bot.
- Stealth Cleanup: Automatically strips automation markers (`navigator.webdriver`, etc.) before extraction.
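For example, once the local server is connected, an agent triggers the tool with a standard MCP `tools/call` request. A minimal sketch; the `url` argument name is illustrative, so check the server's advertised tool schema for the actual parameters:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "non_robot_search",
    "arguments": {
      "url": "https://www.linkedin.com/jobs/search/?keywords=rust"
    }
  }
}
```

If the target throws a CAPTCHA or login wall, you solve it once in the visible window and the agent receives the extracted content when the run completes.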
Shattering the "Unscrapable" (Anti-Bot Bypass)
Most scraping APIs surrender when facing enterprise-grade shields. ShadowCrawl is the Hammer that breaks through. We successfully bypass and extract data from:
- Cloudflare (Turnstile / Challenge Pages)
- DataDome (Interstitial & Behavioral blocks)
- Akamai (Advanced Bot Manager)
- PerimeterX / HUMAN
- Kasada & Shape Security
The Secret? The Cyborg Approach (HITL). ShadowCrawl doesn't just "imitate" a human; it bridges your real, native Brave/Chrome session into the agent's workflow. If a human can see it, ShadowCrawl can scrape it.
Verified Evidence (Boss-Level Targets)
We don't just claim to bypass; we provide the receipts. All evidence below was captured using `non_robot_search` with the Safety Kill Switch enabled (2026-02-14).
| Target Site | Protection | Evidence Size | Evidence | Data Extracted | Status |
|---|---|---|---|---|---|
| LinkedIn | Cloudflare + Auth | 413KB | JSON · Snippet | 60+ job IDs, listings | ✅ |
| Ticketmaster | Cloudflare Turnstile | 1.1MB | JSON · Snippet | Tour dates, venues | ✅ |
| Airbnb | DataDome | 1.8MB | JSON · Snippet | 1000+ Tokyo listings | ✅ |
| Upwork | reCAPTCHA | 300KB | JSON · Snippet | 160K+ job postings | ✅ |
| Amazon | AWS Shield | 814KB | JSON · Snippet | RTX 5070 Ti results | ✅ |
| nowsecure.nl | Cloudflare | 168KB | JSON · Screenshot | Manual button tested | ✅ |
Full Documentation: See proof/README.md for verification steps, protection analysis, and quality metrics.
Features at a Glance
| Feature | Description |
|---|---|
| Search & Discovery | Federated search via SearXNG. Finds what Google hides. |
| Deep Crawling | Recursive, bounded crawling to map entire subdomains. |
| Semantic Memory | (Optional) Qdrant integration for long-term research recall. |
| Proxy Master | Native rotation logic for HTTP/SOCKS5 pools. |
| Hydration Scraper | Specialized logic to extract "hidden" JSON data from React/Next.js sites (see the sketch below the table). |
| Universal Janitor | Automatic removal of popups, cookie banners, and overlays. |
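To illustrate the Hydration Scraper row: Next.js pages ship their full server-rendered state in a `<script id="__NEXT_DATA__">` tag, so the "hidden" JSON is already present in the raw HTML before any client-side rendering. A sketch of such a payload (the field values are hypothetical; only the `props`/`pageProps`/`page`/`buildId` shape follows the Next.js convention):

```json
{
  "props": {
    "pageProps": {
      "searchResults": [
        { "id": "stay-12345", "title": "Cozy loft in Shibuya", "pricePerNight": 98 }
      ]
    }
  },
  "page": "/s/tokyo/homes",
  "buildId": "example-build-id"
}
```

Pulling this embedded state directly is often cleaner and cheaper than rendering the page and re-parsing the DOM.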
Comparison
| Feature | Firecrawl / Jina | ShadowCrawl |
|---|---|---|
| Cost | Monthly Subscription | $0 (Self-hosted) |
| Privacy | They see your data | 100% Private |
| LinkedIn/Airbnb | Often Blocked | 99.99% Success (via HITL) |
| JS Rendering | Cloud-only | Native Brave / Browserless |
| Memory | None | Semantic Research History |
Quick Start (Bypass in 60 Seconds)
1. The Docker Way (Full Stack)
Docker is the fastest way to bring up the full stack (SearXNG, proxy manager, etc.).
Important: Docker mode cannot use the HITL/GUI renderer (`non_robot_search`) because containers cannot reliably access your host's native Brave/Chrome window, keyboard hooks, and OS permissions.
Use the Native Rust Way below when you want boss-level bypass.
```bash
# Clone and Launch
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
docker compose -f docker-compose-local.yml up -d --build
```
2. The Native Rust Way (Required for `non_robot_search` / HITL)
For the 99.99% bypass (HITL), you must run natively. Tested on macOS; desktop Linux may work but is less battle-tested.
Build the MCP stdio server with the HITL feature enabled:
```bash
cd mcp-server
cargo build --release --bin shadowcrawl-mcp --features non_robot_search
```
This produces the local MCP binary at:
```
mcp-server/target/release/shadowcrawl-mcp
```
Prereqs (macOS):
- Install Brave Browser (recommended) or Google Chrome
- Grant Accessibility permissions (required for the emergency ESC hold-to-abort kill switch)
MCP Integration (Cursor / Claude / VS Code)
ShadowCrawl can run as an MCP server in 2 modes:
- Docker MCP server: great for normal scraping/search tools, but cannot do HITL/GUI (`non_robot_search`).
- Local MCP server (`shadowcrawl-local`): required for HITL tools (a visible Brave/Chrome window).
Option A: Docker MCP server (no `non_robot_search`)
Add this to your MCP config to use the Dockerized server:
```json
{
  "mcpServers": {
    "shadowcrawl": {
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/YOUR_PATH/shadowcrawl/docker-compose-local.yml",
        "exec",
        "-i",
        "-T",
        "shadowcrawl",
        "shadowcrawl-mcp"
      ]
    }
  }
}
```
Option B: Local MCP server (required for `non_robot_search`)
If you want to use HITL tools like `non_robot_search`, configure a local MCP server that launches the native binary.
VS Code MCP config example ("servers" format):
```jsonc
{
  "servers": {
    "shadowcrawl-local": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        // Optional (only if you run the full stack locally):
        "SEARXNG_URL=http://localhost:8890",
        "BROWSERLESS_URL=http://localhost:3010",
        "BROWSERLESS_TOKEN=mcp_stealth_session",
        "QDRANT_URL=http://localhost:6344",
        // Network + limits:
        "HTTP_TIMEOUT_SECS=30",
        "HTTP_CONNECT_TIMEOUT_SECS=10",
        "OUTBOUND_LIMIT=32",
        "MAX_CONTENT_CHARS=10000",
        "MAX_LINKS=100",
        // Optional (proxy manager; see the file sketch below):
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",
        // HITL / non_robot_search quality-of-life:
        // "SHADOWCRAWL_NON_ROBOT_AUTO_ALLOW=1",
        // "SHADOWCRAWL_RENDER_PROFILE_DIR=/YOUR_PROFILE_DIR",
        // "CHROME_EXECUTABLE=/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}
```
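The `IP_LIST_PATH` and `PROXY_SOURCE_PATH` entries above feed the proxy rotation logic. This README doesn't document their schemas, so treat the following as a purely hypothetical sketch: `ip.txt` as one `host:port` proxy per line, and `proxy_source.json` describing named pools. Every field name below is an assumption, not the server's actual format:

```json
{
  "pools": [
    {
      "name": "http-pool",
      "protocol": "http",
      "proxies": ["203.0.113.10:8080", "203.0.113.11:8080"]
    },
    {
      "name": "socks5-pool",
      "protocol": "socks5",
      "proxies": ["198.51.100.7:1080"]
    }
  ]
}
```

Check the repo's proxy manager documentation for the real format before relying on this shape.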
Notes:
- The user-facing tool name in this README is `non_robot_search` (sometimes mistyped as "non_human_search").
- For HITL, prefer Brave + a real profile dir (`SHADOWCRAWL_RENDER_PROFILE_DIR`) so cookies/sessions persist.
- If you're running via the Docker MCP server, HITL tools will either be unavailable or fail (no host GUI).
Tool Metadata Overrides (`tools_metadata.json`)
ShadowCrawl supports an optional `tools_metadata.json` file that lets you override public tool names, titles, descriptions, and input hints exposed to MCP clients.
Why this exists:
- Different MCP clients (and different teams) prefer different wording and levels of detail.
- Clear, specific tool descriptions reduce confusion and help agents choose the right tool.
- You can align tool wording with your organization's acceptable-use / compliance guidelines without changing Rust code.
How it works:
- If present, the server loads `tools_metadata.json` from the repo root (or from `SHADOWCRAWL_TOOLS_METADATA_PATH`).
- If missing/invalid, ShadowCrawl falls back to built-in safe defaults.
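A minimal sketch of what an override file could look like. The actual schema isn't shown in this README, so the shape below (top-level keys per tool, each with `title` and `description` overrides) is an assumption; consult the repo for the real format:

```json
{
  "non_robot_search": {
    "title": "Human-Assisted Browser Scrape",
    "description": "Opens a visible Brave window; a human can clear CAPTCHAs or login walls before extraction continues."
  }
}
```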
Acknowledgments & Support
ShadowCrawl is built with ❤️ by a Solo Developer for the open-source community. If this tool helped you bypass a $500/mo API, consider supporting its growth!
- Found a bug? Open an Issue.
- Want a feature? Submit a request!
- Love the project? Star the repo ⭐ or buy me a coffee to fuel more updates!
License: MIT. Free for personal and commercial use.