Crawl4AI MCP Server
Claude AI için web kazıma becerisi. Web sitelerini tarar, CSS/LLM stratejileriyle yapılandırılmış veri çıkarır, dinamik JavaScript içeriğini işler. Tam SDK referansı, örnek betikler ve testlerle birlikte crawl4ai üzerine inşa edilmiştir.
Dokümantasyon
Crawl4AI Agent Skill
Scrape JavaScript-heavy sites and extract structured data via reusable CSS schemas. A portable agent skill that wraps the Crawl4AI CLI and Python SDK, written in the Anthropic SKILL.md format and consumable by any agent host that loads SKILL.md-format bundles (Claude Code, Codex, Cursor, OpenCode, Cline, and others).
Verified against Crawl4AI library version 0.8.9 (pinned in VERSION).
Features
- JS-aware crawling: full headless-browser rendering with
wait_until=networkidledefaults - Schema-based extraction: derive a CSS selector schema once via LLM, apply it forever with no further LLM cost
- LLM extraction: per-request structured extraction when a schema is not worth deriving
- Content filtering: BM25 relevance filter and quality-based pruning, plain markdown or markdown-fit output
- Concurrent batch crawling: multi-URL processing with per-job concurrency caps
- Session management: persistent sessions for authenticated, multi-step flows
- CLI and SDK: both the
crwlcommand-line tool and thecrawl4aiPython SDK
Installation
Clone the repo into the skills directory your agent host loads from:
# Claude Code
git clone https://github.com/brettdavies/crawl4ai-skill.git ~/.claude/skills/crawl4ai
For other agent hosts (Codex, Cursor, OpenCode, Cline, custom agents), clone into whichever directory your host scans
for SKILL.md-format bundles. Refer to your host's documentation for the skills directory location. The bundle root
contains SKILL.md, so the skill registers automatically once the directory is on the host's skills search path.
Prerequisites
The skill calls into the Crawl4AI Python library, which must be installed in the runtime your agent uses:
pip install crawl4ai
crawl4ai-setup
crawl4ai-doctor
crawl4ai-doctor validates the install and confirms a headless browser is available.
Quick start
CLI:
crwl https://example.com -c "wait_until=networkidle,page_timeout=60000" -o markdown
crwl https://example.com -o json -v --bypass-cache
Python SDK:
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:500])
asyncio.run(main())
Bundle layout
| Path | Contents |
|---|---|
SKILL.md | Entry point: trigger conditions, defaults, routing to specialized pipelines |
references/ | Nine reference guides for CLI, SDK, extraction, filtering, anti-detection, URL discovery, escalation |
scripts/ | Six PEP 723 helper scripts for crawl / extract / batch workflows |
templates/ | Reusable YAML/JSON templates for browser, crawler, filters, and extraction strategies |
evals/ | Four eval scenarios for verifying skill behavior end-to-end |
fixtures/ | Schema-generation reference fixture (sample HTML, expected schema, expected JSON output) |
tests/ | Pytest suite covering basic crawling, markdown generation, extraction, advanced patterns, and fixtures |
VERSION | Pinned Crawl4AI library version the skill is verified against |
LICENSE-APACHE, LICENSE-MIT, LICENSE | Dual license texts and summary (SPDX MIT OR Apache-2.0) |
Documentation
- SKILL.md: complete skill documentation with examples
- CLI Guide: command-line interface reference
- SDK Guide: Python SDK quick reference
- Complete SDK Reference: full API documentation (5900+ lines)
- Recipes: end-to-end task recipes (login flow, sitemap crawl, paginated extraction)
- Content Filters: BM25 vs pruning vs LLMContentFilter trade-offs
- URL Discovery: sitemap, robots.txt, link-graph traversal
- Anti-Detection: init scripts, proxy config, undetected mode, CDP attachment
- Troubleshooting: symptoms, causes, fixes
- Escalation: lookup order, halt-vs-continue criteria, worked examples
Common use cases
Documentation to markdown
crwl https://docs.example.com -o markdown > docs.md
E-commerce product monitoring
# Derive the schema once (uses LLM)
./scripts/generate_schema.py https://shop.example.com "products with name, price, image" shop_schema.json
# Apply the saved schema (no LLM cost per request)
./scripts/extract_with_schema.py https://shop.example.com shop_schema.json products.json
News aggregation with relevance filtering
for url in news1.com news2.com news3.com; do
crwl "https://$url" -f templates/filter_bm25.yml -o markdown-fit
done
Scripts
| Script | Purpose |
|---|---|
scripts/basic_crawler.py <url> | One URL → markdown + screenshot |
scripts/batch_crawl.py <urls.txt> | Many URLs → markdown files |
scripts/batch_extract.py <urls.txt> <schema.json> | Many URLs + schema → JSON |
scripts/generate_schema.py <url> "<instruction>" | Derive a reusable CSS schema (one-time LLM call) |
scripts/extract_with_schema.py <url> <schema.json> | Apply a saved schema (no LLM) |
scripts/extract_with_llm.py <url> "<instruction>" | Per-request LLM extraction (expensive; one-off only) |
Testing
cd tests
python run_all_tests.py
License
Dual-licensed under Apache License 2.0 (LICENSE-APACHE) or MIT License (LICENSE-MIT) at
your option. SPDX identifier: MIT OR Apache-2.0. See LICENSE for the full notice.
Contributing
Contributions welcome. Open a pull request.
Changelog
See CHANGELOG.md.