Crawl4AI MCP Server

Kemampuan scraping web untuk Claude AI. Merayapi situs web, mengekstrak data terstruktur dengan strategi CSS/LLM, menangani konten JavaScript dinamis. Dibangun di atas crawl4ai dengan referensi SDK lengkap, skrip contoh, dan pengujian.

Dokumentasi

Crawl4AI Agent Skill

Scrape JavaScript-heavy sites and extract structured data via reusable CSS schemas. A portable agent skill that wraps the Crawl4AI CLI and Python SDK, written in the Anthropic SKILL.md format and consumable by any agent host that loads SKILL.md-format bundles (Claude Code, Codex, Cursor, OpenCode, Cline, and others).

Verified against Crawl4AI library version 0.8.9 (pinned in VERSION).

Features

  • JS-aware crawling: full headless-browser rendering with wait_until=networkidle defaults
  • Schema-based extraction: derive a CSS selector schema once via LLM, apply it forever with no further LLM cost
  • LLM extraction: per-request structured extraction when a schema is not worth deriving
  • Content filtering: BM25 relevance filter and quality-based pruning, plain markdown or markdown-fit output
  • Concurrent batch crawling: multi-URL processing with per-job concurrency caps
  • Session management: persistent sessions for authenticated, multi-step flows
  • CLI and SDK: both the crwl command-line tool and the crawl4ai Python SDK

Installation

Clone the repo into the skills directory your agent host loads from:

# Claude Code
git clone https://github.com/brettdavies/crawl4ai-skill.git ~/.claude/skills/crawl4ai

For other agent hosts (Codex, Cursor, OpenCode, Cline, custom agents), clone into whichever directory your host scans for SKILL.md-format bundles. Refer to your host's documentation for the skills directory location. The bundle root contains SKILL.md, so the skill registers automatically once the directory is on the host's skills search path.

Prerequisites

The skill calls into the Crawl4AI Python library, which must be installed in the runtime your agent uses:

pip install crawl4ai
crawl4ai-setup
crawl4ai-doctor

crawl4ai-doctor validates the install and confirms a headless browser is available.

Quick start

CLI:

crwl https://example.com -c "wait_until=networkidle,page_timeout=60000" -o markdown
crwl https://example.com -o json -v --bypass-cache

Python SDK:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())

Bundle layout

PathContents
SKILL.mdEntry point: trigger conditions, defaults, routing to specialized pipelines
references/Nine reference guides for CLI, SDK, extraction, filtering, anti-detection, URL discovery, escalation
scripts/Six PEP 723 helper scripts for crawl / extract / batch workflows
templates/Reusable YAML/JSON templates for browser, crawler, filters, and extraction strategies
evals/Four eval scenarios for verifying skill behavior end-to-end
fixtures/Schema-generation reference fixture (sample HTML, expected schema, expected JSON output)
tests/Pytest suite covering basic crawling, markdown generation, extraction, advanced patterns, and fixtures
VERSIONPinned Crawl4AI library version the skill is verified against
LICENSE-APACHE, LICENSE-MIT, LICENSEDual license texts and summary (SPDX MIT OR Apache-2.0)

Documentation

Common use cases

Documentation to markdown

crwl https://docs.example.com -o markdown > docs.md

E-commerce product monitoring

# Derive the schema once (uses LLM)
./scripts/generate_schema.py https://shop.example.com "products with name, price, image" shop_schema.json

# Apply the saved schema (no LLM cost per request)
./scripts/extract_with_schema.py https://shop.example.com shop_schema.json products.json

News aggregation with relevance filtering

for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f templates/filter_bm25.yml -o markdown-fit
done

Scripts

ScriptPurpose
scripts/basic_crawler.py <url>One URL → markdown + screenshot
scripts/batch_crawl.py <urls.txt>Many URLs → markdown files
scripts/batch_extract.py <urls.txt> <schema.json>Many URLs + schema → JSON
scripts/generate_schema.py <url> "<instruction>"Derive a reusable CSS schema (one-time LLM call)
scripts/extract_with_schema.py <url> <schema.json>Apply a saved schema (no LLM)
scripts/extract_with_llm.py <url> "<instruction>"Per-request LLM extraction (expensive; one-off only)

Testing

cd tests
python run_all_tests.py

License

Dual-licensed under Apache License 2.0 (LICENSE-APACHE) or MIT License (LICENSE-MIT) at your option. SPDX identifier: MIT OR Apache-2.0. See LICENSE for the full notice.

Contributing

Contributions welcome. Open a pull request.

Changelog

See CHANGELOG.md.