tavily-crawl

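Author: tavily-ai

Multi-page website crawler with semantic filtering and Markdown export. Crawl entire site sections with depth and breadth controls; filter by path regex, domain, or natural-language instructions to focus the results. Save each page as a local Markdown file with --output-dir, or return structured JSON for agent processing. Use semantic instructions with chunk extraction to avoid context bloat when feeding results to an LLM; use full-page extraction for offline documentation downloads. Supports...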

npx skills add https://github.com/tavily-ai/skills --skill tavily-crawl

tavily crawl

Crawl a website and extract content from multiple pages. Supports saving each page as a local markdown file.

Before running any command

If tvly is not found on PATH, install it first:

curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login

Do not skip this step or fall back to other tools.

See tavily-cli for alternative install methods and auth options.
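
The PATH check described above can be scripted so that an agent or CI job fails fast when the CLI is missing. This is a minimal sketch for a POSIX-style shell: the function name `ensure_tvly` is hypothetical, and the install command is the one quoted above, printed as a hint rather than run automatically.

```shell
# Hypothetical helper: succeed if `tvly` is on PATH; otherwise print the
# install command quoted in this document and return non-zero.
ensure_tvly() {
  if command -v tvly >/dev/null 2>&1; then
    return 0
  fi
  echo "tvly not found on PATH. Install it with:" >&2
  echo "  curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login" >&2
  return 1
}
```

A typical call would be `ensure_tvly && tvly crawl "https://docs.example.com" --json`, so the crawl only runs once the CLI is available.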

When to use

You need content from many pages on a site (e.g., all /docs/)
You want to download documentation for offline use
Step 4 in the workflow: search → extract → map → crawl → research

Quick start

# Basic crawl
tvly crawl "https://docs.example.com" --json

# Save each page as a markdown file
tvly crawl "https://docs.example.com" --output-dir ./docs/

# Deeper crawl with limits
tvly crawl "https://docs.example.com" --max-depth 2 --limit 50 --json

# Filter to specific paths
tvly crawl "https://example.com" --select-paths "/api/.*,/guides/.*" --exclude-paths "/blog/.*" --json

# Semantic focus (returns relevant chunks, not full pages)
tvly crawl "https://docs.example.com" --instructions "Find authentication docs" --chunks-per-source 3 --json
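
Before spending crawl credits, it can help to preview locally which URL paths a pair of --select-paths / --exclude-paths values would keep. The sketch below is not part of the CLI. The helper name `path_selected` is hypothetical, and it assumes each pattern must match the whole path and that exclusions take precedence; the source states neither, so treat the result as an approximation.

```shell
# Hypothetical helper: return 0 if PATH matches a select pattern and no
# exclude pattern. Usage: path_selected PATH SELECT_CSV EXCLUDE_CSV
# Assumption: whole-path matching with POSIX extended regexes.
path_selected() (
  path=$1
  set -f    # disable globbing so patterns such as /api/.* are not expanded
  IFS=','   # split the comma-separated pattern lists
  for pat in $3; do
    printf '%s\n' "$path" | grep -Eq "^(${pat})\$" && exit 1
  done
  for pat in $2; do
    printf '%s\n' "$path" | grep -Eq "^(${pat})\$" && exit 0
  done
  exit 1
)
```

For example, `path_selected "/api/users" "/api/.*,/guides/.*" "/blog/.*"` succeeds, while the same call with `/blog/post` or `/pricing` fails.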

Options

Option	Description
`--max-depth`	Levels deep (1-5, default: 1)
`--max-breadth`	Links per page (default: 20)
`--limit`	Total pages cap (default: 50)
`--instructions`	Natural language guidance for semantic focus
`--chunks-per-source`	Chunks per page (1-5, requires `--instructions`)
`--extract-depth`	`basic` (default) or `advanced`
`--format`	`markdown` (default) or `text`
`--select-paths`	Comma-separated regex patterns to include
`--exclude-paths`	Comma-separated regex patterns to exclude
`--select-domains`	Comma-separated regex for domains to include
`--exclude-domains`	Comma-separated regex for domains to exclude
`--allow-external / --no-external`	Include external links (default: allow)
`--include-images`	Include images
`--timeout`	Max wait (10-150 seconds)
`-o, --output`	Save JSON output to file
`--output-dir`	Save each page as a .md file in directory
`--json`	Structured JSON output
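
The numeric ranges in the table above can be checked before a command is assembled, which is useful when an agent generates the values. This is a minimal sketch: `in_range` and `check_crawl_opts` are hypothetical helper names, and only the ranges themselves come from the table.

```shell
# Hypothetical helper: succeed only if $1 is a non-negative integer in [$2, $3].
in_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Validate values against the ranges listed in the options table.
# $1 = --max-depth, $2 = --chunks-per-source, $3 = --timeout (seconds)
check_crawl_opts() {
  in_range "$1" 1 5    || { echo "--max-depth must be 1-5" >&2; return 1; }
  in_range "$2" 1 5    || { echo "--chunks-per-source must be 1-5" >&2; return 1; }
  in_range "$3" 10 150 || { echo "--timeout must be 10-150 seconds" >&2; return 1; }
}
```

For example, `check_crawl_opts 2 3 60` succeeds, whereas `check_crawl_opts 6 3 60` prints an error about --max-depth and returns non-zero.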

Crawl for context vs. data collection

For agentic use (feeding results to an LLM):

Always combine --instructions with --chunks-per-source. The crawl then returns only the relevant chunks instead of full pages, which prevents context explosion.

tvly crawl "https://docs.example.com" --instructions "API authentication" --chunks-per-source 3 --json

For data collection (saving to files):

Use --output-dir without --chunks-per-source to get full pages as markdown files.

tvly crawl "https://docs.example.com" --max-depth 2 --output-dir ./docs/
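
The two modes above differ only in which flags are passed, so a small wrapper can make the choice explicit. The sketch below prints the command instead of running it, so it can be reviewed first. The function name `crawl_cmd` and its mode names `context` and `collect` are hypothetical; the flags and values are the ones used in the two examples above.

```shell
# Hypothetical wrapper: print the crawl command for a mode without running it.
#   crawl_cmd context URL "INSTRUCTIONS"  -> relevant chunks as JSON, for an LLM
#   crawl_cmd collect URL OUTPUT_DIR      -> full pages saved as Markdown files
crawl_cmd() {
  case "$1" in
    context)
      printf 'tvly crawl "%s" --instructions "%s" --chunks-per-source 3 --json\n' "$2" "$3" ;;
    collect)
      printf 'tvly crawl "%s" --max-depth 2 --output-dir %s\n' "$2" "$3" ;;
    *)
      echo "usage: crawl_cmd context|collect URL ARG" >&2
      return 1 ;;
  esac
}
```

The quoting here is deliberately naive, so the printed line is meant for inspection or copy-paste rather than for piping untrusted input into a shell.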

Tips

Start conservative (--max-depth 1, --limit 20) and scale up.
Use --select-paths to focus on the section you need.
Use map first to understand site structure before a full crawl.
Always set --limit to prevent runaway crawls.
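
Taken together, the tips above suggest a staged plan: map, then a small bounded crawl, then a larger one. The sketch below only prints such a plan, one command per line. The helper name `crawl_plan` and the default section pattern are hypothetical, and the exact `tvly map` invocation is an assumption, since this page names the map step but does not show its syntax.

```shell
# Hypothetical helper: print a conservative crawl plan for a URL.
# Usage: crawl_plan URL [SELECT_PATHS_REGEX]
crawl_plan() (
  url=$1
  section=${2:-/docs/.*}   # illustrative default for --select-paths
  # 1. Map the site first (assumed invocation; check the tavily-cli docs).
  printf 'tvly map "%s" --json\n' "$url"
  # 2. Small, bounded first crawl focused on one section.
  printf 'tvly crawl "%s" --max-depth 1 --limit 20 --select-paths "%s" --json\n' "$url" "$section"
  # 3. Scale up only after reviewing the first result; --limit stays set.
  printf 'tvly crawl "%s" --max-depth 2 --limit 50 --select-paths "%s" --json\n' "$url" "$section"
)
```

Running `crawl_plan "https://docs.example.com"` prints three commands to run in order; every crawl in the plan carries an explicit --limit.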
