tavily-extract

Author: tavily-ai

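Extract clean Markdown or text from up to 20 URLs, with support for JavaScript rendering and query-focused chunking. Handles JavaScript-rendered pages with a configurable extraction depth (basic mode for simple pages, advanced mode for dynamic SPAs and tables). Supports query-focused extraction, which returns only the relevant content chunks instead of full pages. Returns LLM-optimized Markdown by default, with plain text and structured JSON output also available. Processes up to 20 URLs per call; ...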

npx skills add https://github.com/tavily-ai/skills --skill tavily-extract


tavily extract

Extract clean markdown or text content from one or more URLs.

Before running any command

If tvly is not found on PATH, install it first:

curl -fsSL https://cli.tavily.com/install.sh | bash && tvly login

Do not skip this step or fall back to other tools.

See tavily-cli for alternative install methods and auth options.

When to use

You have a specific URL and want its content
You need text from JavaScript-rendered pages
Step 2 in the workflow: search → extract → map → crawl → research

Quick start

# Single URL
tvly extract "https://example.com/article" --json

# Multiple URLs
tvly extract "https://example.com/page1" "https://example.com/page2" --json

# Query-focused extraction (returns relevant chunks only)
tvly extract "https://example.com/docs" --query "authentication API" --chunks-per-source 3 --json

# JS-heavy pages
tvly extract "https://app.example.com" --extract-depth advanced --json

# Save to file
tvly extract "https://example.com/article" -o article.md

Options

Option	Description
`--query`	Rerank chunks by relevance to this query
`--chunks-per-source`	Chunks per URL (1-5, requires `--query`)
`--extract-depth`	`basic` (default) or `advanced` (for JS pages)
`--format`	`markdown` (default) or `text`
`--include-images`	Include image URLs
`--timeout`	Max wait time (1-60 seconds)
`-o, --output`	Save output to file
`--json`	Structured JSON output
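
The constraints in the table above (a chunk count of 1-5 that requires `--query`, and a timeout of 1-60 seconds) can be checked before a call is made. The following is a minimal sketch, not part of the tvly CLI: `build_extract_args` is a hypothetical helper, and it assumes the timeout is given in whole seconds. It prints one argument per line and returns non-zero on an invalid combination.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of tvly): validate option values against the
# ranges in the Options table and print the resulting arguments, one per line.
# Usage: build_extract_args URL [QUERY [CHUNKS [TIMEOUT]]]
build_extract_args() {
  local url="${1:-}" query="${2:-}" chunks="${3:-}" timeout="${4:-}"

  if [ -z "$url" ]; then
    echo "error: a URL is required" >&2
    return 1
  fi
  # --chunks-per-source is only valid together with --query.
  if [ -n "$chunks" ] && [ -z "$query" ]; then
    echo "error: --chunks-per-source requires --query" >&2
    return 1
  fi
  if [ -n "$chunks" ]; then
    case "$chunks" in
      [1-5]) ;;
      *) echo "error: --chunks-per-source must be 1-5" >&2; return 1 ;;
    esac
  fi
  if [ -n "$timeout" ]; then
    # Assumption: whole seconds only; reject anything that is not all digits.
    case "$timeout" in
      *[!0-9]*) echo "error: --timeout must be an integer" >&2; return 1 ;;
    esac
    if [ "$timeout" -lt 1 ] || [ "$timeout" -gt 60 ]; then
      echo "error: --timeout must be 1-60" >&2
      return 1
    fi
  fi

  printf '%s\n' extract "$url"
  if [ -n "$query" ]; then printf '%s\n' --query "$query"; fi
  if [ -n "$chunks" ]; then printf '%s\n' --chunks-per-source "$chunks"; fi
  if [ -n "$timeout" ]; then printf '%s\n' --timeout "$timeout"; fi
  printf '%s\n' --json
}
```

Calling `build_extract_args "https://example.com/docs" "authentication API" 3 30` prints the argument list for a query-focused call, while passing a chunk count with an empty query is rejected before anything reaches the network.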

Extract depth

Depth	When to use
`basic`	Simple pages, fast — try this first
`advanced`	JS-rendered SPAs, dynamic content, tables
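
The choice between the two depths can be automated by trying `basic` and retrying with `advanced` only when needed. This is a sketch under stated assumptions rather than a built-in tvly feature: `extract_with_fallback` and the `TVLY_BIN` override are hypothetical names, and "content is missing" is approximated as "the command failed or wrote an empty file", which may be too loose for pages that return only a shell of markup.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of tvly): try basic depth first, then retry
# with advanced depth if the attempt fails or produces an empty file.
# TVLY_BIN is an assumed override hook so the command can be stubbed.
# Usage: extract_with_fallback URL OUTPUT_FILE
extract_with_fallback() {
  local url="$1" out="$2" tvly="${TVLY_BIN:-tvly}" depth

  for depth in basic advanced; do
    # Remove any earlier result so a stale file cannot pass the check below.
    rm -f -- "$out"
    # -s is true only if the file exists and is non-empty.
    if "$tvly" extract "$url" --extract-depth "$depth" -o "$out" && [ -s "$out" ]; then
      echo "$depth"   # report which depth produced the content
      return 0
    fi
  done
  return 1
}
```

For example, `extract_with_fallback "https://app.example.com" page.md` prints the depth that succeeded and leaves the content in `page.md`.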

Tips

Max 20 URLs per request — batch larger lists into multiple calls.
Use --query + --chunks-per-source to get only relevant content instead of full pages.
Try basic first, fall back to advanced if content is missing.
Set --timeout for slow pages (up to 60s).
If search results already contain the content you need (via --include-raw-content), skip the extract step.
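
The first tip, splitting a long URL list into calls of at most 20, can be sketched as follows. `extract_in_batches` and the `TVLY_BIN` override are hypothetical names introduced for illustration; the function reads one URL per line from standard input and uses only the `extract` and `--json` forms shown in Quick start.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of tvly): read URLs from stdin, one per line,
# and run one extract call per group of at most 20 URLs.
# TVLY_BIN is an assumed override hook so the command can be stubbed.
extract_in_batches() {
  local tvly="${TVLY_BIN:-tvly}" max=20 url
  local -a batch=()

  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue   # skip blank lines
    batch+=("$url")
    if [ "${#batch[@]}" -eq "$max" ]; then
      # </dev/null keeps the command from consuming the remaining URL list.
      "$tvly" extract "${batch[@]}" --json </dev/null || return 1
      batch=()
    fi
  done

  # Flush the final, partially filled batch.
  if [ "${#batch[@]}" -gt 0 ]; then
    "$tvly" extract "${batch[@]}" --json </dev/null || return 1
  fi
}
```

A call such as `extract_in_batches < urls.txt` writes one JSON document per batch to standard output, so a consumer has to handle several documents in sequence rather than a single one.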

See also

tavily-search — find pages when you don't have a URL
tavily-crawl — extract content from many pages on a site
