Fetch, extract, and process web and API content. Supports resource blocking, authentication, and Google Custom Search.
Google Custom Search API is free with usage limits (e.g., 100 queries per day for free, with additional queries requiring payment). For full details on quotas, pricing, and restrictions, see the official documentation.
Developed by Rayss
Open Source Project
Built with Node.js & TypeScript (Node.js v18+ required)
Click here to watch the demo video directly in your browser.
If your platform supports it, you can also download and play demo/demo_1.mp4 directly.
See CHANGELOG.md for a complete history of updates and new features.
Web-curl is a powerful tool for fetching and extracting text content from web pages and APIs. Use it as a standalone CLI or as an MCP (Model Context Protocol) server. Web-curl leverages Puppeteer for robust web scraping and supports advanced features such as resource blocking, custom headers, authentication, and Google Custom Search.
Implementation notes and recent changes:

- logs/error-log.txt is rotated when it exceeds ~1MB (renamed to error-log.txt.bak) to prevent unbounded growth, and old files in the logs/ directory are cleaned up at startup (see the sketch below).
- fetch_webpage supports startIndex / maxLength / chunkSize slicing when requested.
- blockResources is now always forced to false, meaning resources are never blocked during page loads.
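A minimal sketch of the rotation check described above; the file names match the defaults mentioned here, but the actual implementation may differ in detail:

// Illustrative log-rotation sketch (not the actual implementation).
import fs from "node:fs";
import path from "node:path";

const LOG_FILE = path.join("logs", "error-log.txt");
const MAX_BYTES = 1024 * 1024; // ~1MB

function rotateErrorLogIfNeeded(): void {
  try {
    const stats = fs.statSync(LOG_FILE);
    if (stats.size > MAX_BYTES) {
      // Rename the oversized log to error-log.txt.bak, replacing any previous backup.
      fs.renameSync(LOG_FILE, LOG_FILE + ".bak");
    }
  } catch {
    // A missing log file is fine; there is nothing to rotate.
  }
}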
- File download tool (download_file):
  - destinationFolder accepts relative paths (resolved against process.cwd()) or absolute paths, and is created if it does not exist.
  - Downloads are streamed to disk via pipeline to minimize memory use and ensure robust writes; the tool only returns the final path on success (see the sketch below).
  - The filename is derived from the URL path (https://.../path/file.jpg -> file.jpg). If no filename is present, the fallback name is downloaded_file.
  - To avoid overwriting an existing file, use a different destinationFolder or include a unique filename (timestamp, uuid) in the URL path or destination prior to calling the tool. (Optionally the code can be extended to support a noOverwrite flag to auto-rename files; ask if you want this implemented.)
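A minimal sketch of the streaming approach described above, assuming Node.js 18+ with the built-in fetch; the helper name and error handling are illustrative, not the actual download_file implementation:

// Illustrative sketch of a streaming download (not the actual download_file code).
import fs from "node:fs";
import path from "node:path";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

async function downloadFile(url: string, destinationFolder: string): Promise<string> {
  // Resolve relative destinations against the current working directory and ensure the folder exists.
  const folder = path.resolve(process.cwd(), destinationFolder);
  fs.mkdirSync(folder, { recursive: true });

  // Derive the filename from the URL path, falling back to "downloaded_file".
  const name = path.basename(new URL(url).pathname) || "downloaded_file";
  const destination = path.join(folder, name);

  const response = await fetch(url);
  if (!response.ok || !response.body) {
    throw new Error(`Download failed: ${response.status} ${response.statusText}`);
  }

  // Stream the response body to disk; pipeline keeps memory use low and propagates errors.
  await pipeline(Readable.fromWeb(response.body as any), fs.createWriteStream(destination));
  return destination; // Only returned once the write has completed successfully.
}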
- fetch_api returns JSON/text when appropriate and base64 for binary responses.
- fetch_api now requires a numeric limit parameter; responses are truncated to at most limit characters. The response object includes bodyLength (the original length in characters) and truncated (a boolean). See the sketch below.
- fetch_api is marked autoApprove in the MCP tool listing, so compatible MCP hosts may invoke it without interactive approval. Internal calls in this codebase use a sensible default limit of 1000 characters where applicable.
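A small sketch of the truncation behaviour described above; the helper name is illustrative and the real response contains more fields:

// Illustrative truncation of a fetch_api response body to `limit` characters.
function truncateBody(body: string, limit: number) {
  const truncated = body.length > limit;
  return {
    body: truncated ? body.slice(0, limit) : body,
    bodyLength: body.length, // original length in characters
    truncated,               // true when the body was cut to `limit`
  };
}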
- Google search requires the APIKEY_GOOGLE_SEARCH and CX_GOOGLE_SEARCH environment variables.
- smart_command detects the input language and translates it (via a translate import). Translation is a best-effort fallback and may fail silently; the original text is preserved on failure.
- Pagination follows nextPageSelector (tries the href first, falls back to clicking the element).
- Sliced responses include startIndex, maxLength, remainingCharacters, and an instruction for fetching more content (including a suggestion to stop if enough information is gathered).
- startIndex (or the alias index) and at least one of chunkSize (preferred), limit (alias), or maxLength must be provided and must be numbers. Calls missing these required parameters are rejected with an InvalidParams error. Set these values according to your needs; they must not be empty.
- Validation is implemented in src/index.ts, and the MCP tool will throw/reject when required parameters are missing or invalid. If you prefer automatic fallbacks instead of rejection, modify the validation logic in src/index.ts (see the sketch below).
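An illustrative sketch of this kind of check (not the actual src/index.ts code; a plain Error stands in for the MCP InvalidParams error):

// Illustrative parameter validation for a fetch_webpage-style call (simplified).
interface SliceArgs {
  startIndex?: number;
  index?: number;      // alias for startIndex
  chunkSize?: number;  // preferred
  limit?: number;      // alias
  maxLength?: number;
}

function validateSliceArgs(args: SliceArgs): { startIndex: number; size: number } {
  const startIndex = args.startIndex ?? args.index;
  const size = args.chunkSize ?? args.limit ?? args.maxLength;
  if (typeof startIndex !== "number" || typeof size !== "number") {
    // The real server rejects such calls with an InvalidParams error.
    throw new Error("startIndex (or index) and one of chunkSize/limit/maxLength are required and must be numbers");
  }
  return { startIndex, size };
}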
- Errors are logged to logs/error-log.txt by default.
- Some tools accept a debug argument which enables more verbose console logging; not all code paths consistently honor the debug flag yet. Prefer inspecting logs/error-log.txt for complete traces.
- A DEBUG=true env var or a global --debug CLI option could be added (recommended for development).
- The project uses the native fetch API available in Node.js 18+, eliminating the need for the node-fetch dependency. This simplifies the dependency tree and leverages built-in capabilities.
- npm run build runs tsc and a chmod step that is a no-op on Windows; CI or cross-platform scripts should guard chmod with a platform check (see the sketch below).
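One way to guard the step is to move the chmod into a small post-build Node script; the script path below is hypothetical and not part of the current build:

// scripts/postbuild.ts (hypothetical): make build/index.js executable on POSIX only.
import fs from "node:fs";

if (process.platform !== "win32") {
  fs.chmodSync("build/index.js", 0o755);
}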
- Avoid exposing fetch_api / fetch_webpage publicly.
- Running Chromium with --no-sandbox reduces isolation; only use it where required and understand the risk on multi-tenant systems.
- npm run lint is provided; adding a pre-commit hook (e.g., husky with lint-staged) is recommended to enforce linting before commits.
- Adding tests for fetch_api and download_file is recommended to ensure reliability and prevent regressions.

This section outlines the high-level architecture of Web-curl.
graph TD
A[User/MCP Host] --> B(CLI / MCP Server)
B --> C{Tool Handlers}
C -- fetch_webpage --> D["Puppeteer (Web Scraping)"]
C -- fetch_api --> E["REST Client"]
C -- google_search --> F["Google Custom Search API"]
C -- smart_command --> G["Language Detection & Translation"]
C -- download_file --> H["File System (Downloads)"]
D --> I["Web Content"]
E --> J["External APIs"]
F --> K["Google Search Results"]
H --> L["Local Storage"]
src/index.ts: Implements both the CLI entry point and the MCP server, exposing tools like fetch_webpage, fetch_api, google_search, and smart_command.
src/rest-client.ts: Provides a flexible HTTP client for API requests, used by both CLI and MCP tools.
The server cleans up the logs/ directory at startup and resolves relative paths against process.cwd(). Exposed tools include download_file (streaming writes), fetch_webpage, fetch_api, google_search, and smart_command.
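As a rough sketch of the kind of helper src/rest-client.ts provides, built on the native fetch available in Node.js 18+ (the function signature below is illustrative; the real module's API may differ):

// Illustrative request helper in the spirit of src/rest-client.ts (not the actual module).
interface RequestOptions {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
  timeoutMs?: number;
}

async function request(url: string, options: RequestOptions = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), options.timeoutMs ?? 60_000);
  try {
    const response = await fetch(url, {
      method: options.method ?? "GET",
      headers: options.headers,
      body: options.body,
      signal: controller.signal,
    });
    const contentType = response.headers.get("content-type") ?? "";
    // Return parsed JSON when the server says so, otherwise plain text.
    const data = contentType.includes("application/json") ? await response.json() : await response.text();
    return { status: response.status, headers: Object.fromEntries(response.headers.entries()), data };
  } finally {
    clearTimeout(timer);
  }
}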
To integrate web-curl as an MCP server, add the following configuration to your mcp_settings.json:
{
"mcpServers": {
"web-curl": {
"command": "node",
"args": [
"build/index.js"
],
"disabled": false,
"alwaysAllow": [
"fetch_webpage",
"fetch_api",
"google_search",
"smart_command",
"download_file"
],
"env": {
"APIKEY_GOOGLE_SEARCH": "YOUR_GOOGLE_API_KEY",
"CX_GOOGLE_SEARCH": "YOUR_CX_ID"
}
}
}
}
1. Get a Google API Key.
2. Get a Custom Search Engine (CX) ID.
3. Enable the Custom Search API.
4. Replace YOUR_GOOGLE_API_KEY and YOUR_CX_ID in the config above.
# Clone the repository
git clone https://github.com/rayss868/MCP-Web-Curl
cd web-curl
# Install dependencies
npm install
# Build the project
npm run build
Windows: Just run npm install.
Linux: You must install extra dependencies for Chromium. Run:
sudo apt-get install -y \
ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 \
libatk1.0-0 libcups2 libdbus-1-3 libdrm2 libgbm1 libnspr4 libnss3 \
libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 xdg-utils
For more details, see the Puppeteer troubleshooting guide.
The CLI supports fetching and extracting text content from web pages.
# Basic usage
node build/index.js https://example.com
# With options
node build/index.js --timeout 30000 --no-block-resources https://example.com
# Save output to a file
node build/index.js -o result.json https://example.com
--timeout <ms>: Set navigation timeout (default: 60000)
--no-block-resources: Deprecated; resource blocking is now always disabled.
-o <file>: Output the result to the specified file

Web-curl can also be run as an MCP server for integration with Roo Context or other MCP-compatible environments.
npm run start
The server will communicate via stdin/stdout and expose the tools as defined in src/index.ts.
Example (fetch_webpage):
{
"name": "fetch_webpage",
"arguments": {
"url": "https://example.com",
"timeout": 60000,
"maxLength": 10000
}
}
For large documents, you can fetch content in slices using startIndex and maxLength. The server will return the sliced content, the total characters available (after whitespace removal), and an instruction for fetching the next part.
Client request for first slice:
{
"name": "fetch_webpage",
"arguments": {
"url": "https://example.com/long-article",
"blockResources": false,
"timeout": 60000,
"maxLength": 2000, // maximum number of characters to return for this slice
"startIndex": 0
}
}
Server response (example):
{
"url": "https://example.com/long-article",
"title": "Long Article Title",
"content": "First 2000 characters of the whitespace-removed HTML...",
"fetchedAt": "2025-08-19T15:00:00.000Z",
"startIndex": 0,
"maxLength": 2000,
"remainingCharacters": 8000, // Total characters - (startIndex + content.length)
"instruction": "To fetch more content, call fetch_webpage again with startIndex=2000."
}
The client fetches the next slice by setting startIndex to the previous startIndex + content.length:
{
"name": "fetch_webpage",
"arguments": {
"url": "https://example.com/long-article",
"maxLength": 2000,
"startIndex": 2000 // From the instruction in the previous response
}
}
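Putting the two requests together, a client-side loop could look like the sketch below; fetchWebpageSlice is a placeholder for whatever tool-call helper your MCP host provides and is not part of web-curl:

// Illustrative client loop for fetching a long page in slices.
async function fetchAllContent(
  url: string,
  fetchWebpageSlice: (args: { url: string; startIndex: number; maxLength: number }) => Promise<{
    content: string;
    remainingCharacters: number;
  }>,
  sliceSize = 2000,
): Promise<string> {
  let startIndex = 0;
  let full = "";
  while (true) {
    const slice = await fetchWebpageSlice({ url, startIndex, maxLength: sliceSize });
    full += slice.content;
    if (slice.remainingCharacters <= 0 || slice.content.length === 0) break;
    // The next slice starts right after the characters just received.
    startIndex += slice.content.length;
  }
  return full;
}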
Repeat until remainingCharacters is 0 and the instruction indicates all content has been fetched. The content field will contain the sliced, whitespace-removed HTML.

Set the following environment variables for Google Custom Search:
APIKEY_GOOGLE_SEARCH: Your Google API key
CX_GOOGLE_SEARCH: Your Custom Search Engine ID

Example (fetch_webpage with pagination):
{
"name": "fetch_webpage",
"arguments": {
"url": "https://en.wikipedia.org/wiki/Web_scraping",
"blockResources": true,
"maxLength": 5000,
"nextPageSelector": ".pagination-next a",
"maxPages": 3,
"debug": true
}
}
Example (fetch_api):
{
"name": "fetch_api",
"arguments": {
"url": "https://api.github.com/repos/nodejs/node",
"method": "GET",
"headers": {
"Accept": "application/vnd.github.v3+json"
}
}
}
Example (google_search):
{
"name": "google_search",
"arguments": {
"query": "web scraping best practices",
"num": 5,
"language": "lang_en",
"region": "US",
"site": "wikipedia.org",
"dateRestrict": "w1",
"debug": true
}
}
Example (download_file):
{
"name": "download_file",
"arguments": {
"url": "https://example.com/image.jpg",
"destinationFolder": "downloads"
}
}
Note: destinationFolder can be either a relative path (resolved against the current working directory, process.cwd()) or an absolute path. The server will create the destination folder if it does not exist.
Troubleshooting tips:

- Increase the timeout parameter if requests are timing out.
- Adjust resourceTypesToBlock if needed.
- Ensure APIKEY_GOOGLE_SEARCH and CX_GOOGLE_SEARCH are set in your environment.
- Check the logs/error-log.txt file for detailed error messages.
- Use maxLength and startIndex to fetch content in slices; the server will provide remainingCharacters and an instruction for fetching the next part (including a suggestion to stop if enough information is gathered).
- See src/index.ts for all available options.

Contributions are welcome! If you want to contribute, fork this repository and submit a pull request.
If you find any issues or have suggestions, please open an issue on the repository page.
This project was developed by Rayss.
For questions, improvements, or contributions, please contact the author or open an issue in the repository.
Note: Google Search API is free with usage limits. For details, see: Google Custom Search API Overview