# Web-curl

Developed by Rayss

Open Source Project, built with Node.js & TypeScript (Node.js v18+ required).
## Demo Video

A demo video is available; if your platform supports it, you can download and play `demo/demo_1.mp4` directly.
## Table of Contents
- Changelog / Update History
- Overview
- Features
- Architecture
- Installation
- Usage
- Configuration
- Examples
- Troubleshooting
- Tips & Best Practices
- Contributing & Issues
- License & Attribution
## Changelog / Update History
See CHANGELOG.md for a complete history of updates and new features.
## Overview
Web-curl is a powerful tool for fetching and extracting text content from web pages and APIs. Use it as a standalone CLI or as an MCP (Model Context Protocol) server. Web-curl leverages Puppeteer for robust web scraping and supports advanced features such as resource blocking, custom headers, authentication, and Google Custom Search.
## Features

### Storage & Download Details

- Error log rotation: `logs/error-log.txt` is rotated when it exceeds ~1MB (renamed to `error-log.txt.bak`) to prevent unbounded growth.
- Logs & temp cleanup: old temporary files in the `logs/` directory are cleaned up at startup.
- Browser lifecycle: Puppeteer browser instances are closed in `finally` blocks to avoid Chromium temp file leaks.
- Content extraction:
  - Returns raw text, HTML, and a Readability "main article" when available. Readability attempts to extract the primary content of a webpage, removing headers, footers, sidebars, and other non-essential elements, producing cleaner, more focused text.
  - Readability output is subject to `startIndex`/`maxLength`/`chunkSize` slicing when requested.
- Resource blocking: `blockResources` is now always forced to `false`, so resources are never blocked.
- Timeout control: navigation and API request timeouts are configurable via tool arguments.
- Output: results can be printed to stdout or written to a file via CLI options.
- Download behavior (`download_file`):
  - `destinationFolder` accepts relative paths (resolved against `process.cwd()`) or absolute paths. The server creates `destinationFolder` if it does not exist.
  - Downloads are streamed using Node streams + `pipeline` to minimize memory use and ensure robust writes (a minimal sketch of this pattern follows this list).
  - Filenames are derived from the URL path (e.g., `https://.../path/file.jpg` -> `file.jpg`). If no filename is present, the fallback name is `downloaded_file`.
  - Overwrite semantics: by default, the implementation overwrites an existing file with the same name. To avoid overwriting, provide a unique `destinationFolder` or include a unique filename (timestamp, UUID) in the URL path or destination before calling the tool. (Optionally, the code can be extended with a `noOverwrite` flag to auto-rename files; ask if you want this implemented.)
  - Error handling: non-2xx responses cause a thrown error; partial writes are avoided by streaming through `pipeline` and only returning the final path on success.
- Usage modes: CLI and MCP server (stdin/stdout transport).
- REST client:
  - `fetch_api` returns JSON/text when appropriate and base64 for binary responses.
  - Note: `fetch_api` now requires a numeric `limit` parameter; responses are truncated to at most `limit` characters. The response object includes `bodyLength` (original length in characters) and `truncated` (boolean).
  - `fetch_api` is marked `autoApprove` in the MCP tool listing, so compatible MCP hosts may invoke it without interactive approval. Internal calls in this codebase use a sensible default `limit` of 1000 characters where applicable.
- Google Custom Search: requires `APIKEY_GOOGLE_SEARCH` and `CX_GOOGLE_SEARCH`.
- Smart command:
  - Auto language detection (franc-min) and optional translation (dynamic `translate` import). Translation is a best-effort fallback and may fail silently; the original text is preserved on failure.
  - Query enrichment is heuristic-based; results depend on the detected intent.
- fetch_webpage specifics:
  - Multi-page crawling via `nextPageSelector` (tries `href` first, falls back to clicking the element).
  - Whitespace is now removed from the entire HTML before slicing.
  - Returns the sliced content, total characters (after whitespace removal), `startIndex`, `maxLength`, `remainingCharacters`, and an `instruction` for fetching more content (including a suggestion to stop once enough information has been gathered).
  - Required parameters: `startIndex` (or alias `index`) and at least one of `chunkSize` (preferred), `limit` (alias), or `maxLength` must be provided and be a number. Calls missing these required parameters are rejected with an InvalidParams error. Set these values according to your needs; they may not be empty.
  - Validation behavior: runtime validation is enforced in `src/index.ts`, and the MCP tool will throw/reject when required parameters are missing or invalid. If you prefer automatic fallbacks instead of rejection, modify the validation logic in `src/index.ts`.
- Debug & logging:
  - Runtime logs: detailed runtime errors and debug traces are written to `logs/error-log.txt` by default.
  - Debug flag: some CLI/tool paths accept a `debug` argument that enables more verbose console logging; not all code paths consistently honor a `debug` flag yet. Prefer inspecting `logs/error-log.txt` for complete traces.
  - To enable console-level debugging consistently, a small code change to read a `DEBUG=true` env var or a global `--debug` CLI option can be added (recommended for development).
- Compatibility & build notes:
  - The project uses the native global `fetch` API available in Node.js 18+, eliminating the `node-fetch` dependency and simplifying the dependency tree.
  - `npm run build` runs `tsc` and a `chmod` step that is a no-op on Windows; CI or cross-platform scripts should guard `chmod` with a platform check.
- Security considerations:
  - SSRF: validate/whitelist destination hosts if exposing `fetch_api`/`fetch_webpage` publicly (a hedged allowlist sketch also follows this list).
  - Rate limiting & auth: add request rate limiting and access controls for public deployments.
  - Puppeteer flags: `--no-sandbox` reduces isolation; only use it where required and understand the risk on multi-tenant systems.
- Tests & linting:
  - Linting: `npm run lint` is provided; a pre-commit hook (e.g., using `husky` and `lint-staged`) is recommended to enforce linting before commits.
  - Tests: no unit tests are included yet; comprehensive integration tests for core functionality such as `fetch_api` and `download_file` are planned to ensure reliability and prevent regressions.
- All tool schemas and documentation are in English for clarity.
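The documented download flow can be pictured with a short TypeScript sketch. This is only an illustration assuming the Node 18+ global `fetch`; the actual implementation lives in `src/index.ts` and may differ in details such as error messages and overwrite handling.

```typescript
// Hedged sketch of the documented download flow (not the project's exact code):
// resolve the folder, ensure it exists, derive a filename from the URL path,
// and stream the response body to disk with pipeline to keep memory use low.
import { createWriteStream } from "node:fs";
import { mkdir } from "node:fs/promises";
import { pipeline } from "node:stream/promises";
import { Readable } from "node:stream";
import * as path from "node:path";

async function downloadFile(url: string, destinationFolder: string): Promise<string> {
  const folder = path.resolve(process.cwd(), destinationFolder); // relative or absolute
  await mkdir(folder, { recursive: true });                      // create if missing

  const res = await fetch(url);                                  // native fetch, Node 18+
  if (!res.ok || !res.body) throw new Error(`Download failed: HTTP ${res.status}`);

  // Derive the filename from the URL path; fall back to "downloaded_file".
  const name = path.basename(new URL(url).pathname) || "downloaded_file";
  const target = path.join(folder, name);

  await pipeline(
    Readable.fromWeb(res.body as import("node:stream/web").ReadableStream),
    createWriteStream(target),
  );
  return target; // only returned on success
}
```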
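For the SSRF note above, a host allowlist is one simple guard when exposing the tools publicly. The check below is a hedged sketch, not something the project ships; the hostnames are placeholders you would replace with your own policy.

```typescript
// Hedged example of a host allowlist check for public deployments.
// The allowed hosts are placeholders; adapt them to your own policy.
const ALLOWED_HOSTS = new Set(["example.com", "api.github.com"]);

function assertAllowedUrl(rawUrl: string): URL {
  const url = new URL(rawUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Blocked protocol: ${url.protocol}`);
  }
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`Blocked host: ${url.hostname}`);
  }
  return url;
}

// Example: validate a URL before handing it to the fetch_api / fetch_webpage handlers.
// assertAllowedUrl("https://api.github.com/repos/nodejs/node");
```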
## Architecture
This section outlines the high-level architecture of Web-curl.
```mermaid
graph TD
    A[User/MCP Host] --> B(CLI / MCP Server)
    B --> C{Tool Handlers}
    C -- fetch_webpage --> D["Puppeteer (Web Scraping)"]
    C -- fetch_api --> E["REST Client"]
    C -- google_search --> F["Google Custom Search API"]
    C -- smart_command --> G["Language Detection & Translation"]
    C -- download_file --> H["File System (Downloads)"]
    D --> I["Web Content"]
    E --> J["External APIs"]
    F --> K["Google Search Results"]
    H --> L["Local Storage"]
```
- CLI & MCP Server: `src/index.ts` implements both the CLI entry point and the MCP server, exposing tools like `fetch_webpage`, `fetch_api`, `google_search`, and `smart_command`.
- Web Scraping: uses Puppeteer for headless browsing, resource blocking, and content extraction.
- REST Client: `src/rest-client.ts` provides a flexible HTTP client for API requests, used by both CLI and MCP tools.
- Configuration: managed via CLI options, environment variables, and tool arguments.
- Note: the server creates `logs/` at startup and resolves relative paths against `process.cwd()`. Exposed tools include `download_file` (streaming writes), `fetch_webpage`, `fetch_api`, `google_search`, and `smart_command`.
## MCP Server Configuration Example
To integrate web-curl as an MCP server, add the following configuration to your mcp_settings.json:
```json
{
  "mcpServers": {
    "web-curl": {
      "command": "node",
      "args": [
        "build/index.js"
      ],
      "disabled": false,
      "alwaysAllow": [
        "fetch_webpage",
        "fetch_api",
        "google_search",
        "smart_command",
        "download_file"
      ],
      "env": {
        "APIKEY_GOOGLE_SEARCH": "YOUR_GOOGLE_API_KEY",
        "CX_GOOGLE_SEARCH": "YOUR_CX_ID"
      }
    }
  }
}
```
## How to Obtain Google API Key and CX
- Get a Google API Key:
  - Go to the Google Cloud Console.
  - Create/select a project, then go to APIs & Services > Credentials.
  - Click Create Credentials > API key and copy it.
  - Note: API key activation might take some time. Also be aware of Google's usage quotas for the free tier.
- Get a Custom Search Engine (CX) ID:
  - Go to Google Custom Search Engine.
  - Create/select a search engine, then copy the Search engine ID (CX).
- Enable the Custom Search API:
  - In the Google Cloud Console, go to APIs & Services > Library.
  - Search for Custom Search API and enable it.

Replace `YOUR_GOOGLE_API_KEY` and `YOUR_CX_ID` in the config above.
## Installation
```bash
# Clone the repository
git clone https://github.com/rayss868/MCP-Web-Curl
cd web-curl

# Install dependencies
npm install

# Build the project
npm run build
```
- Prerequisites: Ensure you have Node.js (v18+) and Git installed on your system.
### Puppeteer installation notes

- Windows: just run `npm install`.
- Linux: you must install extra dependencies for Chromium:

```bash
sudo apt-get install -y \
  ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 \
  libatk1.0-0 libcups2 libdbus-1-3 libdrm2 libgbm1 libnspr4 libnss3 \
  libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 xdg-utils
```

For more details, see the Puppeteer troubleshooting guide.
## Usage
### CLI Usage
The CLI supports fetching and extracting text content from web pages.
```bash
# Basic usage
node build/index.js https://example.com

# With options
node build/index.js --timeout 30000 --no-block-resources https://example.com

# Save output to a file
node build/index.js -o result.json https://example.com
```
### Command Line Options

- `--timeout <ms>`: set the navigation timeout (default: 60000)
- `--no-block-resources`: deprecated; resource blocking is now always disabled
- `-o <file>`: write the result to the specified file
### MCP Server Usage
Web-curl can be run as an MCP server for integration with Roo Context or other MCP-compatible environments.
#### Exposed Tools
- fetch_webpage: Retrieve text, html, main article content, and metadata from a web page. Supports multi-page crawling (pagination) and debug mode.
- fetch_api: Make REST API requests with custom methods, headers, body, timeout, and debug mode.
- google_search: Search the web using Google Custom Search API, with advanced filters (language, region, site, dateRestrict) and debug mode.
- smart_command: Free-form command with automatic language detection, translation, query enrichment, and debug mode.
- download_file: Download a file from a given URL to a specified folder.
#### Running as MCP Server

```bash
npm run start
```

The server communicates via stdin/stdout and exposes the tools defined in `src/index.ts`.
#### MCP Tool Example (fetch_webpage)

```json
{
  "name": "fetch_webpage",
  "arguments": {
    "url": "https://example.com",
    "timeout": 60000,
    "startIndex": 0,
    "maxLength": 10000
  }
}
```
### Content Slicing Example (Recommended for Large Pages)
For large documents, you can fetch content in slices using startIndex and maxLength. The server will return the sliced content, the total characters available (after whitespace removal), and an instruction for fetching the next part.
Client request for first slice:
```jsonc
{
  "name": "fetch_webpage",
  "arguments": {
    "url": "https://example.com/long-article",
    "blockResources": false,
    "timeout": 60000,
    "maxLength": 2000, // maximum number of characters to return for this slice
    "startIndex": 0
  }
}
```
Server response (example):
```jsonc
{
  "url": "https://example.com/long-article",
  "title": "Long Article Title",
  "content": "First 2000 characters of the whitespace-removed HTML...",
  "fetchedAt": "2025-08-19T15:00:00.000Z",
  "startIndex": 0,
  "maxLength": 2000,
  "remainingCharacters": 8000, // total characters - (startIndex + content.length)
  "instruction": "To fetch more content, call fetch_webpage again with startIndex=2000."
}
```
Client fetches the next slice by setting startIndex to the previous startIndex + content.length:
```jsonc
{
  "name": "fetch_webpage",
  "arguments": {
    "url": "https://example.com/long-article",
    "maxLength": 2000,
    "startIndex": 2000 // from the instruction in the previous response
  }
}
```
- Continue fetching until `remainingCharacters` is 0 and the `instruction` indicates all content has been fetched.
- The `content` field will contain the sliced, whitespace-removed HTML.
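If you are driving these calls from code, the slicing protocol reduces to a simple loop. The sketch below assumes a generic `callTool` helper standing in for whatever request function your MCP client library provides (a hypothetical name); the response fields match those documented above.

```typescript
// Minimal sketch of a slicing loop. `callTool` is a placeholder for your
// MCP client's request helper; adjust it to your setup.
type FetchWebpageResult = {
  content: string;
  startIndex: number;
  remainingCharacters: number;
};

async function fetchAllSlices(
  callTool: (name: string, args: Record<string, unknown>) => Promise<FetchWebpageResult>,
  url: string,
  chunk = 2000,
): Promise<string> {
  let startIndex = 0;
  let full = "";
  while (true) {
    const res = await callTool("fetch_webpage", { url, startIndex, maxLength: chunk });
    full += res.content;
    if (res.remainingCharacters <= 0 || res.content.length === 0) break;
    startIndex += res.content.length; // next slice starts where this one ended
  }
  return full;
}
```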
### Google Search Integration

Set the following environment variables for Google Custom Search:

- `APIKEY_GOOGLE_SEARCH`: your Google API key
- `CX_GOOGLE_SEARCH`: your Custom Search Engine ID
## Configuration

- Resource Blocking: `blockResources` is always forced to `false`, so images, stylesheets, and fonts are never blocked.
- Timeout: Set navigation and API request timeouts.
- Custom Headers: Pass custom HTTP headers for advanced scenarios.
- Authentication: Supports HTTP Basic Auth via username/password.
- Environment Variables: Used for Google Search API integration.
## Examples {#examples}
```json
{
  "name": "fetch_webpage",
  "arguments": {
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "blockResources": false,
    "startIndex": 0,
    "maxLength": 5000,
    "nextPageSelector": ".pagination-next a",
    "maxPages": 3,
    "debug": true
  }
}
```
```json
{
  "name": "fetch_api",
  "arguments": {
    "url": "https://api.github.com/repos/nodejs/node",
    "method": "GET",
    "limit": 1000,
    "headers": {
      "Accept": "application/vnd.github.v3+json"
    }
  }
}
```
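Since `fetch_api` truncates the body to `limit` characters, the result also reports the original size and whether truncation occurred. The TypeScript shape below is an approximation based on the fields described in this README (the `body` field name is an assumption; check `src/index.ts` for the authoritative schema):

```typescript
// Approximate shape of a fetch_api result, based on the fields described above.
// The `body` field name is assumed; the exact structure is defined in src/index.ts.
interface FetchApiResult {
  body: string;        // JSON/text, or base64 for binary responses
  bodyLength: number;  // original length in characters, before truncation
  truncated: boolean;  // true when the body was cut to `limit` characters
}
```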
```json
{
  "name": "google_search",
  "arguments": {
    "query": "web scraping best practices",
    "num": 5,
    "language": "lang_en",
    "region": "US",
    "site": "wikipedia.org",
    "dateRestrict": "w1",
    "debug": true
  }
}
```
```json
{
  "name": "download_file",
  "arguments": {
    "url": "https://example.com/image.jpg",
    "destinationFolder": "downloads"
  }
}
```
Note: destinationFolder can be either a relative path (resolved against the current working directory, process.cwd()) or an absolute path. The server will create the destination folder if it does not exist.
## Troubleshooting {#troubleshooting}

- Timeout Errors: increase the `timeout` parameter if requests are timing out.
- Missing Content: resource blocking is always disabled, so if content is missing, increase the `timeout` or enable `debug` to inspect what was fetched.
- Google Search Fails: ensure `APIKEY_GOOGLE_SEARCH` and `CX_GOOGLE_SEARCH` are set in your environment.
- Binary/Unknown Content: non-text responses are base64-encoded.
- Error Logs: check `logs/error-log.txt` for detailed error messages.
## Tips & Best Practices {#tips--best-practices}

- Resource blocking is always disabled (`blockResources` is forced to `false`).
- For large pages, use `maxLength` and `startIndex` to fetch content in slices. The server provides `remainingCharacters` and an `instruction` for fetching the next part (including a suggestion to stop once enough information has been gathered).
- Always validate your tool arguments to avoid errors.
- Secure your API keys and sensitive data using environment variables.
- Review the MCP tool schemas in `src/index.ts` for all available options.
## Contributing & Issues {#contributing--issues}
Contributions are welcome! If you want to contribute, fork this repository and submit a pull request.
If you find any issues or have suggestions, please open an issue on the repository page.
## License & Attribution {#license--attribution}
This project was developed by Rayss.
For questions, improvements, or contributions, please contact the author or open an issue in the repository.
Note: the Google Custom Search API is free with usage limits (100 queries per day on the free tier; additional queries require payment). For full details on quotas, pricing, and restrictions, see the Google Custom Search API Overview.