MCP Go Colly Crawler
A web crawling framework that integrates the Model Context Protocol (MCP) with the Colly web scraping library.
Overview
MCP Go Colly is a web crawling framework that integrates the Model Context Protocol (MCP) with the powerful Colly web scraping library. It aims to provide a flexible, extensible way to extract web content for large language model (LLM) applications.
Features
- Concurrent web crawling with configurable depth and domain restrictions
- MCP server integration for tool-based crawling
- Graceful shutdown handling
- Robust error handling and result formatting
- Support for both single URL and batch URL crawling
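As a rough illustration of what these features map to in Colly terms, the sketch below wires up a collector with a depth limit, a domain restriction, asynchronous requests, and signal-based shutdown. The structure and names are assumptions for illustration, not the project's actual implementation.

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Graceful shutdown: cancel the context on SIGINT/SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	// Concurrent crawling with a configurable depth and domain restriction.
	c := colly.NewCollector(
		colly.MaxDepth(2),
		colly.AllowedDomains("example.com"),
		colly.Async(true),
	)

	// Follow links unless shutdown has been requested.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		select {
		case <-ctx.Done():
			return
		default:
			e.Request.Visit(e.Attr("href"))
		}
	})

	// Basic error reporting for failed requests.
	c.OnError(func(r *colly.Response, err error) {
		log.Printf("request to %s failed: %v", r.Request.URL, err)
	})

	if err := c.Visit("https://example.com"); err != nil {
		log.Fatal(err)
	}
	c.Wait() // wait for outstanding async requests to finish
}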
Building from Source
Prerequisites
- Go 1.21 or later
- Make (for using Makefile commands)
Installation
- Clone the repository:
git clone https://github.com/yourusername/mcp-go-colly.git
cd mcp-go-colly
- Install dependencies:
make deps
Building
The project includes a Makefile with several useful commands:
# Build the binary (outputs to bin/mcp-go-colly)
make build
# Build for all platforms (Linux, Windows, macOS)
make build-all
# Run tests
make test
# Clean build artifacts
make clean
# Format code
make fmt
# Run linter
make lint
All binaries will be generated in the bin/ directory.
Claude Desktop Configuration
To use the crawler with Claude Desktop, add the following configuration to your claude_desktop_config.json file:
{
  "mcpServers": {
    "web-scraper": {
      "command": "<add path here>/mcp-go-colly/bin/mcp-go-colly"
    }
  }
}
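The location of claude_desktop_config.json depends on your platform: on macOS it is typically ~/Library/Application Support/Claude/claude_desktop_config.json, and on Windows %APPDATA%\Claude\claude_desktop_config.json. Restart Claude Desktop after editing the file so the new server is picked up.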
Usage
As an MCP Tool
The crawler is implemented as an MCP tool that can be called with the following parameters:
{
  "urls": ["https://example.com"],  // Single URL or array of URLs
  "max_depth": 2                    // Optional: Maximum crawl depth (default: 2)
}
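Because urls accepts either a single string or an array, the server has to normalize that argument before starting the crawl. Below is a small sketch of what that normalization could look like once the tool arguments arrive as a generic JSON-decoded map; the helper name and structure are assumptions for illustration, not the project's actual code.

package crawler

import "fmt"

// normalizeURLs accepts either a single URL string or a JSON-decoded array of
// URLs, mirroring the tool's "urls" parameter (hypothetical helper).
func normalizeURLs(arg interface{}) ([]string, error) {
	switch v := arg.(type) {
	case string:
		return []string{v}, nil
	case []interface{}:
		urls := make([]string, 0, len(v))
		for _, item := range v {
			s, ok := item.(string)
			if !ok {
				return nil, fmt.Errorf("urls must contain strings, got %T", item)
			}
			urls = append(urls, s)
		}
		return urls, nil
	default:
		return nil, fmt.Errorf("urls must be a string or an array of strings, got %T", arg)
	}
}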
Example MCP Tool Call
result, err := crawlerTool.Call(ctx, mcp.CallToolRequest{
    Params: struct{ Arguments map[string]interface{} }{
        Arguments: map[string]interface{}{
            "urls":      []string{"https://example.com"},
            "max_depth": 2,
        },
    },
})
Configuration Options
- max_depth: Set the maximum crawl depth (default: 2)
- urls: Single URL string or array of URLs to crawl
- Domain restrictions are automatically applied based on the provided URLs
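The automatic domain restriction implies that the allowed domains are derived from the requested URLs themselves. One way that could look, using net/url to pull out hostnames (a hypothetical helper, not the project's code):

package crawler

import (
	"fmt"
	"net/url"
)

// deriveAllowedDomains extracts the hostname of each requested URL so the
// crawl can be restricted to those domains before the collector is created.
func deriveAllowedDomains(rawURLs []string) ([]string, error) {
	domains := make([]string, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid URL %q: %w", raw, err)
		}
		domains = append(domains, u.Hostname())
	}
	return domains, nil
}

The resulting slice can then be passed to colly.AllowedDomains so the crawler never leaves the sites it was asked to visit.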
Contributing
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
License
MIT
Acknowledgments
- Colly Web Scraping Framework
- Mark3 Labs MCP Project