Fetch as Markdown MCP Server

Fetches web pages and converts them to clean markdown, focusing on main content extraction.

A Model Context Protocol (MCP) server that fetches web pages and converts them to clean, readable markdown format, focusing on main content extraction while minimizing context overhead.

Overview

This MCP server acts as a bridge between AI assistants and the web. It is designed around four goals:

Extract Clean Content: Focuses on main article content, removing navigation, ads, and sidebars
Minimize Context: Strips unnecessary elements to reduce token usage while preserving content structure
Respectful Scraping: Rate-limits requests, sends an identifying user-agent header, and enforces request timeouts
Error Resilience: Gracefully handles various web-related errors and edge cases

Features

🌐 Web Page Fetching

Fetch any publicly accessible web page
Automatic redirect handling with final URL reporting
Configurable timeouts (5-30 seconds, default 10) and graceful error handling
Respectful rate limiting (1-second intervals between requests)

🧹 Content Cleaning

Removes navigation, ads, sidebars, and other non-essential elements
Focuses on main content areas using semantic HTML detection
Strips unnecessary HTML attributes to reduce token usage
Preserves content structure and readability

📝 Markdown Conversion

Converts HTML to clean, readable markdown
Configurable link and image inclusion
Proper heading hierarchy and formatting
Post-processing to remove excessive whitespace
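
The whitespace post-processing mentioned above can be pictured with a short sketch. The function name `tidy_markdown` and the exact rules (trim trailing whitespace, collapse runs of blank lines into one) are assumptions made for illustration; the server's own rules in main.py may differ.

```python
import re


def tidy_markdown(markdown: str) -> str:
    """Collapse excess whitespace in converted markdown (illustrative rules)."""
    # Drop trailing spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", markdown, flags=re.MULTILINE)
    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove leading and trailing blank lines from the document as a whole.
    return text.strip()
```

Note that trimming trailing spaces also removes markdown hard line breaks (two spaces at end of line), which is usually acceptable when the goal is a smaller token count.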

Installation & Setup

Prerequisites

Python 3.12 or higher
uv package manager

Quick Start

Run directly with uvx:

uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown

Or install locally:

Clone or download this project
Install dependencies:
```
cd mcp-fetch-as-markdown
uv sync
```
Run the server:
```
uv run python main.py
```

Integration with AI Assistants

This MCP server is designed to work with AI assistants that support the Model Context Protocol. Configure your AI assistant to connect to this server via stdio.

Example configuration for Claude Desktop:

{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uvx",
      "args": ["git+https://github.com/bhubbb/mcp-fetch-as-markdown"]
    }
  }
}

Or if using a local installation:

{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uv",
      "args": ["--directory", "/path/to/mcp-fetch-as-markdown", "run", "python", "main.py"]
    }
  }
}

Available Tools

`fetch`

Fetch a web page and convert it to clean markdown format.

Parameters:

url (required): URL of the web page to fetch and convert
include_links (optional): Whether to preserve links in markdown output (default: true)
include_images (optional): Whether to include image references (default: false)
timeout (optional): Request timeout in seconds (5-30, default: 10)

Returns:

Fetch metadata (original URL, final URL, title, content length, status code, content type)
Clean markdown content with proper formatting

Example:

{
  "name": "fetch",
  "arguments": {
    "url": "https://example.com/article",
    "include_links": true,
    "include_images": false,
    "timeout": 15
  }
}
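
To make the defaults and the timeout range concrete, here is a minimal sketch of how the arguments might be normalized before a fetch. The helper name `normalize_fetch_arguments` is hypothetical, and clamping out-of-range timeouts (rather than rejecting them) is an assumption; check main.py for the actual behavior.

```python
DEFAULT_TIMEOUT = 10                 # seconds, per the parameter list
MIN_TIMEOUT, MAX_TIMEOUT = 5, 30     # documented allowed range


def normalize_fetch_arguments(arguments: dict) -> dict:
    """Apply the documented defaults to a raw `fetch` arguments dict."""
    url = arguments.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("'url' is required")
    timeout = int(arguments.get("timeout", DEFAULT_TIMEOUT))
    return {
        "url": url.strip(),
        "include_links": bool(arguments.get("include_links", True)),
        "include_images": bool(arguments.get("include_images", False)),
        # Assumed behavior: out-of-range values are clamped, not rejected.
        "timeout": max(MIN_TIMEOUT, min(MAX_TIMEOUT, timeout)),
    }
```

With only a URL supplied, the result carries `include_links=True`, `include_images=False`, and `timeout=10`.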

How It Works

Content Extraction Strategy

Fetch Page: Makes HTTP request with proper headers and timeout handling
Parse HTML: Uses BeautifulSoup to parse the HTML content
Remove Unwanted Elements: Strips scripts, styles, navigation, ads, sidebars, footers
Find Main Content: Looks for semantic elements like `<main>`, `<article>`, or common content classes
Clean Attributes: Removes unnecessary HTML attributes to reduce size
Convert to Markdown: Uses configurable markdown conversion with proper formatting
Post-process: Removes excessive whitespace and blank lines
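
The "Remove Unwanted Elements" step can be illustrated without third-party packages. The server itself uses BeautifulSoup; the sketch below deliberately swaps in the standard library's `html.parser` so it runs anywhere. The class name, the helper `extract_text`, and the tag list are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Tags whose entire contents are discarded (an illustrative list).
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer", "header"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the kept text nodes, one per line."""
    extractor = MainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)
```

A real implementation would also narrow the result to the `<main>` or `<article>` subtree and keep the markup for markdown conversion; this sketch only shows how boilerplate regions are dropped.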

Respectful Web Scraping

Rate Limiting: Minimum 1-second interval between requests
User Agent: Identifies itself as the "MCP-Fetch-As-Markdown" tool
Timeout Handling: Configurable timeouts to avoid hanging requests
Error Handling: Graceful handling of network issues, HTTP errors, and malformed content
Redirect Support: Follows redirects and reports final URLs
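
A minimum-interval rate limiter like the one described above can be sketched in a few lines. The class name `RateLimiter` is hypothetical, and sleeping until the interval has passed (instead of rejecting the request) is an assumption about how the server behaves.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls by sleeping when needed."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_request = None    # time at which the last request was allowed

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        delay = 0.0
        now = self._clock()
        if self._last_request is not None:
            delay = max(0.0, self.min_interval - (now - self._last_request))
            if delay > 0:
                self._sleep(delay)
        self._last_request = now + delay
        return delay
```

Calling `wait()` immediately before each HTTP request is enough to space requests at least `min_interval` seconds apart. `time.monotonic` is used rather than `time.time` so that system clock adjustments cannot shorten the interval.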

Structured Output Format

All responses include:

Metadata Block: Original URL, final URL, page title, content statistics, HTTP status
Content Block: Clean markdown conversion of the main page content

This structure makes responses both human-readable and machine-parseable while minimizing token usage.
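
As a rough picture of the two-block layout, the sketch below renders a metadata block followed by a content block. The function name `format_response`, the section titles, and the metadata keys are all hypothetical; the server's actual output layout is defined in main.py and may look different.

```python
def format_response(metadata: dict, markdown: str) -> str:
    """Render a metadata block followed by the content block."""
    lines = ["# Fetch Metadata"]
    for key, value in metadata.items():
        # One "key: value" bullet per metadata field, in insertion order.
        lines.append(f"- {key}: {value}")
    lines += ["", "# Content", "", markdown.strip()]
    return "\n".join(lines)
```

Keeping the metadata as short key-value bullets ahead of the content lets a client read the status and final URL without parsing the page body.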

Error Handling

Invalid URLs: Clear validation and error messages
Network Issues: Timeout, connection error, and DNS failure handling
HTTP Errors: Proper handling of 404, 500, and other HTTP status codes
Malformed Content: Graceful handling of broken HTML and encoding issues
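
URL validation of the kind listed above might look like the following sketch, which uses only `urllib.parse` from the standard library. The function name `validate_url` and the error wording are assumptions, as is the restriction to `http` and `https`.

```python
from urllib.parse import urlparse


def validate_url(url: str):
    """Return an error message for an unusable URL, or None if it looks valid."""
    try:
        parsed = urlparse(url.strip())
    except ValueError as exc:
        # urlparse raises ValueError for some malformed inputs, e.g. bad IPv6 hosts.
        return f"Malformed URL: {exc}"
    if parsed.scheme not in ("http", "https"):
        return f"Unsupported or missing URL scheme: {parsed.scheme or '(none)'}"
    if not parsed.netloc:
        return "URL has no host name"
    return None
```

Returning a message instead of raising keeps the error path uniform: the caller can place the text directly into a structured error response.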

Use Cases

For Research & Analysis

Convert articles and blog posts to clean markdown for analysis
Extract main content from news articles and research papers
Gather information while minimizing irrelevant context

For Content Processing

Prepare web content for further AI processing
Extract clean text from web pages for summarization
Convert HTML content to markdown for documentation

For AI Assistants

Fetch and process web content with minimal token overhead
Extract relevant information while filtering out noise
Provide clean, structured content for AI reasoning

Examples

Basic Page Fetching

Ask your AI assistant: "Fetch the content from this article URL as markdown"

The server will:

Fetch the web page with proper headers and rate limiting
Extract the main content area, removing navigation and ads
Convert to clean markdown format
Return structured metadata and content

With Link Preservation

Ask your AI assistant: "Fetch this page but keep all the links intact"

The server will:

Fetch and process the page normally
Preserve all hyperlinks in markdown format, `[text](url)`
Maintain link structure while cleaning other elements
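
For contrast, the sketch below shows one way output could look when `include_links` is set to false: each `[text](url)` link is reduced to its visible text, while image references are left alone. The regex and the function name `strip_links` are assumptions for illustration; the server more likely controls this during HTML-to-markdown conversion rather than afterwards.

```python
import re

# Matches [text](url) or [text](url "title"), but not image syntax ![alt](src).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\([^)\s]+(?:\s+\"[^\"]*\")?\)")


def strip_links(markdown: str) -> str:
    """Replace each markdown link with its visible text."""
    return LINK_PATTERN.sub(r"\1", markdown)
```

Dropping URLs this way shortens the output, which matters most on link-heavy pages where the targets are not needed for the task.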

Error Handling Example

Ask your AI assistant: "Try to fetch content from this broken URL"

The server will:

Validate the URL format
Attempt the request with proper timeout
Return a structured error message if the request fails
Provide helpful information about what went wrong

Development

Project Structure

mcp-fetch-as-markdown/
├── main.py          # Main MCP server implementation
├── pyproject.toml   # Project dependencies and metadata
├── AGENT.md         # Development rules and guidelines
├── example.py       # Usage examples and demonstrations
└── .venv/           # Virtual environment (created by uv)

Key Dependencies

mcp: Model Context Protocol framework
requests: HTTP request handling
beautifulsoup4: HTML parsing and content extraction
markdownify: HTML to markdown conversion

Customization

The server can be easily customized by modifying main.py:

Content Selectors: Modify the CSS selectors used to find main content
Rate Limiting: Adjust the minimum interval between requests
Timeout Settings: Change default and maximum timeout values
Content Filtering: Add custom content processing or filtering rules
Markdown Options: Customize markdown conversion settings

Testing the Server

Test the server directly:

uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown

Or with local installation:

cd mcp-fetch-as-markdown
uv run python main.py

For interactive testing, use the example script:

uv run python example.py

Troubleshooting

Common Issues

Import Errors: Run `uv sync` to make sure all dependencies are installed
Connection Timeouts: Some websites may be slow; try increasing the timeout parameter
Rate Limiting: The server enforces 1-second intervals between requests
Blocked Requests: Some websites may block automated requests; this is expected behavior

Debugging

Enable debug logging by modifying the logging level in main.py:

logging.basicConfig(level=logging.DEBUG)

Website Compatibility

Modern Websites: Works best with standard HTML structure
JavaScript-heavy Sites: The server does not execute JavaScript, so only the initial HTML is fetched and client-rendered content will be missing
Protected Content: Respects robots.txt and website access restrictions
Rate Limits: Spaces requests at least 1 second apart

Ethical Usage

This tool is designed for legitimate research, analysis, and content processing. Please:

Respect Terms of Service: Always check and comply with website terms of service
Avoid Overloading: The built-in rate limiting helps, but be mindful of request frequency
Attribution: Give proper credit to original sources when using extracted content
Legal Compliance: Ensure your use case complies with applicable laws and regulations

Contributing

This is a simple, single-file implementation designed for clarity and ease of modification. Feel free to:

Add support for additional content extraction strategies
Implement custom filtering for specific website types
Add caching for better performance
Extend with additional markdown formatting options

License

This project uses the same license as its dependencies. Content fetched from websites remains subject to the original website's terms of service and copyright.