BerryRAG
A local RAG system with Playwright MCP integration for Claude and OpenAI embeddings, using local storage.
🍓 BerryRAG: Local Vector Database with Playwright MCP Integration
A complete local RAG (Retrieval-Augmented Generation) system that integrates Playwright MCP web scraping with vector database storage for Claude.
✨ Features
- Zero-cost self-hosted vector database
- Playwright MCP integration for automated web scraping
- Multiple embedding providers (sentence-transformers, OpenAI, fallback)
- Smart content processing with quality filters
- Claude-optimized context formatting
- MCP server for direct Claude integration
- Command-line tools for manual operation
🚀 Quick Start
1. Installation
git clone https://github.com/berrydev-ai/berry-rag.git
cd berry-rag
# Install dependencies
npm run install-deps
# Setup directories and instructions
npm run setup
2. Configure Claude Desktop
Add the following to your claude_desktop_config.json, replacing the cwd value with the absolute path to your own clone of the repository:
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["@playwright/mcp@latest"]
},
"berry-rag": {
"command": "node",
"args": ["mcp_servers/vector_db_server.js"],
"cwd": "/Users/eberry/BerryDev/berry-rag"
}
}
}
3. Start Using
# Example workflow:
# 1. Scrape with Playwright MCP through Claude
# 2. Process into vector DB
npm run process-scraped
# 3. Search your knowledge base
npm run search "React hooks"
📁 Project Structure
berry-rag/
├── src/ # Python source code
│ ├── rag_system.py # Core vector database system
│ └── playwright_integration.py # Playwright MCP integration
├── mcp_servers/ # MCP server implementations
│ └── vector_db_server.ts # TypeScript MCP server
├── storage/ # Vector database storage
│ ├── documents.db # SQLite metadata
│ └── vectors/ # NumPy embedding files
├── scraped_content/ # Playwright saves content here
└── dist/ # Compiled TypeScript
🔧 Commands
Streamlit Web Interface
Launch the web interface for easy interaction with your RAG system:
# Start the Streamlit web interface
python run_streamlit.py
# Or directly with streamlit
streamlit run streamlit_app.py
The web interface provides:
- 🔍 Search: Interactive document search with similarity controls
- 📄 Context: Generate formatted context for AI assistants
- ➕ Add Document: Upload files or paste content directly
- 📚 List Documents: Browse your document library
- 📊 Statistics: System health and performance metrics
NPM Scripts
| Command | Description |
|---|---|
| npm run install-deps | Install all dependencies |
| npm run setup | Initialize directories and instructions |
| npm run build | Compile TypeScript MCP server |
| npm run process-scraped | Process scraped files into vector DB |
| npm run search | Search the knowledge base |
| npm run list-docs | List all documents |
Python CLI
# RAG System Operations
python src/rag_system.py search "query"
python src/rag_system.py context "query" # Claude-formatted
python src/rag_system.py add <url> <title> <file>
python src/rag_system.py list
python src/rag_system.py stats
# Playwright Integration
python src/playwright_integration.py process
python src/playwright_integration.py setup
python src/playwright_integration.py stats
🤖 Usage with Claude
1. Scraping Documentation
"Use Playwright to scrape the React hooks documentation from https://react.dev/reference/react and save it to the scraped_content directory"
2. Processing into Vector Database
"Process all new scraped files and add them to the BerryRAG vector database"
3. Querying Knowledge Base
"Search the BerryRAG database for information about React useState best practices"
"Get context from the vector database about implementing custom hooks"
🔌 MCP Tools Available to Claude
BerryRAG provides two powerful MCP servers for Claude integration:
Vector DB Server Tools
- add_document: Add content directly to vector DB
- search_documents: Search for similar content
- get_context: Get formatted context for queries
- list_documents: List all stored documents
- get_stats: Vector database statistics
- process_scraped_files: Process Playwright scraped content
- save_scraped_content: Save content for later processing
BerryExa Server Tools
- crawl_content: Advanced web content extraction with subpage support
- extract_links: Extract internal links for subpage discovery
- get_content_preview: Quick content preview without full processing
📖 For complete MCP setup and usage guide, see BERRY_MCP.md
🧠 Embedding Providers
The system supports multiple embedding providers with automatic fallback:
- sentence-transformers (recommended, free, local)
- OpenAI embeddings (requires an API key; set OPENAI_API_KEY)
- Simple hash-based (fallback, not recommended for production)
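As a rough illustration of how automatic fallback between providers could work, here is a minimal sketch. The function names `pick_provider` and `hash_embedding`, the 64-dimension default, and the order in which providers are tried are all assumptions made for this example; they are not taken from src/rag_system.py.

```python
import hashlib
import importlib.util
import math
import os
from typing import List


def pick_provider() -> str:
    """Return the first usable provider (the order is an assumption)."""
    # Only checks that the package is installed; it does not load a model.
    if importlib.util.find_spec("sentence_transformers") is not None:
        return "sentence-transformers"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    return "hash"


def hash_embedding(text: str, dim: int = 64) -> List[float]:
    """Hypothetical hash-based fallback: count tokens per hashed bucket,
    then L2-normalize. It captures word overlap, not meaning."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

A hashed bag-of-words vector like this only matches documents that share literal tokens with the query, which is one reason a fallback of this kind is a poor fit for production use.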
⚙️ Configuration
Environment Variables
# Optional: for OpenAI embeddings
export OPENAI_API_KEY=your_key_here
Content Quality Filters
The system automatically filters out:
- Content shorter than 100 characters
- Navigation-only content
- Repetitive/duplicate content
- Files larger than 500KB
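The filters listed above could be approximated as follows. This is a sketch under stated assumptions: the function name `passes_quality_filters`, the reading of 500KB as 500 * 1024 bytes, and the heuristics for "navigation-only" (mostly very short lines) and "repetitive" (few unique lines) are illustrative guesses, not the project's actual rules.

```python
from typing import Optional

MIN_CHARS = 100              # from the list above
MAX_BYTES = 500 * 1024       # assumes 500KB means 500 * 1024 bytes


def passes_quality_filters(text: str, size_bytes: Optional[int] = None) -> bool:
    """Return True if the content survives all four (assumed) filters."""
    if size_bytes is None:
        size_bytes = len(text.encode("utf-8"))
    if size_bytes > MAX_BYTES:
        return False

    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False

    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]

    # Assumed heuristic: a page made almost entirely of 1-3 word lines
    # is treated as a navigation menu.
    short_lines = sum(1 for ln in lines if len(ln.split()) <= 3)
    if short_lines / len(lines) > 0.8:
        return False

    # Assumed heuristic: fewer than half of the lines being unique
    # marks the content as repetitive.
    if len(lines) >= 5 and len(set(lines)) / len(lines) < 0.5:
        return False

    return True
```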
Chunking Strategy
- Default chunk size: 500 characters
- Overlap: 50 characters
- Smart boundary detection (sentences, paragraphs)
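To make the chunking parameters concrete, here is a minimal sketch of fixed-size chunking with overlap and a simple boundary preference. The function name `chunk_text` and the rule "cut at the latest paragraph or sentence boundary found in the second half of the window" are assumptions for illustration; the real boundary detection may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most chunk_size characters,
    with consecutive chunks sharing roughly `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            window = text[start:end]
            # Latest paragraph break or sentence end inside the window.
            cut = max(window.rfind("\n\n"), window.rfind(". "))
            # Only use it if it does not shrink the chunk below half size.
            if cut > chunk_size // 2:
                end = start + cut + 1
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break
        # Step back by the overlap, but always move forward.
        start = max(end - overlap, start + 1)
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears with some of its surrounding context in the next chunk, at the cost of storing a little duplicated text.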
📊 Monitoring
Check System Status
# Vector database statistics
python src/rag_system.py stats
# Processing status
python src/playwright_integration.py stats
# View recent documents
python src/rag_system.py list
Storage Information
- Database: storage/documents.db (SQLite metadata)
- Vectors: storage/vectors/ (NumPy arrays)
- Scraped Content: scraped_content/ (Markdown files)
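The split between SQLite metadata and separately stored vectors suggests a search that ranks vectors by similarity and then looks up metadata for the best matches. The sketch below shows that idea only. The `documents` table with `id`, `title`, and `url` columns, the in-memory `vectors` dictionary, and the `search` function are hypothetical, and plain Python lists stand in for the NumPy arrays so the example needs no third-party packages.

```python
import math
import sqlite3
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    conn: sqlite3.Connection,
    vectors: Dict[str, List[float]],
    query_vec: Sequence[float],
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Rank stored vectors against the query, then fetch metadata
    for the top_k documents from the (hypothetical) documents table."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()),
        reverse=True,
    )[:top_k]
    results = []
    for score, doc_id in scored:
        row = conn.execute(
            "SELECT title, url FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is not None:
            results.append((row[0], row[1], score))
    return results
```

A brute-force scan like this is linear in the number of stored vectors, which is usually acceptable for a personal knowledge base.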
🔍 Example Workflows
Academic Research
- Scrape research papers with Playwright
- Process into vector database
- Query for specific concepts across all papers
Documentation Management
- Scrape API documentation from multiple sources
- Build unified searchable knowledge base
- Get contextual answers about implementation details
Content Aggregation
- Scrape blog posts and articles
- Create topic-based knowledge clusters
- Find related content across sources
🛠️ Development
Building the MCP Server
npm run build
Running in Development Mode
npm run dev # TypeScript watch mode
Testing
# Test RAG system
python src/rag_system.py stats
# Test integration
python src/playwright_integration.py setup
# Test MCP server
node mcp_servers/vector_db_server.js
🚨 Troubleshooting
Common Issues
Python dependencies missing:
pip install -r requirements.txt
TypeScript compilation errors:
npm install
npm run build
Embedding model download slow: The first run downloads the sentence-transformers model (~90MB), so a slow first start is normal.
No results from search:
- Check if documents were processed: python src/rag_system.py list
- Verify content quality filters aren't too strict
- Try broader search terms
Logs and Debugging
- Python logs: Check console output
- MCP server logs: Stderr output
- Processing status: scraped_content/.processed_files.json
📝 License
MIT License - feel free to modify and extend for your needs.
🤝 Contributing
This is a personal project for Eric Berry, but feel free to fork and adapt for your own use cases.
Happy scraping and searching! 🕷️🔍✨