PromptThrift MCP

Smart token compression for LLM apps. Save 70-90% on API costs with Gemma 4 local compression, multi-model cost tracking, and intelligent model routing.


License: MIT · Python 3.10+ · MCP Compatible · Gemma 4

If this saves you money, star this repo!

The Problem

Every LLM API call resends your entire conversation history, so you pay for the same old messages over and over. By turn 20, a single call can cost roughly 19x more than the first:

Turn 1:  ████ 700 tokens ($0.002)
Turn 5:  ████████████████ 4,300 tokens ($0.013)
Turn 20: ████████████████████████████████████████ 12,500 tokens ($0.038)
                                              ↑ You're paying for THIS every call
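The growth above can be sketched in a few lines. This is an illustrative model, not PromptThrift code: `base` and `per_turn` are assumed round numbers, not the exact figures from the chart.

```python
def tokens_at_turn(turn, base=500, per_turn=600):
    """Input tokens sent on one call: a fixed system prompt plus every prior turn."""
    return base + turn * per_turn

def total_billed(turns, base=500, per_turn=600):
    """Input tokens billed across a whole session.

    Because each call resends all history, the total grows quadratically
    with the number of turns, not linearly.
    """
    return sum(tokens_at_turn(t, base, per_turn) for t in range(1, turns + 1))
```

Under these assumptions a 20-turn session bills 136,000 input tokens in total, even though no single call exceeds 12,500.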

The Solution

PromptThrift is an MCP server with 4 tools to slash your API costs:

| Tool | What it does | Impact |
|---|---|---|
| `promptthrift_compress_history` | Compress old turns into a smart summary | 50-90% fewer input tokens |
| `promptthrift_count_tokens` | Track token usage & costs across 14 models | Know where the money goes |
| `promptthrift_suggest_model` | Recommend the cheapest model for the task | 60-80% saved on simple tasks |
| `promptthrift_pin_facts` | Pin critical facts that survive compression | Never lose key context |

Why PromptThrift?

| | PromptThrift | Context Mode | Headroom |
|---|---|---|---|
| License | MIT (commercial OK) | ELv2 (no competing) | Apache 2.0 |
| Compression type | Conversation memory | Tool schema virtualization | Tool output |
| Local LLM support | Gemma 4 via Ollama | No | No |
| Cost tracking | Multi-model comparison | No | No |
| Model routing | Built-in | No | No |
| Pinned facts | Never-Compress List | No | No |

Quick Start

Install

Option A: pip install (recommended)

pip install git+https://github.com/woling-dev/promptthrift-mcp.git

Option B: clone and install

git clone https://github.com/woling-dev/promptthrift-mcp.git
cd promptthrift-mcp
pip install -e .

Optional: Enable Gemma 4 Compression

For smarter AI-powered compression (free, runs locally):

# Install Ollama: https://ollama.com
ollama pull gemma4:e4b

PromptThrift auto-detects Ollama: if it's running, Gemma 4 handles compression; if not, PromptThrift falls back to fast heuristic compression. Zero config needed.
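The auto-detection described above amounts to probing the Ollama endpoint and falling back when it doesn't answer. A minimal sketch of that pattern, assuming the default localhost port (function names are illustrative, not PromptThrift's actual API):

```python
import urllib.request

def ollama_available(url="http://localhost:11434", timeout=0.5):
    """Probe the Ollama HTTP endpoint; True if a server answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False

def choose_backend(probe=ollama_available):
    """Use Gemma 4 via Ollama when the probe succeeds, else the heuristic fallback."""
    return "gemma4" if probe() else "heuristic"
```

Injecting the probe as a parameter keeps the decision testable without a live Ollama server.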

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}

Cursor / Windsurf

Add to your MCP settings:

{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}

Real-World Example

An AI coding assistant debugging a complex issue over 30+ turns:

Before compression (sent every API call):

User: My Next.js app throws a hydration error on the /dashboard page.
Asst: That usually means server and client HTML don't match. Can you share the component?
User: [pastes 50 lines of DashboardLayout.tsx]
Asst: I see the issue: you're using `new Date()` directly in render, which differs
      between server and client. Let me also check your data fetching...
User: I also get a warning about useEffect running twice.
Asst: That's React 18 Strict Mode. Not related to hydration. Let me trace the real bug...
User: Wait, there's also a flash of unstyled content on first load.
Asst: That's a separate CSS loading order issue. Let me address both...
      [... 25 more turns of debugging, trying fixes, checking logs ...]
User: OK it's fixed now! But I want to add dark mode next.
Asst: Great! For dark mode with Next.js + Tailwind, here are three approaches...

~8,500 tokens after 30 turns, and growing with every single API call

After Gemma 4 compression:

[Compressed history]
Resolved Next.js hydration error in DashboardLayout.tsx caused by
Date() in render (fixed with useEffect). Unrelated: React 18 Strict Mode
double-fire (expected), CSS flash (fixed via loading order).
User now wants to add dark mode to Next.js + Tailwind app.
[End compressed history]

[Recent turns preserved, last 4 turns intact]

~1,200 tokens. 86% saved on every subsequent call

Cost impact at scale (Claude Sonnet @ $3/MTok):

| Scenario | Without PromptThrift | With PromptThrift | Monthly Savings |
|---|---|---|---|
| 1 dev, 20 sessions/day | $5.10/mo | $0.72/mo | $4.38 |
| Team of 10 devs | $51/mo | $7.20/mo | $43.80 |
| Customer service bot (500 chats/day) | $255/mo | $36/mo | $219 |
| AI agent platform (5K sessions/day) | $2,550/mo | $357/mo | $2,193 |
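The per-call arithmetic behind these figures is simple: tokens divided by one million, times the price per MTok. Checking the example above (8,500 tokens before vs. 1,200 after, at Claude Sonnet's $3/MTok input rate):

```python
def input_cost(tokens, price_per_mtok):
    """Input cost in dollars: tokens / 1e6 * price per million tokens."""
    return tokens / 1_000_000 * price_per_mtok

before = input_cost(8_500, 3.00)   # ≈ $0.0255 per call
after = input_cost(1_200, 3.00)    # ≈ $0.0036 per call
savings = 1 - after / before       # ≈ 0.86, matching the 86% figure above
```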

Pinned Facts (Never-Compress List)

Some facts must never be lost during compression: user names, critical preferences, key decisions. Pin them:

You: "Pin the fact that this customer is allergic to nuts"

→ promptthrift_pin_facts(action="add", facts=["Customer is allergic to nuts"])
→ This fact will appear in ALL future compressed summaries, guaranteed.
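Conceptually, a Never-Compress List is just a deduplicated store whose contents get injected at the top of every summary. A sketch of the idea (the class and method names are invented for illustration; they are not PromptThrift's internal API):

```python
class PinnedFacts:
    """Facts that must survive every compression pass (a Never-Compress List)."""

    def __init__(self):
        self._facts = []

    def add(self, *facts):
        for fact in facts:
            if fact not in self._facts:   # keep the list deduplicated
                self._facts.append(fact)

    def remove(self, fact):
        self._facts.remove(fact)

    def prepend_to(self, summary):
        """Inject every pinned fact at the top of a compressed summary."""
        if not self._facts:
            return summary
        pinned = "\n".join(f"- {f}" for f in self._facts)
        return f"[Pinned facts]\n{pinned}\n{summary}"
```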

Supported Models (April 2026 pricing)

| Model | Input $/MTok | Output $/MTok | Local? |
|---|---|---|---|
| gemma-4-e2b | $0.00 | $0.00 | Ollama |
| gemma-4-e4b | $0.00 | $0.00 | Ollama |
| gemma-4-27b | $0.00 | $0.00 | Ollama |
| gemini-2.0-flash | $0.10 | $0.40 | |
| gpt-4.1-nano | $0.10 | $0.40 | |
| gpt-4o-mini | $0.15 | $0.60 | |
| gemini-2.5-flash | $0.15 | $0.60 | |
| gpt-4.1-mini | $0.40 | $1.60 | |
| claude-haiku-4.5 | $1.00 | $5.00 | |
| gemini-2.5-pro | $1.25 | $10.00 | |
| gpt-4.1 | $2.00 | $8.00 | |
| gpt-4o | $2.50 | $10.00 | |
| claude-sonnet-4.6 | $3.00 | $15.00 | |
| claude-opus-4.6 | $5.00 | $25.00 | |
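Model routing over a table like this reduces to estimating the cost of a call shape against each candidate and picking the minimum. A sketch using a subset of the prices above (the dict and function names are illustrative, not the server's internals):

```python
# Subset of the pricing table above: (input, output) in $/MTok.
PRICES = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1": (2.00, 8.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call at the listed per-MTok rates."""
    inp, outp = PRICES[model]
    return (input_tokens * inp + output_tokens * outp) / 1_000_000

def cheapest(input_tokens, output_tokens, candidates=PRICES):
    """Candidate with the lowest estimated cost for this call shape."""
    return min(candidates, key=lambda m: estimate_cost(m, input_tokens, output_tokens))
```

A real router would also weigh task difficulty, which is why `promptthrift_suggest_model` only saves money on tasks a cheaper model can actually handle.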

How It Works

Before (every API call sends ALL of this):
┌──────────────────────────────────┐
│ System prompt      (500 tokens)  │
│ Turn 1: user+asst  (600 tokens)  │  ← Repeated every call
│ Turn 2: user+asst  (600 tokens)  │  ← Repeated every call
│ ...                              │
│ Turn 8: user+asst  (600 tokens)  │  ← Repeated every call
│ Turn 9: user+asst  (new)         │
│ Turn 10: user      (new)         │
└──────────────────────────────────┘
Total: ~6,500 tokens per call

After PromptThrift compression:
┌──────────────────────────────────┐
│ System prompt      (500 tokens)  │
│ [Pinned facts]      (50 tokens)  │  ← Always preserved
│ [Compressed summary](200 tokens) │  ← Turns 1-8 in 200 tokens!
│ Turn 9: user+asst  (kept)        │
│ Turn 10: user      (kept)        │
└──────────────────────────────────┘
Total: ~1,750 tokens per call (73% saved!)

Compression Modes

| Mode | Method | Quality | Speed | Cost |
|---|---|---|---|---|
| Heuristic | Rule-based extraction | Good (50-60% reduction) | Instant | Free |
| LLM (Gemma 4) | AI-powered understanding | Excellent (70-90% reduction) | ~10-15s | Free (local) |

PromptThrift automatically uses the best available method. Install Ollama + Gemma 4 for maximum compression quality.
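To make the heuristic mode concrete: rule-based compression typically keeps recent turns verbatim, drops filler messages, and truncates older turns. The sketch below is a deliberately naive illustration of that pattern, not the rule set PromptThrift actually ships (the filler list and function name are invented):

```python
FILLER = {"thanks!", "thank you!", "ok", "great!", "hi", "hello"}

def heuristic_compress(turns, keep_recent=4):
    """Keep the last `keep_recent` turns verbatim; reduce older turns to
    their first sentence and drop pure filler messages."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = []
    for role, text in old:
        if text.strip().lower() in FILLER:
            continue                           # greetings carry no context
        first_sentence = text.split(". ")[0].rstrip(".")
        summary.append(f"{role}: {first_sentence}.")
    compressed = ("[Compressed history]\n" + "\n".join(summary)
                  + "\n[End compressed history]")
    return [("system", compressed)] + recent
```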

When Does Compression Shine?

Compression effectiveness scales with conversation length and redundancy:

| Conversation Length | Typical Reduction | Best For |
|---|---|---|
| Short (< 5 turns, mostly technical) | 15-25% | Minimal savings: keep as-is |
| Medium (10-20 turns, mixed chat) | 50-70% | Sweet spot: clear cost reduction |
| Long (30+ turns, debugging/iterating) | 70-90% | Massive savings: compress early and often |

Why? Short, dense conversations have little filler to remove. Longer conversations accumulate greetings, repeated context, exploratory dead-ends, and verbose explanations, and those are exactly what the compressor strips away. A 30-turn debugging session with code snippets, back-and-forth troubleshooting, and final resolution compresses dramatically because only the conclusion and key decisions matter for future context.

Rule of thumb: Start compressing after 8-10 turns for best results.
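The table and rule of thumb above can be folded into a small policy. The band midpoints and the boundary choices for the undocumented 5-9 turn range are my assumptions, not measured values:

```python
def expected_reduction(turn_count):
    """Rough reduction bands from the table above (midpoints, illustrative)."""
    if turn_count < 5:
        return 0.20   # short: 15-25%
    if turn_count < 21:
        return 0.60   # medium: 50-70%
    return 0.80       # long: 70-90%

def should_compress(turn_count, threshold=8):
    """Rule of thumb: start compressing after 8-10 turns."""
    return turn_count >= threshold
```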

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `PROMPTTHRIFT_OLLAMA_MODEL` | No | `gemma4:e4b` | Ollama model for LLM compression |
| `PROMPTTHRIFT_OLLAMA_URL` | No | `http://localhost:11434` | Ollama API endpoint |
| `PROMPTTHRIFT_DEFAULT_MODEL` | No | `claude-sonnet-4.6` | Default model for cost estimates |
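Resolving these variables with their documented defaults is a one-liner each; `load_config` below is an illustrative helper, not a function exported by the server:

```python
import os

def load_config(env=os.environ):
    """Resolve PromptThrift settings, falling back to the documented defaults."""
    return {
        "ollama_model": env.get("PROMPTTHRIFT_OLLAMA_MODEL", "gemma4:e4b"),
        "ollama_url": env.get("PROMPTTHRIFT_OLLAMA_URL", "http://localhost:11434"),
        "default_model": env.get("PROMPTTHRIFT_DEFAULT_MODEL", "claude-sonnet-4.6"),
    }
```

Passing the environment as a parameter makes the resolution easy to test with a plain dict.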

Security

  • All data processed locally by default. Nothing leaves your machine
  • Ollama compression runs 100% on your hardware
  • Post-compression sanitizer strips prompt injection patterns from summaries
  • API keys read from environment variables only, never hardcoded
  • No persistent storage, no telemetry, no third-party calls
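A post-compression sanitizer like the one mentioned above can be approximated by dropping summary sentences that match known injection phrasings. The patterns below are examples only; the shipped sanitizer's rule set is not documented here:

```python
import re

# Example injection phrasings (illustrative, not the server's actual rule set).
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now .{0,40}",
    r"reveal .{0,20}system prompt",
]

def sanitize_summary(summary):
    """Drop sentences that match known prompt-injection phrasings."""
    clean = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        if any(re.search(p, sentence, re.IGNORECASE) for p in INJECTION_PATTERNS):
            continue
        clean.append(sentence)
    return " ".join(clean)
```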

Roadmap

  • Heuristic conversation compression
  • Multi-model token counting (14 models)
  • Intelligent model routing
  • Gemma 4 local LLM compression via Ollama
  • Pinned facts (Never-Compress List)
  • Post-compression security sanitizer
  • Cloud-based compression (Anthropic/OpenAI API fallback)
  • Prompt caching optimization advisor
  • Web dashboard for usage analytics
  • VS Code extension

Contributing

PRs welcome! This project is MIT-licensed. Fork it, improve it, ship it.

About BrandDefender.ai

BrandDefender.ai is the product line of Wolin Global Media (沃嶺國際媒體), a Taiwan-based AI infrastructure studio helping brands get discovered, understood, and recommended by AI systems.

What we build

🔍 AEO Consulting (Answer Engine Optimization) Get your brand correctly cited by ChatGPT, Gemini, Perplexity, and Claude. We implement JSON-LD schema, optimize content structure, and monitor AI search presence for Taiwan food, tea, beauty, and lifestyle brands.

💬 AI Customer Service (LINE Bot) Production-grade LINE chatbots with 3-layer memory, admin takeover, and Supabase backend. Already serving real brands in retail and F&B.

🧠 AI Memory MCP Infrastructure Open-source MCP servers for Claude Code, Cursor, and LLM builders. Local-first, privacy-preserving, built to save API cost.

Contact

Taiwan brands looking for an AEO audit: we offer a full ChatGPT / Gemini / Perplexity scan, JSON-LD fixes, and monthly monitoring. Reach us directly via Email or LINE.

License

MIT License. Free for personal and commercial use.


© 2026 Wolin Global Media (沃嶺國際媒體).

Star this repo if it saves you money!
