# llmprobe

Probe LLM API endpoints. Measure TTFT, latency, throughput. Single binary, zero SDKs.
llmprobe is a CLI tool that probes LLM API endpoints and measures the metrics that matter for production reliability: time to first token (TTFT), total latency, generation throughput (tokens/sec), and error rates.
Use it as a one-off health check, a continuous monitor, or a CI gate that blocks deploys when your LLM provider is degraded.

## Quick start

Download a prebuilt binary from the latest release (Linux, macOS, Windows; amd64 and arm64), or install from source:

```sh
go install github.com/Jwrede/llmprobe@latest
```
Create a `probes.yml` (or copy the included example):

```yaml
providers:
  - name: openai
    api_key: ${OPENAI_API_KEY}
    models:
      - name: gpt-4o
        thresholds:
          max_ttft: 2s
      - name: gpt-4o-mini
        thresholds:
          max_ttft: 500ms
  - name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    models:
      - name: claude-sonnet-4-20250514
        thresholds:
          max_ttft: 1s
```
Run a probe:

```
$ llmprobe probe
Provider   Model                     Status    TTFT    Latency  Tok/s  Tokens  Error
--------   -----                     ------    ----    -------  -----  ------  -----
openai     gpt-4o                    healthy   312ms   2100ms   68.4   42
openai     gpt-4o-mini               healthy   98ms    814ms    112.3  56
anthropic  claude-sonnet-4-20250514  healthy   420ms   2831ms   52.1   38
azure      gpt-4o                    healthy   289ms   1950ms   71.2   44
bedrock    anthropic.claude-3-5...   degraded  1820ms  4510ms   28.1   38

4 healthy, 1 degraded, 0 errors
```
## What it measures

| Metric | What it means |
|---|---|
| TTFT | Time from request send to first content token. This is what users feel as "lag" before the response starts streaming. |
| Latency | Total time from request send to stream close. |
| Tok/s | Generation throughput: tokens produced per second after the first token, calculated as `token_count / (latency - ttft)`. |
| Tokens | Total output tokens. Prefers provider usage metadata when available; falls back to counting SSE events. |
| Status | `healthy` if all thresholds pass, `degraded` if any threshold is exceeded, `error` if the request failed. |
## Commands

### llmprobe probe

One-off health check: probes all configured endpoints and prints results.

```sh
llmprobe probe                        # table output
llmprobe probe -f json                # JSON output
llmprobe probe --fail-on degraded     # exit 1 if any endpoint is degraded
llmprobe probe -c custom-config.yml   # custom config path
```
Exit codes for CI:

| `--fail-on` | Exit 0 | Exit 1 |
|---|---|---|
| `error` (default) | healthy or degraded | any error |
| `degraded` | healthy only | degraded or error |
| `none` | always | never |
### llmprobe watch

Continuous monitoring: probes all endpoints on an interval and prints a summary line per iteration.

```sh
llmprobe watch                          # default 60s interval
llmprobe watch --interval 30s           # custom interval
llmprobe watch --tui                    # live terminal dashboard with TTFT chart
llmprobe watch --tui --load data.jsonl  # load historical data into the dashboard
llmprobe watch -f json                  # JSONL output (one line per result)
```

The `--tui` flag launches a live terminal dashboard with a TTFT chart, color legend, and statistics table. Use `--load` to import historical JSONL data (from `llmprobe watch -f json > data.jsonl`).

```
$ llmprobe watch --interval 30s
Watching 4 endpoints every 30s (Ctrl+C to stop)
[14:01:02] All 4 endpoints healthy.
[14:01:32] All 4 endpoints healthy.
[14:02:02] 3 healthy, 1 degraded, 0 errors. DEGRADED: openai/gpt-4o (TTFT 1820ms)
[14:02:32] All 4 endpoints healthy.
```
## CI integration

Use `llmprobe probe` as a pre-deploy gate:

```yaml
# .github/workflows/deploy.yml
- name: Check LLM providers
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    go install github.com/Jwrede/llmprobe@latest
    llmprobe probe --fail-on degraded
```

This blocks the deploy if any LLM provider is currently degraded.
## MCP server

llmprobe includes a built-in Model Context Protocol server, allowing Claude Code and other MCP hosts to check LLM API health directly from an agent workflow.

### Running the server

```sh
llmprobe mcp
```

This starts the MCP server over stdio.

### Registering with Claude Code

```sh
claude mcp add --transport stdio llmprobe -- llmprobe mcp
```

Once registered, Claude Code can call llmprobe tools during any conversation.
### Available tools

| Tool | Description |
|---|---|
| `probe_all` | Probe all configured endpoints from `probes.yml`. Returns TTFT, latency, throughput, and health status for every model. Accepts an optional `config` parameter for a custom config path. |
| `probe_model` | Probe a single model without a config file. Requires `provider` (`openai`, `anthropic`, `google`, `azure`, `bedrock`), `model` (the model identifier), and `api_key_env` (the env var holding the API key). |
| `list_providers` | List all providers and models in the config file with their thresholds. Use this to discover available models before probing. |
| `get_config` | Return the full parsed configuration, including defaults, providers, models, and thresholds. |

Example use case: an agent calls `list_providers` to see which models are configured, then `probe_all` to verify they are healthy before deploying changes.
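As a rough illustration, a `probe_model` invocation over stdio would be an ordinary MCP `tools/call` request. The argument names come from the table above; the surrounding JSON-RPC envelope is the generic MCP shape, not something specific to llmprobe:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "probe_model",
    "arguments": {
      "provider": "openai",
      "model": "gpt-4o-mini",
      "api_key_env": "OPENAI_API_KEY"
    }
  }
}
```

In practice the MCP host (e.g. Claude Code) constructs this message for you; it is shown here only to make the tool contract concrete.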
## Configuration

```yaml
defaults:
  prompt: "Hello"      # probe prompt
  max_tokens: 20       # max output tokens
  timeout: 30s         # per-probe timeout
  concurrency: 5       # max parallel probes

providers:
  - name: openai                  # openai, anthropic, google, azure, bedrock
    api_key: ${OPENAI_API_KEY}    # env var expansion
    base_url: https://custom.api  # optional, override endpoint
    models:
      - name: gpt-4o
        prompt: "Say hello."      # override default prompt
        max_tokens: 10            # override default max_tokens
        thresholds:
          max_ttft: 2s            # alert if TTFT exceeds this
          max_latency: 10s        # alert if total latency exceeds this
          min_tokens_per_sec: 20  # alert if throughput drops below this

  - name: azure
    api_key: ${AZURE_OPENAI_API_KEY}
    base_url: https://your-resource.openai.azure.com
    api_version: "2024-10-21"     # optional, defaults to 2024-10-21
    models:
      - name: gpt-4o              # deployment name

  - name: bedrock
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
    region: us-east-1
    models:
      - name: anthropic.claude-3-5-sonnet-20241022-v2:0
```
API keys and AWS credentials support `${ENV_VAR}` syntax. Only credential fields are expanded, so env var references in prompts or model names are left as-is.
## OpenAI-compatible providers

Many providers (Groq, Together AI, Fireworks, DeepSeek, Mistral, OpenRouter, Ollama, vLLM) expose an OpenAI-compatible API. These work out of the box by setting `base_url`:

```yaml
providers:
  # Groq
  - name: openai
    api_key: ${GROQ_API_KEY}
    base_url: https://api.groq.com/openai
    models:
      - name: llama-3.3-70b-versatile

  # DeepSeek
  - name: openai
    api_key: ${DEEPSEEK_API_KEY}
    base_url: https://api.deepseek.com
    models:
      - name: deepseek-chat

  # Together AI
  - name: openai
    api_key: ${TOGETHER_API_KEY}
    base_url: https://api.together.xyz
    models:
      - name: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo

  # Local Ollama
  - name: openai
    api_key: unused
    base_url: http://localhost:11434/v1
    models:
      - name: llama3.2
```
## Architecture

```
probes.yml
  -> Config loader (YAML + env var expansion)
  -> Probe engine (concurrent goroutines per provider/model)
  -> Provider clients (raw HTTP + SSE parsing, no SDKs)
  -> Results (TTFT, latency, tokens/sec, status)
  -> Output (table, JSON, JSONL)
```
Each provider client is a thin HTTP wrapper that sends a streaming request and parses the response. No LLM SDKs are imported. The SSE parser handles both data-only events (OpenAI, Google) and named events (Anthropic). The Bedrock client implements SigV4 signing and AWS binary event stream parsing from scratch.
TTFT is measured from the moment the HTTP request is sent to the first event that contains actual content text (not role assignments or metadata).
## Providers

| Provider | Endpoint | Auth | Streaming format |
|---|---|---|---|
| OpenAI | `/v1/chat/completions` | `Authorization: Bearer` | SSE, `[DONE]` sentinel |
| Anthropic | `/v1/messages` | `x-api-key` header | named-event SSE |
| Google | `/v1beta/models/{model}:streamGenerateContent?alt=sse` | `key` query param | SSE |
| Azure OpenAI | `/openai/deployments/{model}/chat/completions` | `api-key` header | SSE, `[DONE]` sentinel |
| AWS Bedrock | `/model/{model}/converse-stream` | SigV4 | AWS binary event stream |
| OpenAI-compat | `/v1/chat/completions` (custom `base_url`) | `Authorization: Bearer` | SSE |

OpenAI-compatible covers Groq, Together AI, Fireworks, DeepSeek, Mistral, OpenRouter, Ollama, vLLM, and any endpoint that speaks the OpenAI chat completions API.
## Live benchmark
llm-bench uses llmprobe to run a continuous public benchmark of major LLM APIs. Results are published as an open JSONL dataset and a live terminal dashboard at bench.jonathanwrede.de.
## Roadmap

- Baseline tracking: store rolling percentiles, alert when the current probe exceeds N× the baseline
- OpenTelemetry metric export for integration with Grafana/Datadog
- Prometheus `/metrics` endpoint
- Structured output validation: verify that JSON-mode responses parse correctly
## License
MIT