x402-vision-cropper MCP Server

MCP server that lets AI agents crop screenshots to exact pixel regions before vision LLM inference, paying 0.0005 USDC per crop on Base L2 via x402 — no API key required.

Documentation

x402 AI Vision Sub-Element Cropper

llms.txt — machine-readable service contract for autonomous AI agents

Spec: https://llmstxt.org

Stateless image cropping API for AI agents. Submit a base64 screenshot and pixel bounding-box coordinates, receive a cropped sub-element as base64 PNG. Gated by a 0.0005 USDC micropayment on Base L2. Designed to reduce vision LLM token costs by isolating only the screen region of interest before inference.

Service Identity

  • Name: x402 AI Vision Sub-Element Cropper
  • Version: 1.0.0
  • Protocol: HTTP/1.1, JSON
  • Authentication: x402 micropayment (no API keys, no accounts)
  • Statefulness: Stateless — no session, no stored data, no database

Payment Protocol

This API uses the HTTP 402 Payment Required pattern for autonomous machine-to-machine payments.

Step 1 — Trigger the challenge

Make any request to POST /crop without a payment header. You will receive a 402 response containing payment instructions in both the response body and HTTP headers.

Step 2 — Pay on-chain

Transfer exactly the required USDC amount to the recipient wallet on the specified network. The required values are machine-readable in the 402 response headers:

  • x-payment-price-usdc: amount in USDC (e.g. 0.0005)
  • x-payment-recipient: destination wallet address
  • x-payment-network: network name (e.g. base)
  • x-payment-chain-id: EVM chain ID (e.g. 8453 for Base mainnet)
  • x-payment-token: always USDC
  • x-payment-token-contract: USDC contract address on that network
  • x-payment-submit-header: the header name to use when resubmitting

Step 3 — Resubmit with proof

Retry the identical POST /crop request, adding the transaction hash header:

x-payment-tx-hash: 0x

The server verifies the transaction on-chain (checks receipt status, USDC Transfer log, recipient address, and amount), then executes the crop and returns the result.

Payment rules

  • One payment per request. Each tx hash is single-use within a 60-second window.
  • Do not reuse tx hashes across concurrent requests — the server will reject duplicates.
  • If the server returns a 5xx error after accepting your payment, the tx hash is released and you may retry with the same hash.
  • If the server returns a 4xx error after accepting your payment (e.g. invalid coordinates), the tx hash is consumed. Pay again for a new request.

Endpoints

POST /crop

Crops a rectangular region from a base64-encoded image.

Request headers:

  • Content-Type: application/json (required)
  • x-payment-tx-hash: 0x (required after payment)

Request body (JSON):

  • image (string, required): Base64-encoded source image. Do NOT include a data URI prefix (no "data:image/png;base64," — strip it first). Supports PNG, JPEG, WebP, AVIF, TIFF, GIF.
  • x (integer, required): Left edge of crop region in pixels. Must be >= 0.
  • y (integer, required): Top edge of crop region in pixels. Must be >= 0.
  • width (integer, required): Width of crop region in pixels. Must be >= 1.
  • height (integer, required): Height of crop region in pixels. Must be >= 1.

Coordinates map directly to getBoundingClientRect() output from browser automation tools. Integer truncation is applied if floats are passed.

Success response (200):

{
  "success": true,
  "data": {
    "base64": "<PNG image as base64 string>",
    "mime": "image/png",
    "width": 640,
    "height": 80,
    "bytes": 14821,
    "clamped": false
  },
  "meta": {
    "tx_hash": "0x...",
    "crop_input": { "x": 120, "y": 45, "width": 640, "height": 80 }
  }
}

Output is always PNG (lossless). Suitable for direct embedding in vision LLM prompts as a base64 data URI: prepend "data:image/png;base64," to the returned base64 string.

The "clamped" field is true if the requested crop region extended beyond the image boundaries and was automatically reduced to fit. Recalibrate your coordinate source if you receive clamped: true.

402 response (no payment or invalid payment):

{
  "error": "Payment Required",
  "code": 402,
  "payment": {
    "amount_usdc": "0.0005",
    "recipient_wallet": "0x...",
    "network": "base",
    "chain_id": 8453,
    "token": "USDC",
    "token_contract": "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913",
    "instructions": "Transfer 0.0005 USDC to 0x... on base (chain 8453), then retry with the transaction hash in the 'x-payment-tx-hash' header."
  }
}

Error responses:

  • 400: Malformed request body or invalid image data
  • 402: Payment required, invalid, or replayed
  • 413: Request body exceeds size limit (~14 MB)
  • 422: Crop region is entirely outside image bounds
  • 500: Server-side processing error (tx hash is released; safe to retry)

GET /health

Liveness check. No payment required.

Response (200):

{
  "status": "ok",
  "service": "x402-vision-cropper",
  "ts": "2025-01-01T00:00:00.000Z",
  "replay_guard": { "tracked_hashes": 0 }
}

Agent Integration Pattern

1. GET /health                          → confirm service is live
2. POST /crop (no payment header)       → receive 402 + payment.instructions
3. Parse x-payment-recipient, x-payment-price-usdc, x-payment-chain-id
4. Execute USDC transfer on Base L2
5. Wait for transaction confirmation (1+ block)
6. POST /crop (same body + x-payment-tx-hash header) → receive cropped PNG
7. Prepend "data:image/png;base64," to data.base64
8. Pass to vision LLM as image input

Recommended Use Cases

  • OCR on specific UI elements (buttons, labels, input fields, table cells)
  • Visual regression testing on isolated components
  • Extracting price, stock, or data fields from financial dashboards
  • Reading CAPTCHA sub-regions before passing to specialist solvers
  • Any task where a full-screenshot vision call is wasteful or inaccurate

Constraints and Limits

  • Maximum image input: ~10 MB decoded (14 MB base64)
  • Maximum crop dimension: 4000px per side
  • Concurrent requests: optimised for 10–13 simultaneous requests
  • No persistent storage: all data is discarded after each response
  • Replay window: tx hashes are locked for 60 seconds, max 1000 tracked simultaneously
  • Output format: always PNG, regardless of input format

Network Details

  • Payment network: Base (Base mainnet, chain ID 8453)
  • USDC contract on Base: 0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913
  • Gas: near-zero (Base L2)
  • Settlement: typically 1–2 seconds

Optional: Full llms-full.txt

See /llms-full.txt for extended examples including multi-step agent pseudocode, error recovery flows, and coordinate extraction patterns from common browser automation frameworks (Playwright, Puppeteer, Selenium).