ConKurrence

One command. Find out if your AI agrees with itself.

ConKurrence is a statistically validated consensus measurement toolkit for AI evaluation pipelines. It uses multiple AI models as independent raters, measures inter-rater reliability with Fleiss' kappa and bootstrap confidence intervals, and routes contested items to human experts.
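For intuition, here is a minimal TypeScript sketch of the statistics involved. This is illustrative only, not ConKurrence's internal implementation: counts[i][j] holds how many raters assigned item i to category j, and the confidence interval comes from a simple percentile bootstrap over items.

// Illustrative sketch of Fleiss' kappa, not ConKurrence's internal code.
// counts[i][j] = number of raters who assigned item i to category j;
// every item is assumed to be rated by the same number of raters.
function fleissKappa(counts: number[][]): number {
  const N = counts.length;                          // items
  const n = counts[0].reduce((a, b) => a + b, 0);   // raters per item
  const k = counts[0].length;                       // categories

  // Observed agreement: mean per-item agreement P_i.
  let pBar = 0;
  const categoryTotals = new Array(k).fill(0);
  for (const row of counts) {
    let sumSq = 0;
    row.forEach((c, j) => { sumSq += c * c; categoryTotals[j] += c; });
    pBar += (sumSq - n) / (n * (n - 1));
  }
  pBar /= N;

  // Chance agreement: sum of squared category proportions.
  const pe = categoryTotals.reduce((acc, c) => acc + (c / (N * n)) ** 2, 0);
  return (pBar - pe) / (1 - pe);
}

// Percentile bootstrap confidence interval: resample items with replacement.
function bootstrapKappaCI(counts: number[][], iters = 1000, alpha = 0.05): [number, number] {
  const stats = Array.from({ length: iters }, () => {
    const sample = Array.from({ length: counts.length },
      () => counts[Math.floor(Math.random() * counts.length)]);
    return fleissKappa(sample);
  }).sort((a, b) => a - b);
  return [stats[Math.floor((alpha / 2) * iters)],
          stats[Math.ceil((1 - alpha / 2) * iters) - 1]];
}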

Install

npm install -g conkurrence

MCP Server

Use ConKurrence as an MCP server in Claude Desktop or any MCP-compatible client:

npx conkurrence mcp

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "conkurrence": {
      "command": "npx",
      "args": ["-y", "conkurrence", "mcp"]
    }
  }
}

Claude Code Plugin

/plugin marketplace add AlligatorC0der/conkurrence

Features

  • Multi-model evaluation — Run your schema against Bedrock, OpenAI, and Gemini models simultaneously
  • Statistical rigor — Fleiss' kappa with bootstrap confidence intervals, Kendall's W for validity (see the sketch after this list)
  • Self-consistency mode — No API keys needed; uses the host model via MCP Sampling
  • Schema suggestion — AI-powered schema design from your data
  • Trend tracking — Compare runs over time, detect agreement degradation
  • Cost estimation — Know the cost before running
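
Kendall's W (the coefficient of concordance) complements kappa when raters produce rankings rather than category labels. As above, this is an illustrative sketch rather than the toolkit's own code: ranks[r][i] is the rank rater r gave to item i, with no tie correction.

// Illustrative sketch of Kendall's W for m raters ranking n items.
// ranks[r][i] = rank that rater r gave to item i (1..n); ties are not corrected.
function kendallsW(ranks: number[][]): number {
  const m = ranks.length;        // raters
  const n = ranks[0].length;     // items

  // Total rank received by each item, and the mean of those totals.
  const totals = Array.from({ length: n },
    (_, i) => ranks.reduce((sum, rater) => sum + rater[i], 0));
  const mean = totals.reduce((a, b) => a + b, 0) / n;

  // Sum of squared deviations of the rank totals from their mean.
  const S = totals.reduce((acc, t) => acc + (t - mean) ** 2, 0);

  // W ranges from 0 (no concordance) to 1 (perfect concordance).
  return (12 * S) / (m * m * (n ** 3 - n));
}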

MCP Tools

Tool                            Description
conkurrence_run                 Execute an evaluation across multiple AI raters
conkurrence_report              Generate a detailed markdown report
conkurrence_compare             Side-by-side comparison of two runs
conkurrence_trend               Track agreement over multiple runs
conkurrence_suggest             AI-powered schema suggestion from your data
conkurrence_validate_schema     Validate a schema before running
conkurrence_estimate            Estimate cost and token usage
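
For programmatic use outside Claude Desktop, any MCP client can launch the server over stdio and call these tools. Below is a sketch using the official MCP TypeScript SDK; the tool names come from the table above, but the exact argument shapes are not documented here, so list the tools and inspect their input schemas first.

// Sketch of calling the ConKurrence MCP server via the MCP TypeScript SDK.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "conkurrence", "mcp"],
});

const client = new Client({ name: "example-client", version: "1.0.0" }, { capabilities: {} });
await client.connect(transport);

// Discover the available tools and their input schemas.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Invoke a tool; the arguments object is a placeholder, not a documented schema.
const result = await client.callTool({
  name: "conkurrence_estimate",
  arguments: { /* fill in per the tool's input schema */ },
});
console.log(result);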

License

BUSL-1.1 — Business Source License 1.1
