ConKurrence

One command. Find out if your AI agrees with itself.

ConKurrence is a statistically validated consensus measurement toolkit for AI evaluation pipelines. It uses multiple AI models as independent raters, measures inter-rater reliability with Fleiss' kappa and bootstrap confidence intervals, and routes contested items to human experts.
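For intuition about the numbers ConKurrence reports: Fleiss' kappa corrects raw rater agreement for chance, and a percentile bootstrap over items yields a confidence interval around it. The sketch below is a self-contained illustration of these statistics, not the package's implementation:

```python
import random

def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of shape (items x categories), where
    ratings[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Per-item observed agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement P_e from the category marginals
    n_cats = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    grand = n_items * n_raters
    p_e = sum((t / grand) ** 2 for t in totals)
    if p_e == 1.0:  # degenerate: every rating in one category
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

def bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        fleiss_kappa([ratings[rng.randrange(len(ratings))] for _ in ratings])
        for _ in range(n_boot)
    )
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]
```

On Fleiss' classic 14-rater example this yields kappa of about 0.21, conventionally read as "fair" agreement; contested items are the ones dragging that number down.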

Install

npm install -g conkurrence

MCP Server

Use ConKurrence as an MCP server in Claude Desktop or any MCP-compatible client:

npx conkurrence mcp

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "conkurrence": {
      "command": "npx",
      "args": ["-y", "conkurrence", "mcp"]
    }
  }
}

Claude Code Plugin

/plugin marketplace add AlligatorC0der/conkurrence

Features

  • Multi-model evaluation — Run your schema against Bedrock, OpenAI, and Gemini models simultaneously
  • Statistical rigor — Fleiss' kappa with bootstrap confidence intervals, Kendall's W for validity
  • Self-consistency mode — No API keys needed; uses the host model via MCP Sampling
  • Schema suggestion — AI-powered schema design from your data
  • Trend tracking — Compare runs over time, detect agreement degradation
  • Cost estimation — Know the cost before running
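Kendall's W (the coefficient of concordance) mentioned above scores how consistently raters rank a set of items, from 0 (no agreement) to 1 (identical rankings). A minimal sketch of the coefficient itself, assuming complete untied rankings and independent of ConKurrence's internals:

```python
def kendalls_w(rankings):
    """Kendall's W for m judges each ranking n items without ties.
    rankings[j][i] = rank (1..n) that judge j assigns to item i."""
    m = len(rankings)
    n = len(rankings[0])
    # Rank totals per item, and their squared deviation from the mean total
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m * m * (n ** 3 - n))
```

Identical rankings give W = 1; two judges with exactly reversed rankings give W = 0.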

MCP Tools

  • conkurrence_run — Execute an evaluation across multiple AI raters
  • conkurrence_report — Generate a detailed markdown report
  • conkurrence_compare — Side-by-side comparison of two runs
  • conkurrence_trend — Track agreement over multiple runs
  • conkurrence_suggest — AI-powered schema suggestion from your data
  • conkurrence_validate_schema — Validate a schema before running
  • conkurrence_estimate — Estimate cost and token usage
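MCP clients invoke these tools through the protocol's standard tools/call request. The argument names in this example are illustrative only, not the tools' real input schema:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "conkurrence_estimate",
    "arguments": {
      "schema": "sentiment.yaml",
      "items": 500
    }
  }
}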

License

BUSL-1.1 — Business Source License 1.1
