ConKurrence

AI evaluation toolkit — measure inter-rater agreement (Fleiss' κ, Kendall's W) across multiple LLM providers

One command. Find out if your AI agrees with itself.

ConKurrence is a statistically validated consensus measurement toolkit for AI evaluation pipelines. It uses multiple AI models as independent raters, measures inter-rater reliability with Fleiss' kappa and bootstrap confidence intervals, and routes contested items to human experts.
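
The headline statistic operates on a matrix of category counts: for each evaluated item, how many raters picked each label. As an illustration of what Fleiss' kappa measures (a sketch of the standard formula, not ConKurrence's internal implementation):

```typescript
// Fleiss' kappa: chance-corrected agreement among n raters assigning
// N items to k categories.
// counts[i][j] = number of raters who put item i into category j.
function fleissKappa(counts: number[][]): number {
  const N = counts.length;                          // items
  const n = counts[0].reduce((a, b) => a + b, 0);   // raters per item
  const k = counts[0].length;                       // categories

  // Observed agreement: mean per-item agreement P_i
  let Pbar = 0;
  for (const row of counts) {
    const sumSq = row.reduce((a, c) => a + c * c, 0);
    Pbar += (sumSq - n) / (n * (n - 1));
  }
  Pbar /= N;

  // Chance agreement P_e from marginal category proportions
  let Pe = 0;
  for (let j = 0; j < k; j++) {
    let pj = 0;
    for (let i = 0; i < N; i++) pj += counts[i][j];
    pj /= N * n;
    Pe += pj * pj;
  }
  return (Pbar - Pe) / (1 - Pe);
}
```

Three raters in unanimous agreement on every item yields κ = 1; κ near 0 means the raters agree no more often than chance, which is the signal for routing items to human review.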

Install

npm install -g conkurrence

MCP Server

Use ConKurrence as an MCP server in Claude Desktop or any MCP-compatible client:

npx conkurrence mcp

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "conkurrence": {
      "command": "npx",
      "args": ["-y", "conkurrence", "mcp"]
    }
  }
}

Claude Code Plugin

/plugin marketplace add AlligatorC0der/conkurrence

Features

  • Multi-model evaluation — Run your schema against Bedrock, OpenAI, and Gemini models simultaneously
  • Statistical rigor — Fleiss' kappa with bootstrap confidence intervals, Kendall's W for validity
  • Self-consistency mode — No API keys needed; uses the host model via MCP Sampling
  • Schema suggestion — AI-powered schema design from your data
  • Trend tracking — Compare runs over time, detect agreement degradation
  • Cost estimation — Know the cost before running
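
Kendall's W, the second statistic listed above, applies when raters produce rankings rather than category labels. A minimal sketch of the standard coefficient of concordance, assuming untied integer ranks (again an illustration, not the package's internal code):

```typescript
// Kendall's W: concordance among m raters each ranking n items.
// ranks[r][i] = rank (1..n) that rater r assigns to item i; no ties.
function kendallsW(ranks: number[][]): number {
  const m = ranks.length;        // raters
  const n = ranks[0].length;     // items

  // Total rank received by each item across raters
  const R: number[] = Array(n).fill(0);
  for (const rater of ranks) {
    rater.forEach((rank, i) => { R[i] += rank; });
  }

  // Sum of squared deviations from the mean total rank
  const Rbar = (m * (n + 1)) / 2;
  const S = R.reduce((acc, Ri) => acc + (Ri - Rbar) ** 2, 0);

  // W = 1 when all raters rank identically, 0 when rankings cancel out
  return (12 * S) / (m * m * (n ** 3 - n));
}
```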

MCP Tools

  • conkurrence_run: Execute an evaluation across multiple AI raters
  • conkurrence_report: Generate a detailed markdown report
  • conkurrence_compare: Side-by-side comparison of two runs
  • conkurrence_trend: Track agreement over multiple runs
  • conkurrence_suggest: AI-powered schema suggestion from your data
  • conkurrence_validate_schema: Validate a schema before running
  • conkurrence_estimate: Estimate cost and token usage
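
MCP clients invoke these tools over JSON-RPC with the standard `tools/call` method. A hypothetical request for `conkurrence_estimate` might look like the following; the method and envelope come from the MCP specification, but the argument names here are placeholders, not the package's documented schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "conkurrence_estimate",
    "arguments": { "…": "…" }
  }
}
```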

License

BUSL-1.1 — Business Source License 1.1
