Kubernetes-MCP-Guard

AI-safe, approval-plan-gated Kubernetes operations through MCP, with OAuth, RBAC, audit, and guardrails.


🛡️ Kubernetes MCP Guard

Human-approved, AI-driven Kubernetes remediation through a guarded MCP gateway.

Remediation, by design:

Observer detects anomalies.
Planner proposes an evidence-backed plan.
Human reviewer approves out-of-band.
Executor runs only the approved digest-bound plan.
Everything is auditable.

📝 TL;DR

When something breaks, the system can collect evidence, propose a bounded fix, dry-run it, package it into a reviewable plan, and wait for a human to approve.

It is a security-first bridge between AI agents and Kubernetes, with out-of-band, OAuth-authenticated, human-in-the-loop (HITL), plan-based approval for every gateway-exposed mutation.

Why?

AI agents can help diagnose infrastructure problems, but giving them direct mutation access is risky. Kubernetes MCP Guard explores a safer pattern: agents may observe, dry-run, and propose bounded remediations, while humans approve the exact digest-bound plan before any Kubernetes write occurs.

🎬 Demo

https://github.com/user-attachments/assets/4e06b4ee-db80-4d74-96cc-38dfbb413042

[!NOTE] Demo scenario:

A Deployment is intentionally broken.

The Observer detects the unhealthy workload.

The Planner proposes a bounded remediation.

An approval access code is sent to the configured operator by email.

An authenticated human approves the exact plan in the browser.

The Executor applies the approved mutation.

The walkthrough in docs/demo-failing-deployment.md shows the full flow against a deliberately broken Deployment.

🧠 Core Ideas

Kubernetes MCP Guard explores a practical safety pattern for AI-assisted operations:

Plan before mutate: every gateway-exposed write starts as a request_* plan built from Kubernetes server-side dry-run evidence.
Separate review channel: the MCP client receives an approval URL, while approval happens through /approvals/* in a browser OAuth session.
Digest-bound approval: execution is bound to an Intent Digest for the executable mutation and a Review Digest for the human-reviewed snapshot.
Durable grant model: an approved Approval Challenge records a Challenge Outcome and issues an Approval Grant consumed by pre-execution gates.
Narrow Kubernetes scope: namespace allow-lists, namespace-scoped RBAC, supported-kind checks, and bounded read tools keep the operational surface small.
Auditable controls: guardrail and approval events are written as JSONL streams with identity, digest, grant, and execution context.
Structured multi-agent coordination: Observer, Planner, and Executor are independent processes (agents) that communicate over the A2A protocol (via a2a-dotnet), each with a separate OAuth service identity and a narrow gateway scope. The Planner owns a durable per-anomaly Task that persists across restarts and enforces one-remediation-per-anomaly without cross-service locking.

The repository also separates the generic approval lifecycle from the Kubernetes adapter, so the core language is not tied to one infrastructure domain.

See CONTEXT.md, docs/mutation-approval-profile.md, and docs/mutation-approval-flow.md.
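To make the digest-bound approval idea concrete, here is a minimal Python sketch. It assumes a digest is a SHA-256 hash over a canonical JSON encoding; the repository's actual canonicalization, hash choice, and field names are not stated here, so `canonical_digest`, `binding_holds`, and the plan fields are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_digest(payload: dict) -> str:
    """Hash a canonical JSON encoding so equal content always yields the same digest."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(encoded).hexdigest()


def binding_holds(approved_intent: str, approved_review: str,
                  mutation: dict, review_snapshot: dict) -> bool:
    """True only if both digests recorded at approval time still match the current content."""
    return (hmac.compare_digest(approved_intent, canonical_digest(mutation))
            and hmac.compare_digest(approved_review, canonical_digest(review_snapshot)))


# Hypothetical plan content; the field names are illustrative only.
mutation = {"operation": "scale_deployment", "namespace": "demo", "name": "web", "replicas": 3}
review_snapshot = {"summary": "Scale web from 1 to 3 replicas"}

intent_digest = canonical_digest(mutation)    # binds the executable mutation
review_digest = canonical_digest(review_snapshot)  # binds what the human saw

tampered = dict(mutation, replicas=30)
print(binding_holds(intent_digest, review_digest, mutation, review_snapshot))  # True
print(binding_holds(intent_digest, review_digest, tampered, review_snapshot))  # False
```

Because the approval records digests rather than free text, any change to the mutation after review produces a different Intent Digest and the binding check fails.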

🗺️ Architecture

---
title: Security Boundaries
---
flowchart TB
    subgraph outer["🌐 Internet / Operator"]
        Human["👤 Operator\nbrowser · OAuth PKCE"]
        McpClient["🤖 MCP Client\nCodex · Claude Code"]
    end

    subgraph gateway["🛡️ Gateway —  OAuth JWT required"]
        direction LR
        Guard["🔍 Guardrails\n+ ToolScopeGuard"]
        ApprovalUI["📋 Approval UI\n/approvals/*"]
        ApprovalCore["🔐 Approval Core\nplan · challenge · grant · digest"]
    end

    subgraph agents["🤖 Agent Tier  —  client_credentials · narrow scopes"]
        direction LR
        Obs["🔎 Observer\nmcp:tools.readonly"]
        Plan["📋 Planner\nmcp:tools.propose + readonly"]
        Exec["🛠️ Executor\nmcp:tools.execute"]
        Obs <-->|"A2A"| Plan <-->|"A2A"| Exec
    end

    subgraph private["🔒 Private Subprocess  —  no public port"]
        McpServer["⚙️ McpServer\nKubernetes tools"]
    end

    K8s(("☸️ Kubernetes API\n(namespace-scoped RBAC)"))

    Human -->|"review snapshot · approve/deny"| ApprovalUI --> ApprovalCore
    McpClient -->|"Bearer JWT · mcp:tools.read/write"| Guard -->|"scope-filtered tool call"| ApprovalCore
    agents -->|"Bearer JWT · service identity"| Guard
    ApprovalCore -->|"stdio · service token"| McpServer -->|"KubernetesClient"| K8s

The Observer notifies the Planner; the Planner then dispatches to the Executor synchronously and waits for the outcome.

The Planner's internal remediation pipeline is a concurrent DAG built on Microsoft.Agents.AI.Workflows. It fans each incoming anomaly through its own independent executor chain: Filter → Dedupe → LLM-Decide → Validate → Propose.
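The per-anomaly fan-out can be pictured with a small Python `asyncio` sketch. This is not the Microsoft.Agents.AI.Workflows implementation; the stage logic, the `decide` stub, and the anomaly fields (`id`, `resource`, `status`) are assumptions chosen only to show one chain per anomaly running concurrently.

```python
import asyncio
from typing import List, Optional

# The v1 operation menu named in this README.
ALLOWED_OPERATIONS = {"restart_deployment", "scale_deployment", "set_deployment_image"}


async def decide(anomaly: dict) -> dict:
    """Stand-in for the LLM-Decide stage; a real system would call a model here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"operation": "restart_deployment", "target": anomaly["resource"]}


async def run_chain(anomaly: dict, seen: set) -> Optional[dict]:
    # Filter: hypothetical rule that drops anomalies already marked Resolved.
    if anomaly.get("status") == "Resolved":
        return None
    # Dedupe: at most one remediation per anomaly id.
    if anomaly["id"] in seen:
        return None
    seen.add(anomaly["id"])
    # LLM-Decide
    decision = await decide(anomaly)
    # Validate: reject anything outside the bounded operation menu.
    if decision["operation"] not in ALLOWED_OPERATIONS:
        return None
    # Propose: hand back an approval-pending proposal (nothing is executed here).
    return {"anomaly_id": anomaly["id"], **decision, "status": "approval-pending"}


async def process_batch(anomalies: List[dict]) -> List[dict]:
    seen: set = set()
    results = await asyncio.gather(*(run_chain(a, seen) for a in anomalies))
    return [r for r in results if r is not None]


batch = [
    {"id": "a1", "resource": "deployment/web", "status": "Active"},
    {"id": "a1", "resource": "deployment/web", "status": "Active"},   # duplicate
    {"id": "a2", "resource": "deployment/api", "status": "Resolved"},  # filtered out
]
proposals = asyncio.run(process_batch(batch))
print(proposals)
```

Each chain ends at a proposal, which matches the Planner's scope boundary: it can propose plans but cannot execute them.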

Full request-flow diagrams live in docs/architecture.md.

🔐 Approval Flow

The central safety property is that approval is necessary but not sufficient. A human approval creates execution authorization, but execution still has to pass the pre-execution gates immediately before Kubernetes is mutated.

Phase	What happens	What can block it
Plan	A human-driven MCP client calls `request_*`, or the Planner calls `propose_plan`; the Kubernetes adapter gathers dry-run, diff, and policy evidence; the generic core stores a Plan Envelope with Intent and Review Digests.	Namespace rejection, manifest allow-list rejection, dry-run failure, domain policy denial, unsupported legacy plan format.
Approve	The client calls `execute_approved_plan`; the gateway creates or reuses a short-lived Approval Challenge and returns a browser URL. The browser renders the stored review snapshot, not model-supplied approval text.	Expired challenge, wrong authenticated subject, anti-forgery failure, changed digest binding, denied/rejected/canceled Challenge Outcome.
Execute	After approval, the client retries `execute_approved_plan`; the gateway validates the Approval Grant, digests, validity window, reuse policy, freshness checks, and domain policy checks before the adapter writes.	Missing/expired/mismatched grant, digest mismatch, already-applied Single-Execution Plan, second dry-run failure, policy failure, live-state drift.

Current implementation notes are tracked in docs/mutation-approval-profile.md#current-repository-fit.
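The "necessary but not sufficient" property can be sketched as a function that re-checks every gate right before the write. The Python below is illustrative only: `ApprovalGrant`, its fields, and the failure strings are hypothetical, and the real gateway performs more checks (freshness, domain policy, live-state drift) than this sketch models.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ApprovalGrant:
    plan_id: str
    intent_digest: str
    expires_at: datetime
    consumed: bool = False  # set once a Single-Execution Plan has been applied


def pre_execution_failures(grant: Optional[ApprovalGrant], plan_id: str,
                           current_intent_digest: str, now: datetime,
                           dry_run_ok: bool) -> List[str]:
    """Return every gate that blocks execution; an empty list means the write may proceed."""
    if grant is None:
        return ["missing grant"]
    failures = []
    if grant.plan_id != plan_id:
        failures.append("mismatched grant")
    if now >= grant.expires_at:
        failures.append("expired grant")
    if grant.intent_digest != current_intent_digest:
        failures.append("digest mismatch")
    if grant.consumed:
        failures.append("already applied")
    if not dry_run_ok:
        failures.append("second dry-run failed")
    return failures


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
grant = ApprovalGrant("plan-1", "sha256:abc", expires_at=now + timedelta(minutes=5))

print(pre_execution_failures(grant, "plan-1", "sha256:abc", now, dry_run_ok=True))
print(pre_execution_failures(grant, "plan-1", "sha256:changed", now + timedelta(minutes=10), True))
```

The first call returns an empty list, so execution may proceed; the second reports both an expired grant and a digest mismatch even though a human approved the plan earlier.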

🧰 Current Capabilities

🤖🔎 Anomaly Observer

The InfraGate.Observer is an LLM-driven agent that periodically inspects the cluster through the gateway's read-only tools and emits structured Anomaly Reports.

Capability	Description
Scheduled observation	Background `IHostedService` runs cycles on a configurable cadence (default 60s).
On-demand trigger	`POST /observe-now` returns a synchronous `AnomalyReport[]` with a 30s timeout.
Anomaly detection	LLM-assisted classification across four categories: Pod unhealthy, Deployment unavailable, Service no endpoints, Warning events.
Severity classification	Rules-derived `High`/`Medium`/`Low` with LLM disagreement telemetry.
Deduplication & resolution	In-memory dedupe window suppresses repeat reports; automatic `Resolved` emission when anomalies clear.
Handoff	Log sink always on; JSON file sink and Planner A2A handoff are opt-in; see docs/configuration.md.
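The deduplication and resolution row can be illustrated with a small in-memory window. This Python sketch is an assumption about the general technique, not the Observer's code: the `DedupeWindow` class, the string anomaly keys, and the event tuples are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class DedupeWindow:
    """Suppress repeat reports inside a time window and emit Resolved when an anomaly clears."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.last_reported: Dict[str, float] = {}  # anomaly key -> time of last report

    def process(self, active_keys: Set[str], now: float) -> List[Tuple[str, str]]:
        events = []
        for key in sorted(active_keys):
            last = self.last_reported.get(key)
            # Report new anomalies, or ones whose suppression window has elapsed.
            if last is None or now - last >= self.window:
                events.append(("Reported", key))
                self.last_reported[key] = now
        # Anything reported earlier but no longer active has cleared.
        for key in sorted(set(self.last_reported) - active_keys):
            events.append(("Resolved", key))
            del self.last_reported[key]
        return events


window = DedupeWindow(window_seconds=300)
first = window.process({"demo/web:DeploymentUnavailable"}, now=0)
second = window.process({"demo/web:DeploymentUnavailable"}, now=60)
third = window.process(set(), now=120)
print(first, second, third)
```

Because the window lives in memory, a restart would forget earlier reports; the sketch shares that limitation with any purely in-memory dedupe.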

🤖📋 Remediation Planner

The InfraGate.Planner consumes Anomaly Reports, chooses a bounded remediation operation, and creates approval-pending Operator Approval Policy plans through propose_plan.

Capability	Description
Anomaly intake	Receives `AnomalyHandoffBatch` payloads from the Observer over A2A; each anomaly is processed independently through a concurrent DAG pipeline: Filter → Dedupe → LLM-Decide → Validate → Propose.
Operation menu	Chooses only `restart_deployment`, `scale_deployment`, or `set_deployment_image` in v1.
Plan proposal	Calls `propose_plan` to create a digest-bound Plan Envelope for operator approval.
Approval notification	`propose_plan` creates an Approval Access Code and emails it to the configured operator through the gateway SMTP sender, when SMTP is configured.
Durable task lifecycle	One A2A Task per anomaly (keyed by `contextId`) tracks state from `Submitted` through `Working`, `AuthRequired` (awaiting operator approval), to `Completed`/`Failed`/`Rejected`. Persisted to PostgreSQL when `InfraGate__Planner__AuditConnectionString` is set; otherwise in-memory.
Scope boundary	Planner can propose plans and use read-only inspection tools; it cannot execute plans.
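The durable task lifecycle reads naturally as a state machine keyed by `contextId`. The Python sketch below uses the state names from the table, but the allowed transitions and the `TaskStore` class are assumptions made for illustration; the actual A2A task rules may differ.

```python
from enum import Enum
from typing import Dict


class TaskState(Enum):
    SUBMITTED = "Submitted"
    WORKING = "Working"
    AUTH_REQUIRED = "AuthRequired"  # awaiting operator approval
    COMPLETED = "Completed"
    FAILED = "Failed"
    REJECTED = "Rejected"


_END = {TaskState.COMPLETED, TaskState.FAILED, TaskState.REJECTED}

# Hypothetical transition table; terminal states have no outgoing edges.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.WORKING} | _END - {TaskState.COMPLETED},
    TaskState.WORKING: {TaskState.AUTH_REQUIRED} | _END,
    TaskState.AUTH_REQUIRED: {TaskState.WORKING} | _END,
}


class TaskStore:
    """In-memory stand-in for the task store; one task per anomaly contextId."""

    def __init__(self) -> None:
        self._states: Dict[str, TaskState] = {}

    def submit(self, context_id: str) -> bool:
        """Create a task for a new anomaly; return False if one already exists."""
        if context_id in self._states:
            return False
        self._states[context_id] = TaskState.SUBMITTED
        return True

    def advance(self, context_id: str, new_state: TaskState) -> None:
        current = self._states[context_id]
        if new_state not in ALLOWED.get(current, set()):
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        self._states[context_id] = new_state

    def state(self, context_id: str) -> TaskState:
        return self._states[context_id]


store = TaskStore()
store.submit("anomaly-42")
store.advance("anomaly-42", TaskState.WORKING)
store.advance("anomaly-42", TaskState.AUTH_REQUIRED)
store.advance("anomaly-42", TaskState.COMPLETED)
print(store.state("anomaly-42").value, store.submit("anomaly-42"))
```

Keying the task by `contextId` is what lets a second report of the same anomaly be refused without any cross-service lock: the existing task is simply found and reused.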

🤖🛠️ Remediation Executor

The InfraGate.Executor consumes Planner proposals, waits for approval, and executes only after the gateway reports that an Approval Grant exists.

Capability	Description
Proposal intake	Receives plan ids from the Planner over synchronous A2A dispatch.
Approval wait	Calls `wait_for_plan_approval` for each plan id until approval, timeout, or terminal status.
Approved execution	Calls `execute_approved_plan` only after approval is reported.
Scope boundary	Executor can wait and execute approved plans; it cannot create plans or call read-only inspection tools.
Gateway gates	The gateway still enforces approval grants, digests, freshness, policy checks, and single execution.
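The Executor's wait-then-execute behavior can be sketched as a short loop. In this Python example the `FakeGateway` class, the status strings (`pending`, `approved`, `denied`, `expired`), and the attempt limit are all hypothetical; only the tool names `wait_for_plan_approval` and `execute_approved_plan` come from this README.

```python
from typing import Iterable, List

# Hypothetical terminal statuses that end the wait without executing.
TERMINAL_STATUSES = {"denied", "expired", "canceled"}


class FakeGateway:
    """Test double that replays a scripted sequence of approval statuses."""

    def __init__(self, statuses: Iterable[str]) -> None:
        self._statuses = iter(statuses)
        self.executed: List[str] = []

    def wait_for_plan_approval(self, plan_id: str) -> str:
        # Once the script runs out, keep reporting that approval is still pending.
        return next(self._statuses, "pending")

    def execute_approved_plan(self, plan_id: str) -> str:
        self.executed.append(plan_id)
        return "applied"


def run_plan(gateway: FakeGateway, plan_id: str, max_attempts: int = 5) -> str:
    """Wait for approval; execute only after the gateway reports approval."""
    for _ in range(max_attempts):
        status = gateway.wait_for_plan_approval(plan_id)
        if status == "approved":
            return gateway.execute_approved_plan(plan_id)
        if status in TERMINAL_STATUSES:
            return status
    return "timeout"


approved_gateway = FakeGateway(["pending", "approved"])
denied_gateway = FakeGateway(["denied"])
print(run_plan(approved_gateway, "plan-1"), run_plan(denied_gateway, "plan-2"))
```

Even in this sketch the loop is only a convenience. The gateway remains the authority, since it re-checks grants, digests, and freshness when `execute_approved_plan` is called.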

🛡️ Gateway Protections

Layer	Current behavior
MCP transport	HTTP MCP endpoint at `/mcp` using Streamable HTTP.
Authentication	OAuth JWT validation for MCP calls; browser OAuth cookie for approval pages.
OAuth discovery	Protected-resource metadata and insufficient-scope challenges for MCP clients.
Approval authority	Browser approval endpoints under `/approvals/*` with same-subject binding and anti-forgery checks.
Guardrails	Warn on suspicious request patterns and redact suspicious response content before it returns to the MCP client.
Audit	Separate JSONL streams for guardrail findings and approval lifecycle events.
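As an illustration of the audit row, the following Python sketch appends approval lifecycle events to a JSONL stream, one JSON object per line. The record fields, the event names, and the `write_audit_event` helper are hypothetical; the repository's real audit schema is not reproduced here.

```python
import io
import json
from datetime import datetime, timezone
from typing import List, Optional, TextIO


def write_audit_event(stream: TextIO, event: str, subject: str, plan_id: str,
                      intent_digest: str, grant_id: Optional[str] = None) -> None:
    """Append one self-contained JSON record per line (JSONL)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,                  # e.g. challenge created, grant issued
        "subject": subject,              # authenticated identity
        "plan_id": plan_id,
        "intent_digest": intent_digest,
        "grant_id": grant_id,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def parse_jsonl(text: str) -> List[dict]:
    """Read the stream back, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


log = io.StringIO()  # a real deployment would append to a file instead
write_audit_event(log, "challenge_created", "operator@example.com", "plan-1", "sha256:abc")
write_audit_event(log, "grant_issued", "operator@example.com", "plan-1", "sha256:abc", "grant-9")

events = parse_jsonl(log.getvalue())
print([e["event"] for e in events])
```

One record per line keeps the stream append-only and easy to tail or ship to a log pipeline, which is the usual reason for choosing JSONL over a single JSON document.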

🔎 Read-Only Observability

Tool	Purpose
`get_allowed_namespaces`	Return the namespace allow-list configured for the server.
`get_k8s_status`	Summarize Deployments, Services, ConfigMaps, Pods, and ReplicaSets in a namespace.
`get_k8s_events`	Read bounded `events.k8s.io/v1` diagnostics.
`get_pod_logs`	Read bounded Pod logs with tail-line and byte caps.
`get_k8s_resource`	Return a focused resource summary without Secret values, ConfigMap data, or raw manifests.
`get_deployment_diagnostics`	Inspect Deployment health, related Pods, ReplicaSets, and Events.
`get_pod_diagnostics`	Inspect Pod status, conditions, container state, and Events.
`get_service_diagnostics`	Inspect Service endpoints, backing Pods, and Events.
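"Bounded" reads, such as the tail-line and byte caps on `get_pod_logs`, can be sketched as follows. The `bound_logs` function and its parameter names are hypothetical, and the choice to keep the most recent bytes when the byte cap is hit is an assumption of this sketch.

```python
def bound_logs(text: str, tail_lines: int, max_bytes: int) -> str:
    """Keep at most the last `tail_lines` lines, then at most the last `max_bytes` bytes."""
    lines = text.splitlines()
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    data = "\n".join(tail).encode("utf-8")
    if len(data) > max_bytes:
        # Keep the most recent bytes; explicit start index avoids the data[-0:] pitfall.
        data = data[len(data) - max_bytes:]
    # Drop any multi-byte character that was cut in half by the byte cap.
    return data.decode("utf-8", errors="ignore")


sample = "line 1\nline 2\nline 3\nline 4"
print(bound_logs(sample, tail_lines=2, max_bytes=1024))
print(bound_logs(sample, tail_lines=2, max_bytes=6))
```

Applying both caps means a single very long line cannot bypass the limit, which a line cap alone would allow.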

✅ Gateway Approval Tools

Tool	Purpose
`request_apply_manifest`	Dry-run and plan server-side apply for `Deployment`, `Service`, or `ConfigMap`.
`request_delete_manifest`	Dry-run and plan deletion for supported manifest kinds.
`request_scale_deployment`	Dry-run and plan a Deployment replica-count change.
`request_restart_deployment`	Dry-run and plan a Deployment rollout restart.
`request_set_deployment_image`	Dry-run and plan a Deployment container image update.
`propose_plan`	Create an approval-pending Operator Approval Policy plan for the autonomous Planner operation menu.
`execute_approved_plan`	Create the browser approval challenge or execute an approved, digest-bound plan after gates pass.
`get_plan_status`	Read the current approval status for a plan.
`wait_for_plan_approval`	Wait briefly for an out-of-band browser approval and return status JSON without applying the plan.

Direct Kubernetes mutation tools exist inside the private server surface for the adapter executor. The HTTP gateway exposes request_* wrappers plus execute_approved_plan instead of exposing raw destructive tools to MCP clients.
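A scope guard that filters the exposed tool surface per caller might look like the sketch below. The scope names and tool names appear in this README, but the exact scope-to-tool mapping is an assumption, and `allowed_tools` and `authorize` are hypothetical helpers rather than the gateway's ToolScopeGuard API. Tools whose scope is not stated here, such as `get_plan_status`, are left out.

```python
from typing import Iterable, Set

READ_ONLY_TOOLS = {
    "get_allowed_namespaces", "get_k8s_status", "get_k8s_events", "get_pod_logs",
    "get_k8s_resource", "get_deployment_diagnostics", "get_pod_diagnostics",
    "get_service_diagnostics",
}

# Assumed mapping from agent scopes to gateway tools.
SCOPE_TOOLS = {
    "mcp:tools.readonly": READ_ONLY_TOOLS,
    "mcp:tools.propose": {"propose_plan"},
    "mcp:tools.execute": {"wait_for_plan_approval", "execute_approved_plan"},
}


def allowed_tools(scopes: Iterable[str]) -> Set[str]:
    """Union of the tools granted by each recognized scope; unknown scopes grant nothing."""
    tools: Set[str] = set()
    for scope in scopes:
        tools |= SCOPE_TOOLS.get(scope, set())
    return tools


def authorize(tool: str, scopes: Iterable[str]) -> bool:
    return tool in allowed_tools(scopes)


planner_scopes = ["mcp:tools.propose", "mcp:tools.readonly"]
executor_scopes = ["mcp:tools.execute"]

print(authorize("propose_plan", planner_scopes))            # True
print(authorize("execute_approved_plan", planner_scopes))   # False
print(authorize("get_pod_logs", executor_scopes))           # False
```

Raw mutation tools never appear in the mapping at all, so no combination of scopes can reach them through this guard. That mirrors the idea of exposing only wrappers at the gateway.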

⚡ Quick Start

Prerequisites: Docker Compose v2, kubectl, minikube, and git.

Review docs/configuration.md before changing runtime settings.

📦 From Packages

The default quickstart uses published images and committed local-demo defaults.

git clone https://github.com/mirusser/Kubernetes-MCP-Guard.git
cd Kubernetes-MCP-Guard

export InfraGate__OpenRouter__ApiKey="<openrouter-api-key>"
make quickstart

make quickstart starts the local Keycloak-backed OAuth path, PostgreSQL approval store, and published gateway image with TAG=latest. Pin a release with TAG=v0.1.0 make quickstart. The committed no-SDK defaults come from the smoke-release Run Profile: deploy/local-oauth/release.env.example supplies both Compose interpolation and InfraGate__... runtime settings.

🛠️ From Source

Use source mode when you want the gateway, Observer, Planner, and Executor built from local code. This path also requires the .NET 10 SDK and an OpenRouter API key for the LLM-backed agents.

export InfraGate__OpenRouter__ApiKey="<openrouter-api-key>"
make quickstart-source

The source quickstart generates deploy/generated/local-compose.env (default configuration) from deploy/run-profiles.yaml and starts the gateway, Observer, Planner, and Executor from local source builds.

Useful follow-up commands:

make quickstart-logs
make quickstart-down

Other run modes and full setup details are in docs/setup-guide.md.

⌨️ Connect Codex CLI

Add this to ~/.codex/config.toml:

[mcp_servers.infra-gate]
url = "http://127.0.0.1:3001/mcp"
oauth_resource = "http://127.0.0.1:3001/mcp"
scopes = ["mcp:tools.read"]

Use mcp:tools.write for sessions where you intend to create and apply mutation plans. The legacy mcp:tools scope grants full access for backward compatibility.

Then authenticate and start Codex:

codex mcp login infra-gate
codex

💬 Connect Claude Code

claude mcp add-json --scope user infra-gate \
  '{"type":"http","url":"http://127.0.0.1:3001/mcp","oauth":{"scopes":"mcp:tools.read"}}'

claude
/mcp

📦 Container Images

Release images are built by the Docker workflow and published to GHCR and Docker Hub.

Registry	Gateway image
GitHub Container Registry	`ghcr.io/mirusser/kubernetes-mcp-guard-gateway:<tag>`
Docker Hub	`mirusser/kubernetes-mcp-guard-gateway:<tag>`

Use specific release tags for stable demos. The :dev tag tracks the development branch, and :latest tracks the most recent stable release.

🧩 Compatibility

Area	Supported / tested
.NET	.NET 10
Kubernetes	minikube / local cluster initially
MCP transport	HTTP MCP endpoint at `/mcp`
OIDC	Keycloak local/dev path; external OIDC providers by configuration
Container registries	GHCR, Docker Hub
Platforms	linux/amd64 initially

🧭 Project Map

Developer runbook, local runs, MCP tool contracts, and verification: docs/devs-readme.md.
Setup paths, run profiles, environment variables, and production guidance: docs/setup-guide.md and docs/configuration.md.
Request flows, safety boundaries, and per-tool permissions: docs/architecture.md, docs/security-model.md, and docs/tool-permissions.md.
Runtime services: McpGateway, McpServer, Observer, Planner, and Executor.
Approval and Kubernetes domain: Approvals, Approvals.Postgres, and KubernetesAdapter.
Validation and demos: tests, failing-deployment example, and local SonarQube.

⚖️ Boundaries And Non-Goals

[!IMPORTANT]

The project is experimental and not production-certified.

The local Keycloak realm runs in development mode over HTTP and is not a production identity provider.

Prompt-injection guardrails are defense-in-depth, not a guaranteed hard security boundary.

The tool surface does not expose shell execution, kubectl passthrough, exec, attach, port-forward, namespace creation, RBAC manipulation, Secret reads, raw manifest reads, or cluster-scoped writes.

This is not a full Kubernetes policy engine and not an MCP standard.

See docs/security-model.md for the full threat model.

It is a working reference implementation for a possible MCP mutation-approval profile, designed for early technical evaluation in local or tightly controlled environments, not production-certified infrastructure.

The codebase uses InfraGate as the internal project name.

📜 Governance

License: Apache-2.0
Security policy: SECURITY.md
Contributing guide: CONTRIBUTING.md
Changelog: CHANGELOG.md
Release process: docs/releasing.md

Built with ❤️, ☕ and careful little guardrails 🛡️✨