GOVERNANCEBENCH

A benchmark for AI agent governance.

GovernanceBench measures whether a platform can halt, audit, supervise, and preserve behavioral integrity across autonomous agent workflows.

Current Agentomy result
100/100 across 5 dimensions
211 of 211 scenarios passed, 0 failed, 0 skipped

Autonomous systems need measurable governance controls.

GovernanceBench exists because ungoverned AI agents acting autonomously can produce outcomes that their operators did not anticipate, authorize, or have the ability to halt. In March 2026, an AI system was used to accelerate a cryptographic discovery with global security implications. No governance layer existed between the AI system's capability and the outcome it produced. GovernanceBench measures whether a governance platform provides the halt capability, audit trail, and behavioral integrity controls that prevent this class of outcome from happening without operator awareness and consent.

The benchmark translates that risk into practical tests: can a platform block unauthorized actions, preserve a verifiable audit trail, contain sub-agent behavior, and keep a human supervisor in the control loop?

Halt capability
Can an operator stop an agent before damage compounds?
Audit trail
Can the system produce a verifiable record of what happened and why?
Behavioral integrity
Can the system detect and contain drift from expected behavior?
HTML Reports
Standalone HTML reports with color-coded scores, expandable scenario details, and print-friendly layout.

Agentomy has published confirmed results for all five suites.

Current results are published as confirmed test results, not certifications. 100/100 across 5 dimensions. 211 of 211 scenarios passed, 0 failed, 0 skipped.

Suite 1
Authorization Enforcement
100%
Confirmed test result
Tests whether permission tier enforcement is server-side, tier escalation via request body is blocked, and agents operate within designated scope.
Suite 2
Audit Trail Integrity
100%
Confirmed test result
Tests whether the audit trail is hash-linked, tamper-evident, paginated, and exportable with complete event coverage.
Suite 3
Kill Switch / Override
100%
Confirmed test result
Tests whether emergency halt stops all agents immediately, authorization is blocked during halt, and resume restores normal operation.
Suite 4
Behavioral Monitoring
100%
Confirmed test result
Tests runtime anomaly detection including frequency bursts, privilege probing, quarantine enforcement, and agent-scoped behavioral baselines.
Suite 5
OWASP Agentic Top 10
100%
Confirmed test result
Tests coverage of OWASP ASI-01 through ASI-10 risks including goal hijacking, tool misuse, identity abuse, and cascading failures.
Overall: 100/100 (Excellent). 211 passed, 0 failed, 0 skipped.

What each suite measures.

Suite 1 100%
Authorization Enforcement
Tests whether permission tier enforcement is server-side, tier escalation via request body is blocked, and agents operate within designated scope. 50 scenarios.
Confirmed test result
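The server-side property this suite tests can be sketched in a few lines. This is an illustrative example, not the GovernanceBench or Agentomy implementation; the tier names, registry, and request shape are assumptions. The point is structural: the agent's tier is looked up server-side, so a "tier" field smuggled into the request body can never escalate privileges.

```typescript
type Tier = "read" | "write" | "admin";
const TIER_RANK: Record<Tier, number> = { read: 0, write: 1, admin: 2 };

// Server-side source of truth for each agent's tier.
// Nothing in the incoming request can change this mapping.
const agentTiers = new Map<string, Tier>([["agent-a", "read"]]);

interface ActionRequest {
  agentId: string;
  action: string;
  requiredTier: Tier;
  body: Record<string, unknown>; // may contain a spoofed "tier" field
}

function authorize(req: ActionRequest): boolean {
  // The tier comes from the server-side registry; req.body is never
  // consulted, so escalation via the request body is structurally impossible.
  const actual = agentTiers.get(req.agentId);
  if (actual === undefined) return false;
  return TIER_RANK[actual] >= TIER_RANK[req.requiredTier];
}
```

A read-tier agent passing `{ tier: "admin" }` in the body is still denied admin-tier actions, because the decision never reads the body.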
Suite 2 100%
Audit Trail Integrity
Tests whether the audit trail is hash-linked, tamper-evident, paginated, and exportable with complete event coverage. 50 scenarios.
Confirmed test result
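Hash-linking is what makes an audit trail tamper-evident: each event's hash covers the previous event's hash, so editing any record breaks every link after it. The sketch below shows the mechanism under assumed field names; it is not the product's audit implementation.

```typescript
import { createHash } from "node:crypto";

interface AuditEvent {
  seq: number;
  actor: string;
  action: string;
  prevHash: string;
  hash: string;
}

function hashEvent(seq: number, actor: string, action: string, prevHash: string): string {
  // Each hash commits to the event fields AND the previous hash.
  return createHash("sha256").update(`${seq}|${actor}|${action}|${prevHash}`).digest("hex");
}

class AuditTrail {
  events: AuditEvent[] = [];

  append(actor: string, action: string): void {
    const seq = this.events.length;
    const prevHash = seq === 0 ? "genesis" : this.events[seq - 1].hash;
    this.events.push({ seq, actor, action, prevHash, hash: hashEvent(seq, actor, action, prevHash) });
  }

  // Recompute every link; any edited event breaks the chain from that point on.
  verify(): boolean {
    return this.events.every((e, i) => {
      const expectedPrev = i === 0 ? "genesis" : this.events[i - 1].hash;
      return e.prevHash === expectedPrev && e.hash === hashEvent(e.seq, e.actor, e.action, e.prevHash);
    });
  }
}
```

Verification is O(n) over the trail, which is why the suite also checks pagination and export: a chain you cannot retrieve in full cannot be independently verified.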
Suite 3 100%
Kill Switch / Override
Tests whether emergency halt stops all agents immediately, authorization is blocked during halt, and resume restores normal operation. 50 scenarios.
Confirmed test result
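The halt semantics this suite tests amount to one invariant: every authorization decision is gated through a single switch, so flipping it denies all agents at once with no per-agent teardown race, and resuming restores the underlying decisions unchanged. A minimal sketch, with assumed method names:

```typescript
// Illustrative kill-switch gate; names are assumptions, not the product API.
class KillSwitch {
  private halted = false;

  halt(): void {
    this.halted = true;
  }

  resume(): void {
    this.halted = false;
  }

  // Every authorization result passes through this gate last. While halted,
  // even an otherwise-authorized action is denied, for every agent.
  permit(decision: boolean): boolean {
    return this.halted ? false : decision;
  }
}
```

Because the gate wraps the final decision rather than individual agents, "stops all agents immediately" holds by construction: there is no agent-by-agent shutdown to get partially through.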
Suite 4 100%
Behavioral Monitoring
Tests runtime anomaly detection including frequency bursts, privilege probing, quarantine enforcement, and agent-scoped behavioral baselines. 51 scenarios.
Confirmed test result
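"Frequency burst" detection with agent-scoped baselines can be as simple as a sliding-window counter kept per agent. The thresholds and class below are assumptions for illustration; the suite tests the observable behavior, not any particular implementation.

```typescript
// Sketch of burst detection with per-agent sliding windows (assumed thresholds).
class BurstMonitor {
  private timestamps = new Map<string, number[]>();

  constructor(private windowMs: number, private maxPerWindow: number) {}

  // Record one action. Returns true when this agent exceeds its allowed
  // rate inside the sliding window and should be quarantined. Windows are
  // keyed by agentId, so one agent's burst never flags another.
  record(agentId: string, nowMs: number): boolean {
    const prior = this.timestamps.get(agentId) ?? [];
    const recent = prior.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.timestamps.set(agentId, recent);
    return recent.length > this.maxPerWindow;
  }
}
```

The agent-scoped keying is the part that matters for this suite: baselines are per agent, so a chatty-but-normal agent does not raise the threshold for, or trip alarms about, its quieter peers.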
Suite 5 100%
OWASP Agentic Top 10
Tests coverage of OWASP ASI-01 through ASI-10 risks including goal hijacking, tool misuse, identity abuse, and cascading failures. 10 scenarios.
Confirmed test result

GovernanceBench measures the controls that matter before autonomy scales.

01
Authorization
Can the system prevent unauthorized actions before they execute?
02
Auditability
Can the system produce a usable record of what happened, why, and under which policy?
03
Override
Can an operator halt or override the agent before damage compounds?
04
Behavioral
Can the system detect and contain drift from expected behavior?
05
OWASP
Can the system map agentic risks to recognized security categories?

GovernanceBench is designed to be run, not just cited.

Run the benchmark locally and compare governance controls across agent platforms as results are published.

Apache 2.0 licensed open standard. Initial package support is published through the Agentomy Agent toolchain.

$ npx governancebench
npm package · GitHub

Transparent tests. Repeatable scoring. No hidden claims.

GovernanceBench is organized around suites, dimensions, scenarios, expected behaviors, and pass/fail scoring. The methodology is designed so teams can inspect what is being measured before trusting the score.
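The structure described above can be made concrete with a small schema sketch. Field names here are illustrative, derived only from the terms the methodology uses (suite, dimension, scenario, expected behavior, pass/fail); the actual package's types may differ.

```typescript
// Hypothetical scenario and scoring shapes, for inspection purposes only.
interface Scenario {
  suite: number; // 1..5
  id: string;
  expectedBehavior: string; // what the platform is supposed to do
  passed: boolean;
  skipped: boolean;
}

interface SuiteScore {
  suite: number;
  passed: number;
  failed: number;
  skipped: number;
  percent: number; // pass rate over non-skipped scenarios
}

function scoreSuite(suite: number, scenarios: Scenario[]): SuiteScore {
  const inSuite = scenarios.filter((s) => s.suite === suite);
  const skipped = inSuite.filter((s) => s.skipped).length;
  const passed = inSuite.filter((s) => !s.skipped && s.passed).length;
  const failed = inSuite.length - passed - skipped;
  const scored = inSuite.length - skipped;
  return {
    suite,
    passed,
    failed,
    skipped,
    percent: scored === 0 ? 0 : Math.round((passed / scored) * 100),
  };
}
```

Scoring is deliberately flat: a scenario either exhibits the expected behavior or it does not, which is what makes results repeatable across platforms.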

Methodology documentation available on request.

Five dimensions. 211 scenarios total. Apache 2.0 licensed open standard.

A benchmark becomes useful when multiple systems can be measured against it.

GovernanceBench is built to support platform comparison as public scores become available. No competitor results are claimed until they are published through the same methodology.

Platform         Suite 1   Suite 2   Suite 3   Suite 4   Suite 5   Overall   Status
Agentomy         100%      100%      100%      100%      100%      100/100   Published result
Other platforms  Pending   Pending   Pending   Pending   Pending   Pending   No public score yet

Run the benchmark. Review the demo. Decide what governance should prove.

Agentomy publishes the benchmark because governance should be measurable before autonomous agents become operational infrastructure.