GOVERNANCEBENCH

A benchmark for AI agent governance.

GovernanceBench measures whether a platform can halt, audit, supervise, and preserve behavioral integrity across autonomous agent workflows.

Current Agentomy result
100/100 across 5 dimensions
211 of 211 scenarios passed, 0 failed, 0 skipped

Autonomous systems need measurable governance controls.

GovernanceBench exists because ungoverned AI agents acting autonomously can produce outcomes that their operators did not anticipate, authorize, or have the ability to halt. In March 2026, an AI system was used to accelerate a cryptographic discovery with global security implications. No governance layer existed between the AI system's capability and the outcome it produced. GovernanceBench measures whether a governance platform provides the halt capability, audit trail, and behavioral integrity controls that prevent this class of outcome from happening without operator awareness and consent.

The benchmark translates that risk into practical tests: can a platform block unauthorized actions, preserve a verifiable audit trail, contain sub-agent behavior, and keep a human supervisor in the control loop?

Halt capability
Can an operator stop an agent before damage compounds?
Audit trail
Can the system produce a verifiable record of what happened and why?
Behavioral integrity
Can the system detect and contain drift from expected behavior?
HTML Reports
Standalone HTML reports with color-coded scores, expandable scenario details, and print-friendly layout.

Agentomy has published confirmed results for all five suites.

Current results are published as confirmed test results, not certifications. 100/100 across 5 dimensions. 211 of 211 scenarios passed, 0 failed, 0 skipped.

Suite 1
Authorization Enforcement
100%
Confirmed test result
Tests whether permission tier enforcement is server-side, tier escalation via request body is blocked, and agents operate within designated scope.
Suite 2
Audit Trail Integrity
100%
Confirmed test result
Tests whether the audit trail is hash-linked, tamper-evident, paginated, and exportable with complete event coverage.
Suite 3
Kill Switch / Override
100%
Confirmed test result
Tests whether emergency halt stops all agents immediately, authorization is blocked during halt, and resume restores normal operation.
Suite 4
Behavioral Monitoring
100%
Confirmed test result
Tests runtime anomaly detection including frequency bursts, privilege probing, quarantine enforcement, and agent-scoped behavioral baselines.
Suite 5
OWASP Agentic Top 10
100%
Confirmed test result
Tests coverage of OWASP ASI-01 through ASI-10 risks including goal hijacking, tool misuse, identity abuse, and cascading failures.
Overall: 100/100 (Excellent). 211 passed, 0 failed, 0 skipped.

What each suite measures.

Suite 1 100%
Authorization Enforcement
Tests whether permission tier enforcement is server-side, tier escalation via request body is blocked, and agents operate within designated scope. 50 scenarios.
Confirmed test result
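The server-side property this suite tests can be sketched in a few lines. This is an illustrative example, not the GovernanceBench or Agentomy implementation; the tier names, registry, and request shape are assumptions. The point is structural: the agent's tier is looked up server-side, so a "tier" field smuggled into the request body can never escalate privileges.

```typescript
type Tier = "read" | "write" | "admin";
const TIER_RANK: Record<Tier, number> = { read: 0, write: 1, admin: 2 };

// Server-side source of truth for each agent's tier.
// Nothing in the incoming request can change this mapping.
const agentTiers = new Map<string, Tier>([["agent-a", "read"]]);

interface ActionRequest {
  agentId: string;
  action: string;
  requiredTier: Tier;
  body: Record<string, unknown>; // may contain a spoofed "tier" field
}

function authorize(req: ActionRequest): boolean {
  // The tier comes from the server-side registry; req.body is never
  // consulted, so escalation via the request body is structurally impossible.
  const actual = agentTiers.get(req.agentId);
  if (actual === undefined) return false;
  return TIER_RANK[actual] >= TIER_RANK[req.requiredTier];
}
```

A read-tier agent passing `{ tier: "admin" }` in the body is still denied admin-tier actions, because the decision never reads the body.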
Suite 2 100%
Audit Trail Integrity
Tests whether the audit trail is hash-linked, tamper-evident, paginated, and exportable with complete event coverage. 50 scenarios.
Confirmed test result
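Hash-linking is what makes an audit trail tamper-evident: each event's hash covers the previous event's hash, so editing any record breaks every link after it. The sketch below shows the mechanism under assumed field names; it is not the product's audit implementation.

```typescript
import { createHash } from "node:crypto";

interface AuditEvent {
  seq: number;
  actor: string;
  action: string;
  prevHash: string;
  hash: string;
}

function hashEvent(seq: number, actor: string, action: string, prevHash: string): string {
  // Each hash commits to the event fields AND the previous hash.
  return createHash("sha256").update(`${seq}|${actor}|${action}|${prevHash}`).digest("hex");
}

class AuditTrail {
  events: AuditEvent[] = [];

  append(actor: string, action: string): void {
    const seq = this.events.length;
    const prevHash = seq === 0 ? "genesis" : this.events[seq - 1].hash;
    this.events.push({ seq, actor, action, prevHash, hash: hashEvent(seq, actor, action, prevHash) });
  }

  // Recompute every link; any edited event breaks the chain from that point on.
  verify(): boolean {
    return this.events.every((e, i) => {
      const expectedPrev = i === 0 ? "genesis" : this.events[i - 1].hash;
      return e.prevHash === expectedPrev && e.hash === hashEvent(e.seq, e.actor, e.action, e.prevHash);
    });
  }
}
```

Verification is O(n) over the trail, which is why the suite also checks pagination and export: a chain you cannot retrieve in full cannot be independently verified.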
Suite 3 100%
Kill Switch / Override
Tests whether emergency halt stops all agents immediately, authorization is blocked during halt, and resume restores normal operation. 50 scenarios.
Confirmed test result
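The halt semantics this suite tests amount to one invariant: every authorization decision is gated through a single switch, so flipping it denies all agents at once with no per-agent teardown race, and resuming restores the underlying decisions unchanged. A minimal sketch, with assumed method names:

```typescript
// Illustrative kill-switch gate; names are assumptions, not the product API.
class KillSwitch {
  private halted = false;

  halt(): void {
    this.halted = true;
  }

  resume(): void {
    this.halted = false;
  }

  // Every authorization result passes through this gate last. While halted,
  // even an otherwise-authorized action is denied, for every agent.
  permit(decision: boolean): boolean {
    return this.halted ? false : decision;
  }
}
```

Because the gate wraps the final decision rather than individual agents, "stops all agents immediately" holds by construction: there is no agent-by-agent shutdown to get partially through.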
Suite 4 100%
Behavioral Monitoring
Tests runtime anomaly detection including frequency bursts, privilege probing, quarantine enforcement, and agent-scoped behavioral baselines. 51 scenarios.
Confirmed test result
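"Frequency burst" detection with agent-scoped baselines can be as simple as a sliding-window counter kept per agent. The thresholds and class below are assumptions for illustration; the suite tests the observable behavior, not any particular implementation.

```typescript
// Sketch of burst detection with per-agent sliding windows (assumed thresholds).
class BurstMonitor {
  private timestamps = new Map<string, number[]>();

  constructor(private windowMs: number, private maxPerWindow: number) {}

  // Record one action. Returns true when this agent exceeds its allowed
  // rate inside the sliding window and should be quarantined. Windows are
  // keyed by agentId, so one agent's burst never flags another.
  record(agentId: string, nowMs: number): boolean {
    const prior = this.timestamps.get(agentId) ?? [];
    const recent = prior.filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.timestamps.set(agentId, recent);
    return recent.length > this.maxPerWindow;
  }
}
```

The agent-scoped keying is the part that matters for this suite: baselines are per agent, so a chatty-but-normal agent does not raise the threshold for, or trip alarms about, its quieter peers.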
Suite 5 100%
OWASP Agentic Top 10
Tests coverage of OWASP ASI-01 through ASI-10 risks including goal hijacking, tool misuse, identity abuse, and cascading failures. 10 scenarios.
Confirmed test result

GovernanceBench measures the controls that matter before autonomy scales.

01
Authorization
Can the system prevent unauthorized actions before they execute?
02
Auditability
Can the system produce a usable record of what happened, why, and under which policy?
03
Override
Can an operator halt or override the agent before damage compounds?
04
Behavioral
Can the system detect and contain drift from expected behavior?
05
OWASP
Can the system map agentic risks to recognized security categories?

GovernanceBench is designed to be run, not just cited.

Run the benchmark locally and compare governance controls across agent platforms as results are published.

Apache 2.0 licensed open standard. Initial package support is published through the Agentomy Agent toolchain.

$ npx governancebench
npm package · GitHub

Transparent tests. Repeatable scoring. No hidden claims.

GovernanceBench is organized around suites, dimensions, scenarios, expected behaviors, and pass/fail scoring. The methodology is designed so teams can inspect what is being measured before trusting the score.
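The structure described above can be made concrete with a small schema sketch. Field names here are illustrative, derived only from the terms the methodology uses (suite, dimension, scenario, expected behavior, pass/fail); the actual package's types may differ.

```typescript
// Hypothetical scenario and scoring shapes, for inspection purposes only.
interface Scenario {
  suite: number; // 1..5
  id: string;
  expectedBehavior: string; // what the platform is supposed to do
  passed: boolean;
  skipped: boolean;
}

interface SuiteScore {
  suite: number;
  passed: number;
  failed: number;
  skipped: number;
  percent: number; // pass rate over non-skipped scenarios
}

function scoreSuite(suite: number, scenarios: Scenario[]): SuiteScore {
  const inSuite = scenarios.filter((s) => s.suite === suite);
  const skipped = inSuite.filter((s) => s.skipped).length;
  const passed = inSuite.filter((s) => !s.skipped && s.passed).length;
  const failed = inSuite.length - passed - skipped;
  const scored = inSuite.length - skipped;
  return {
    suite,
    passed,
    failed,
    skipped,
    percent: scored === 0 ? 0 : Math.round((passed / scored) * 100),
  };
}
```

Scoring is deliberately flat: a scenario either exhibits the expected behavior or it does not, which is what makes results repeatable across platforms.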

Methodology documentation available on request.

Five dimensions. 211 scenarios total. Apache 2.0 licensed open standard.

A benchmark becomes useful when multiple systems can be measured against it.

GovernanceBench is built to support platform comparison as public scores become available. No competitor results are claimed until they are published through the same methodology.

Platform         Suite 1   Suite 2   Suite 3   Suite 4   Suite 5   Overall   Status
Agentomy         100%      100%      100%      100%      100%      100/100   Published result
Other platforms  Pending   Pending   Pending   Pending   Pending   Pending   No public score yet

Run the benchmark. Review the demo. Decide what governance should prove.

Agentomy publishes the benchmark because governance should be measurable before autonomous agents become operational infrastructure.