Every engineering system that matters is tested. Application code has unit tests. APIs have integration tests. Infrastructure has smoke tests. These are not optional practices — they are the mechanism by which you know whether the system works. Remove the tests, and you don't have a tested system; you have an untested one that may or may not behave as intended.

Governance systems are engineering systems. They have inputs (queries, file paths, task descriptions), processes (decision retrieval, context injection, evaluation), and outputs (verdicts, enforcement signals). And like any engineering system, their behavior must be testable — not by running them and hoping the result looks right, but by pre-registering what the result must be before the run, then evaluating whether the system produced it.

Verification contracts are that pre-registration mechanism. They are what make governance systems testable — and therefore improvable, auditable, and trustworthy.

What verification contracts actually mean

A verification contract specifies, before any code is generated, the complete expected behavior of a governance check against a specific scenario. The contract has four components:

  1. The query or scenario: The file path and task description that will be submitted to the governance system. This is the input — the thing the agent is about to do.
  2. The expected decision IDs: The specific decision records that must be retrieved from the decision corpus for this query. This specifies what Layer 1 (retrieval) must produce. If these decisions are not retrieved, the governance chain would have failed even if the enforcement layer didn't notice.
  3. The expected verdict: Whether the governance system should produce PASS or FAIL (or WEAK) for this scenario. This specifies what Layer 2 (enforcement) must produce.
  4. The expected failure terms: The specific strings that must appear in the model's output when governance is absent — confirming that the scenario genuinely tests a governed constraint, not a constraint the model would follow anyway. This is the baseline against which governance is measured.

All four components are written before the scenario runs. This is the pre-registration property: the contract specifies expected behavior before observation. After the scenario runs, the system's actual behavior is compared to the contract. A match on all four components means the governance system is working correctly for this scenario; any mismatch is a specific, actionable failure signal.
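As a minimal sketch, the four components above can be captured in a single immutable record. The field names here are illustrative, not Mneme's actual schema:

```python
# Sketch of a verification contract as a frozen record.
# Field names are illustrative, not Mneme's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the contract is fixed before the run
class VerificationContract:
    scenario_id: str
    query: str                      # 1. the scenario the agent will attempt
    expected_decision_ids: tuple    # 2. what retrieval (Layer 1) must produce
    expected_verdict: str           # 3. "PASS", "FAIL", or "WEAK" (Layer 2)
    expected_failure_terms: tuple   # 4. strings the ungoverned model must emit

contract = VerificationContract(
    scenario_id="scen-example",
    query="services/example/handler.py — add endpoint",
    expected_decision_ids=("dec-example-constraint",),
    expected_verdict="FAIL",
    expected_failure_terms=("LegacyClient",),
)
```

Freezing the record mirrors the pre-registration property: once written, the expected behavior cannot be quietly adjusted after the run.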

A verification contract is not a test case that checks whether code is correct. It is a test case that checks whether the governance system is correct. The subject of the test is the governance layer itself — its retrieval behavior, its enforcement accuracy, and its ability to detect violations that the model would produce without constraints.

Why this problem exists in AI-native development

Governance systems for AI coding face a measurement problem that does not exist for simpler enforcement layers. A linter can be tested by running it against known-bad code and verifying it flags the right lines. The expected behavior is obvious. A governance system that uses LLM-based evaluation is more complex: the behavior depends on what decisions were retrieved, how they were formatted, what the model saw, and how it interpreted the constraints. The failure modes are not obvious — a governance system can appear to work while silently failing on specific scenario classes.

Without verification contracts, governance quality assessment is forced into post-hoc subjectivity. A team runs the governance system, looks at the verdicts, and asks: "Do these look right?" If they look right, the system is declared to be working. If they don't, the team investigates. But "looks right" is not a specification — it is a human judgment that varies between reviewers, changes as context changes, and cannot be automated.

The deeper problem: without pre-registration, the evaluation can be gamed — even unintentionally. A team evaluating their governance system's quality will naturally gravitate toward scenarios where the system works well. They run the scenarios, see good results, and report high quality. The scenarios that stress-test the system — edge cases, ambiguous queries, borderline constraint matches — are the ones that get excluded because they're hard to evaluate post-hoc. The result is a governance system with measured high quality that has never been tested on its most important failure modes.

Pre-registration prevents governance gaming by committing the expected behavior before the result is seen. You cannot adjust the expected verdict after seeing that the system produced FAIL instead of PASS. The contract is the specification; the system either satisfies it or doesn't. This is the same integrity property that makes pre-registration valuable in scientific studies: it prevents the specification from being adjusted to match the observed results.
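The comparison itself is mechanical. A minimal sketch, assuming contracts are plain dicts and a run produces retrieved decision IDs, an enforcement verdict, and the ungoverned baseline output (all names here are hypothetical):

```python
def check_contract(contract, retrieved_ids, verdict, baseline_output):
    """Compare observed behavior against the pre-registered contract.

    Returns a list of specific failure signals; an empty list means the
    governance system satisfied the contract for this scenario."""
    failures = []
    # Layer 1: every expected decision must have been retrieved.
    missing = set(contract["expected_decision_ids"]) - set(retrieved_ids)
    if missing:
        failures.append(f"retrieval: missing decisions {sorted(missing)}")
    # Layer 2: the enforcement verdict must match the contract.
    if verdict != contract["expected_verdict"]:
        failures.append(
            f"enforcement: got {verdict}, expected {contract['expected_verdict']}")
    # Baseline: the ungoverned model must actually exhibit the violation.
    if not any(t in baseline_output for t in contract["expected_failure_terms"]):
        failures.append("baseline: no expected failure term in ungoverned output")
    return failures

example = {"expected_decision_ids": ["dec-1"],
           "expected_verdict": "FAIL",
           "expected_failure_terms": ["PaymentsV1"]}
check_contract(example, ["dec-1"], "PASS", "uses PaymentsV1 client")
# → ["enforcement: got PASS, expected FAIL"]
```

Because the contract is fixed before the run, a mismatch names the failing component directly instead of leaving a reviewer to judge whether the verdict "looks right".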

The common misread: subjective evaluation as governance quality measurement

Teams without verification contracts typically assess governance quality through one of three approaches, all of which are structurally insufficient:

"LGTM" review: Human reviewers look at governance verdicts and approve them. This is subjective, inconsistent across reviewers, and cannot be automated. It also produces no baseline: there is no specification against which the verdicts are measured, so there is no objective definition of what "working" means.

Ad-hoc testing: The governance system is run against some scenarios and the results are inspected. Without pre-registration, the selection of scenarios is influenced by knowledge of which scenarios the system handles well. The coverage is unknown. Regressions are invisible — if the system previously handled a scenario correctly and now handles it incorrectly, there is no contract to compare against, so the regression goes undetected.

Post-hoc audit: The governance system's verdicts are audited periodically. This catches large failures but misses small regressions, boundary condition failures, and retrieval-layer failures that produce correct-looking verdicts through incorrect mechanisms. A system that produces PASS by failing to retrieve the relevant decision (and therefore never enforcing it) looks identical to a system that produces PASS by correctly enforcing the decision — unless you have a contract specifying which decisions must be retrieved.

| Approach | What it measures | What it misses |
| --- | --- | --- |
| "LGTM" review | Whether verdicts look reasonable | Retrieval failures, baseline confirmation, regressions |
| Ad-hoc testing | Behavior on selected scenarios | Coverage gaps, scenario selection bias, regression detection |
| Post-hoc audit | Large failures over time | Small regressions, mechanism correctness, retrieval layer |
| Verification contracts | Full behavioral specification | Nothing within contract scope — gaps are known and explicit |

The table illustrates the structural difference: the first three approaches measure proxies for governance quality. Verification contracts measure governance quality directly, against a pre-specified expected behavior that cannot be adjusted post-hoc.

How this fits the AI SDLC

Verification contracts sit at the validation layer of the AI SDLC — Layer 6, above governance and architectural control, below human oversight. They are the testing infrastructure for the governance layer itself. And like any testing infrastructure, they require deliberate investment.

In Mneme's benchmark methodology, verification contracts are implemented as scenario objects in the benchmark fixture. Each scenario is a complete contract:

```
# Mneme benchmark scenario — a complete verification contract
{
  "id": "scen-payments-deprecated-client",
  "description": "Agent attempts to use deprecated PaymentsV1 client",
  "query": "services/payments/handler.py — add charge endpoint",
  "expected_protected_decision_ids": ["dec-payments-client-v2-only"],
  "expected_verdict": "FAIL",
  "expected_failure_terms": ["PaymentsV1", "payments_v1_client"]
}
```

This contract specifies: for the query "services/payments/handler.py — add charge endpoint", the governance system must (1) retrieve decision ID "dec-payments-client-v2-only", (2) produce a FAIL verdict, and (3) the model without governance must produce output containing "PaymentsV1" or "payments_v1_client" (confirming the constraint is genuine, not hypothetical).

When the benchmark runs against this contract:

  • recall@K: Was "dec-payments-client-v2-only" in the top-K retrieved decisions? If not, the governance chain would have failed regardless of the enforcement layer's behavior.
  • pass_rate: Did the enforcement verdict match the expected verdict? If the governance system produced PASS where FAIL was expected, the constraint was not enforced.
  • Baseline confirmation: Did the model without governance produce the expected failure terms? If not, the scenario is not a genuine test of the governance constraint — it's testing behavior the model would exhibit anyway.

Each metric corresponds to a specific component of the governance system. recall@K measures the retrieval layer. pass_rate measures the enforcement layer. Baseline confirmation validates the scenario specification itself. The verification contract makes all three measurable, comparable across governance system versions, and automatable in CI.
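As a sketch of how the two suite-level metrics fall out of the contracts — the field names follow the scenario example above, but this is not Mneme's actual benchmark code:

```python
def recall_at_k(expected_ids, retrieved_ids, k):
    """Fraction of a contract's expected decisions found in the top-K retrieved."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for d in expected_ids if d in top_k) / len(expected_ids)

def suite_metrics(results, k=5):
    """results: (contract, retrieved_ids, verdict) triples, one per scenario."""
    recalls = [recall_at_k(c["expected_protected_decision_ids"], r, k)
               for c, r, _ in results]
    passes = sum(1 for c, _, v in results if v == c["expected_verdict"])
    return {"mean_recall_at_k": sum(recalls) / len(recalls),
            "pass_rate": passes / len(results)}

scenario = {"expected_protected_decision_ids": ["dec-payments-client-v2-only"],
            "expected_verdict": "FAIL"}
metrics = suite_metrics([(scenario, ["dec-other", "dec-payments-client-v2-only"], "FAIL")])
# → {"mean_recall_at_k": 1.0, "pass_rate": 1.0}
```

Because each metric is computed per contract, a drop in either number points back to the specific scenarios that regressed.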

The practical implication for governance system development: every change to the retrieval system, the decision corpus, or the enforcement logic must be validated against the full set of verification contracts before merge. A change that improves behavior on some scenarios while regressing on others is visible — the regressions show up as contract failures. Without contracts, that regression is invisible until a human happens to evaluate the affected scenario.
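One way to wire the contract suite into a merge gate — a sketch; the gate logic and result format are assumptions, not Mneme's actual CI configuration:

```python
import json

def merge_gate(contract_results):
    """Block the merge if any verification contract is unsatisfied.

    contract_results maps scenario id -> list of failure signals from the
    benchmark run (empty list = contract satisfied). Returns a process
    exit code: 0 allows the merge, nonzero blocks it."""
    regressions = {sid: fails for sid, fails in contract_results.items() if fails}
    if regressions:
        # Emit the specific contract failures so the regression is actionable.
        print(json.dumps(regressions, indent=2))
        return 1
    return 0
```

With this shape, a change that improves some scenarios while regressing others surfaces as named contract failures in the build log rather than a silent behavior shift.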

Verification contracts make governance systems improvable. Without them, you can observe that governance seems to be working — but you can't measure by how much, can't detect regressions, and can't compare governance system versions objectively. With contracts, every change to the governance system produces a measurable effect on contract satisfaction. The governance system becomes an engineered system with a test suite, not a black box with a hope attached.

Related concepts

Verification contracts are the testing layer for the governance layer. Three adjacent concepts describe what they test:

  • Deterministic enforcement — the property that the same query against the same decision corpus always produces the same enforcement result. Without determinism, verification contracts cannot be reliable: a contract might pass on one run and fail on another for the same scenario, making the contract itself untrustworthy. Determinism is the prerequisite for contracts being meaningful.
  • Governance benchmark v1.1 methodology — the full specification of how verification contracts are used in Mneme's benchmark suite: scenario structure, metric definitions (recall@K, pass_rate, WEAK_RETRIEVAL count), baseline confirmation methodology, and merge gate requirements. The benchmark is the operationalization of verification contracts at suite scale.
  • How retrieval works — the retrieval layer that verification contracts test at Layer 1. recall@K as a contract metric only makes sense in the context of how retrieval decisions are scored and ranked. Understanding the retrieval pipeline is the prerequisite for understanding why retrieval-layer contracts are necessary.
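The determinism prerequisite in the first bullet is itself checkable: run the same scenario repeatedly and require identical retrieval and verdict every time. A sketch, with run_scenario standing in for the governance system's entry point (the name is hypothetical):

```python
def is_deterministic(run_scenario, scenario, trials=3):
    """Return True if repeated runs of the same scenario yield identical
    (retrieved_ids, verdict) results. run_scenario is a stand-in for the
    governance system's entry point; the name is illustrative."""
    results = [run_scenario(scenario) for _ in range(trials)]
    return all(r == results[0] for r in results)
```

If this check fails, contract results cannot be trusted — a contract might pass on one run and fail on the next for reasons unrelated to the governance logic under test.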