How is agent verification different from tests?

Tests answer whether the code behaves correctly for known inputs. Agent verification answers whether the change the agent made is allowed to exist in the first place — whether it respects architectural boundaries, dependency policies, layering contracts, and the ADRs the team has accumulated. A change can pass every test while violating every architectural decision the team has made.

How is agent verification different from observability?

Observability records what ran, when, with which tools, and what it cost. It tells you the run happened. Agent verification tells you whether the run preserved what was supposed to remain true — a different question with a different answer surface. You can have complete observability and zero verification, and never know the architecture is drifting.

Agent Verification — Mneme HQ Concepts

What does agent verification actually verify?

Three categories: (1) architectural intent — ADRs, layering rules, dependency policies; (2) operational constraints — rate limits, security boundaries, allowed surfaces; (3) system invariants — properties that must hold true regardless of any specific change. Verification proves that each category survived the autonomous run by checking the run's outputs against a structured contract.

The defining problem of autonomous engineering is not whether agents can complete work. They can. It is whether the work, once completed, preserved the system the team is operating. Execution success is not architectural correctness. A long-running agent can ship features that pass every test, satisfy every reviewer, and deploy without incident, and still leave the codebase less coherent than it was when the run started. Verification is the layer that closes that gap.

Two gates, same run, different questions. Execution asks whether the work ran. Verification asks whether it should exist. Both have to pass.

Why execution success is not architectural correctness

Tests, build pipelines, deploys, and incident dashboards all answer one question: did the system stay up? That is the question they were designed for, and they answer it well. None of them answer the question that becomes load-bearing when generation is autonomous: was the change architecturally allowed to exist?

An autonomous agent shipping a feature can:

Introduce a forbidden dependency that the team explicitly decided to remove six months ago. The tests still pass. The build still ships. The decision is now silently violated.
Cross a layering boundary: a controller calls directly into a data layer the architecture forbids it from touching. The new call works. The boundary, which existed for a reason, no longer holds.
Replace a governed pattern with a sensible-looking alternative. The replacement is functionally equivalent and locally cleaner. It also breaks an invariant that downstream systems depend on.
Mutate an infrastructure standard — how services are exposed, configured, or deployed — in a way that drifts away from the team's established pattern without anyone noticing in review.

Every one of those failures is undetectable by the execution gate. They surface, if they surface at all, weeks later as drift telemetry, an incident postmortem, or a senior engineer's complaint that "the codebase doesn't feel right anymore." That delay is the cost of having no verification gate.
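Two of the failures listed above, the forbidden dependency and the layering breach, can be caught mechanically. The following is a minimal sketch in Python, not a description of any particular tool: the module name `legacy_orm`, the layer names, and the `violations` helper are hypothetical, and it assumes the layer of each changed file is already known.

```python
import ast

# Hypothetical policy; names and rules are illustrative only.
FORBIDDEN_MODULES = {"legacy_orm"}            # dependency the team decided to remove
FORBIDDEN_EDGES = {("controllers", "data")}   # (importing layer, imported layer)


def imported_modules(source: str) -> set:
    """Return the top-level module names imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def violations(layer: str, source: str) -> list:
    """List policy violations for one changed file that lives in `layer`."""
    out = []
    for module in sorted(imported_modules(source)):
        if module in FORBIDDEN_MODULES:
            out.append(f"forbidden dependency: {module}")
        if (layer, module) in FORBIDDEN_EDGES:
            out.append(f"layering violation: {layer} -> {module}")
    return out
```

For a controller file containing `import legacy_orm` and `from data.users import find`, `violations("controllers", source)` reports both problems, even if the file passes every test.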

What agent verification verifies

Agent verification operates on three categories of property. Each is structurally different from the others; each requires a different kind of contract to evaluate.

1. Architectural intent

The decisions the team has accumulated about how the system is structured — ADRs, layering rules, dependency policies, allowed patterns, deprecated patterns. Verification of architectural intent answers: does this change respect the active architectural decision graph? The contract is the decision graph itself, resolved deterministically against the change's scope.
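One way to picture resolving the decision graph against a change's scope is as a lookup: given the changed paths, return the decisions that are still active and apply to them. The sketch below assumes a hypothetical `Decision` record with a glob-style `scope` and a `superseded_by` link; real ADR formats vary, and none of these field names come from a specific tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class Decision:
    adr_id: str
    scope: str                           # glob over repository paths (assumed format)
    superseded_by: Optional[str] = None  # set when a later ADR replaces this one


def active_decisions(decisions, changed_paths):
    """Return IDs of non-superseded decisions whose scope matches a changed path."""
    hits = [
        d.adr_id
        for d in decisions
        if d.superseded_by is None
        and any(fnmatch(path, d.scope) for path in changed_paths)
    ]
    return sorted(hits)  # sorted so the result does not depend on input order
```

Sorting the result is the design choice that matters here: the same decisions and the same change always resolve to the same list, regardless of the order in which ADRs were loaded.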

2. Operational constraints

The constraints the agent is allowed to operate within during the run — rate limits on external APIs, security boundaries on which tools may touch which resources, allowed write surfaces, mandatory approval gates. Verification of operational constraints answers: did the run stay inside the operational envelope the team defined for autonomous work? The contract is the envelope specification.
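As a rough illustration, an envelope specification can be a small record that the run's recorded activity is checked against. The `Envelope` fields and the `check_envelope` function below are assumptions made for this sketch; they cover only three of the constraints named above (a rate limit, allowed write surfaces, and approval gates).

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Envelope:
    allowed_write_globs: tuple           # paths the agent may write to
    max_api_calls: int                   # rate limit for the whole run
    approval_required_globs: tuple = ()  # writes here need explicit approval


def check_envelope(envelope, written_paths, api_calls, approved_paths=()):
    """Return a list of breaches; an empty list means the run stayed inside."""
    breaches = []
    if api_calls > envelope.max_api_calls:
        breaches.append(f"rate limit exceeded: {api_calls} > {envelope.max_api_calls}")
    for path in sorted(written_paths):
        if not any(fnmatch(path, g) for g in envelope.allowed_write_globs):
            breaches.append(f"write outside allowed surface: {path}")
        elif (any(fnmatch(path, g) for g in envelope.approval_required_globs)
              and path not in approved_paths):
            breaches.append(f"missing approval: {path}")
    return breaches
```

The inputs here describe what the run did (paths written, calls made), not what the resulting code looks like, which is what separates this check from the architectural one.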

3. System invariants

Properties that must hold true regardless of the specific change — "every public endpoint has authentication," "no service writes directly to another service's database," "every migration has a rollback path." Verification of invariants answers: did the run preserve every property that must always be true? The contract is the set of invariants, evaluated against the post-change state.
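The three example invariants can be written as predicates over a snapshot of the post-change state. This is only a sketch: the snapshot shape (`endpoints`, `db_writes`, `migrations`) is a hypothetical structure chosen for illustration, and producing such a snapshot from a real system is the harder part.

```python
# Each invariant is a predicate over a post-change snapshot (assumed shape).
INVARIANTS = {
    "every public endpoint has authentication":
        lambda s: all(e["auth"] for e in s["endpoints"] if e["public"]),
    "no service writes directly to another service's database":
        lambda s: all(w["service"] == w["db_owner"] for w in s["db_writes"]),
    "every migration has a rollback path":
        lambda s: all(m["has_rollback"] for m in s["migrations"]),
}


def broken_invariants(state: dict) -> list:
    """Return the names of invariants that do not hold in `state`, sorted."""
    return sorted(name for name, holds in INVARIANTS.items() if not holds(state))
```

Note that the predicates never look at the diff. They evaluate the state after the change, so an invariant broken indirectly is reported the same way as one broken by an obvious edit.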

The three categories are independent. A change can satisfy architectural intent and operational constraints while violating an invariant; or satisfy invariants while drifting from intent. Verification has to evaluate each separately.
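One way to keep the categories separate is to carry them as distinct fields in the verdict instead of collapsing them into a single pass/fail flag early. The `Verdict` class below is a hypothetical shape for such a result, offered as a sketch only.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Findings kept per category so one category cannot mask another."""
    intent: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    invariants: list = field(default_factory=list)

    def failed_categories(self) -> list:
        names = ("intent", "constraints", "invariants")
        return [name for name in names if getattr(self, name)]

    @property
    def passed(self) -> bool:
        # The overall verdict passes only when every category is clean.
        return not self.failed_categories()
```

A change that satisfies intent and constraints but breaks an invariant then shows up as `failed_categories() == ["invariants"]`, which tells the team which contract to look at.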

The contract is the substrate

Verification is only as good as the artifacts it evaluates against. A verification gate that runs without a structured contract is just opinion-as-CI — a senior engineer's heuristics encoded as a script, fragile, and unable to grow with the team. A verification gate that runs against a verification contract — a pre-registered, machine-evaluable assertion about what must remain true — produces a verdict that has the same shape every time the same conditions hold.

This is the substrate that makes verification an engineering discipline rather than a review style. The contract is committed to the repository alongside the code. The verification gate reads the contract and the change. The verdict is reproducible: same contract, same change, same verdict. That property — deterministic enforcement — is what makes verification something a team can trust at scale.
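"Same contract, same change, same verdict" is achievable when the gate is a pure function of its two inputs. The sketch below assumes a deliberately tiny contract (a list of forbidden imports) and a hypothetical `verify` function; the fingerprint is a SHA-256 over a canonical JSON encoding so that two runs can be compared byte for byte.

```python
import hashlib
import json


def fingerprint(contract: dict, change: dict) -> str:
    """Hash a canonical encoding of the inputs (dict keys sorted, fixed separators)."""
    canonical = json.dumps(
        {"contract": contract, "change": change},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(contract: dict, change: dict) -> dict:
    """Pure function of its inputs: no clock, network, or randomness."""
    found = sorted(set(change["imports"]) & set(contract["forbidden_imports"]))
    return {
        "passed": not found,
        "violations": found,
        "fingerprint": fingerprint(contract, change),
    }
```

Anything that would make the result depend on when or where the gate runs (timestamps, live API lookups, unordered iteration) is kept out of `verify`, which is the property that lets a verdict be reproduced later.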

Verification across long-running runs

The case for agent verification gets sharper as runs get longer. A single PR from a junior engineer is governed by review, and a missed violation surfaces in the next refactor. A long-running autonomous workflow that touches dozens of files across many sessions does not have that backstop. Each session makes locally reasonable choices. The cumulative effect is drift — and drift is exactly what verification is designed to catch.

The asymmetry matters: as agent autonomy increases, the gap between execution success and architectural correctness widens. Verification is the layer that keeps that gap measurable and closable.

What verification is not

The category boundary is sharp. Verification is not the same as any of the adjacent disciplines it touches.

Discipline	Question it answers	What verification adds
Unit tests	Does the code do what the test asserts?	Was the code allowed to exist at all?
Eval harnesses	Did the model output match a benchmark?	Did the change satisfy the team's architectural contract?
Observability	What ran, when, how long, at what cost?	Was the run's effect on the system permitted?
Code review	Does a human approve the diff?	Does the diff pass a deterministic, scalable check?
Linters & static analysis	Are there obvious bugs or style errors?	Are the team's specific architectural decisions intact?

None of these disciplines compete with verification. Each answers a different question, and a serious team runs several of them in parallel. Verification fills the gap where the other gates do not have a structured answer.

Where verification sits in the runtime stack

Verification is the top layer of the agent infrastructure stack. It runs after the agent has produced output and before that output is treated as canonical — pre-commit, pre-PR, in CI, before deploy. Its inputs are the agent's diff and side effects; its evaluation substrate is the verification contract; its output is a verdict that gates progression of the run.
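In a pre-commit hook or CI job, gating progression usually comes down to a process exit code. The sketch below assumes a hypothetical `gate` function that takes the change and a mapping of named checks, each returning a list of findings; how the change and the checks are loaded is left out.

```python
import sys


def gate(change: dict, checks: dict) -> int:
    """Run every check; return 0 to let the run proceed, 1 to block it."""
    failures = []
    for name, check in checks.items():
        failures.extend(f"{name}: {message}" for message in check(change))
    for line in failures:
        print(line, file=sys.stderr)  # findings go to stderr for the CI log
    return 1 if failures else 0
```

A hook script would end with `sys.exit(gate(change, checks))`, so a non-empty set of findings stops the commit, the PR, or the deploy at whichever stage the gate is wired into.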

The companion layer beneath it is governance infrastructure — the layer that defines what must remain true. Governance encodes the team's intent; verification proves whether intent survived. Without governance, verification has nothing to evaluate against. Without verification, governance is documentation.

Governance defines what must remain true. Verification proves that it did. One layer without the other is incomplete.

The discipline, in one sentence

Tests answer whether code works. Eval answers whether output is good. Observability answers what happened. Verification answers whether intent survived — and is what makes "the agent completed the run" mean something the team can trust as architecturally correct, not just operationally green.

Related concepts

Verification contracts — the substrate verification evaluates against. Pre-registered, machine-evaluable assertions defining what a passing change must prove.
Governance infrastructure — the layer that defines what must remain true. Verification proves that it did.
Deterministic enforcement — the property that makes verification reproducible. Same contract, same change, same verdict.
Architectural drift — what an absent verification layer accumulates over time.
Enforcement provenance — the citable chain from a verification verdict back to the authoring ADR.
