Governance Benchmark v1.1 — Methodology

Published May 6, 2026 · Status: methodology live, full results pending suite validation · License: MIT

Governance Benchmark v1.1 is a deterministic, reproducible benchmark measuring whether Mneme reduces architectural rule violations in AI-generated code: 36 scenarios across six categories, with structured-output verification, layered retrieval and enforcement metrics, pre-registered thresholds, and fully auditable scoring. No subjective grading.

This is the methodology specification for the Mneme governance benchmark. It describes what we measure, how we score, what we explicitly do not claim, and how anyone can reproduce the numbers. The intent is methodology before metrics. We are publishing the spec first so it can be pressure-tested before any numbers attach to it.

01 — Purpose

The Mneme Governance Benchmark validates whether Mneme reduces architectural rule violations in AI-generated code. It answers one question:

When an AI coding agent is asked to make a change in a codebase that has known architectural constraints, does Mneme produce code that respects those constraints more often than the same agent without Mneme, without over-enforcing on changes where the constraints do not apply?

The "without over-enforcing" clause is doing real work. A governance system that catches violations by refusing every change is not helpful, and the original v1.0 draft did not measure this rigorously enough. v1.1 corrects that.

02 — Scope

In scope. Pre-generation governance. The benchmark measures the rate at which the model's generated output respects a given architectural rule when Mneme injects the relevant rule, versus when it does not, separately from whether Mneme retrieved the right rule in the first place.

Out of scope. Post-generation observability, runtime enforcement, security scanning, performance optimization, code style preferences. Those are separate measurements with separate tooling.

03 — Layered measurement

v1.0 measured a single end-to-end outcome. v1.1 separates this into two layers because conflating them hides real product issues.

Layer 1: retrieval

Did Mneme surface the right rules for this prompt? Each scenario specifies which rules from the project's decision store should be retrieved. Mneme's retrieval is run independently and scored against ground truth.

Layer 2: enforcement

Given the right rule was retrieved, did the governed output respect it? Two sub-measurements: end-to-end enforcement, where Mneme's own retrieval supplies the rules, and oracle enforcement, where the ground-truth rules are injected directly and retrieval is bypassed.

The delta between end-to-end and oracle enforcement attributes failures to retrieval versus injection-and-judgment.

Why this matters. A user who hits a violation needs to know which layer broke. Without the layer split, a benchmark publishes a single number that hides which part of the system failed. With the layer split, retrieval issues and enforcement issues are diagnosed separately and fixed independently.
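
A minimal sketch of how that attribution could be automated, assuming each scenario yields three booleans (correct retrieval, end-to-end respected, oracle respected). The record and field names are hypothetical, not Mneme's actual interfaces.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    # Hypothetical per-scenario record; field names are illustrative, not Mneme's actual API.
    retrieved_correct_rules: bool  # Layer 1: ground-truth rules appeared in Mneme's retrieval
    e2e_respected: bool            # Layer 2, end-to-end: governed output respected the rule
    oracle_respected: bool         # Layer 2, oracle: output respected the rule with ground-truth rules injected

def attribute_failure(r: ScenarioResult) -> str:
    # Classify where a governed-run violation came from.
    if r.e2e_respected:
        return "no failure"
    if not r.retrieved_correct_rules:
        # The right rule never reached the prompt, so enforcement was never exercised end to end.
        return "retrieval failure (Layer 1)"
    if not r.oracle_respected:
        # Even with the ground-truth rules injected, the model violated: injection-and-judgment failed.
        return "enforcement failure (Layer 2)"
    # Rules were retrieved and the oracle pass respected them, yet the end-to-end run violated.
    return "inconsistent across layers (inspect manually)"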

04 — Structured output protocol

v1.0 verified outcomes by parsing freeform model output. This is gameable: a model can write the right words in prose while violating the rule in code. v1.1 closes this attack surface by requiring the model to emit a structured artifact.

Response schema

The system prompt instructs the model to respond with JSON conforming to this schema. Verifiers operate on the parsed artifact, not on prose.

{
  "files_to_modify": [
    { "path": "src/foo.py", "new_content": "..." }
  ],
  "files_to_create": [
    { "path": "src/bar.py", "content": "..." }
  ],
  "files_to_delete": [
    { "path": "src/old.py" }
  ],
  "new_dependencies": [
    { "name": "requests", "version_constraint": ">=2.31" }
  ],
  "rationale": "Free text. Not used for scoring.",
  "refused": false,
  "refusal_reason": null,
  "implementation_plan": [
    "Step 1: ...",
    "Step 2: ..."
  ]
}

Verifiers never read rationale or implementation_plan for scoring. Those fields exist for human auditing of borderline cases. The single exception is the verifier for ambiguous scenarios, which IS allowed to inspect rationale because the question being asked is whether the model surfaced ambiguity at all.

If the model refuses, refused: true and refusal_reason is required. A refusal is a candidate respected outcome only if the rule actually applied; on a control scenario, a refusal is a false positive.

Why this closes gaming

A model cannot earn a respected verdict by writing the right words. It must produce code that, when parsed, contains the right files, the right imports, and no forbidden patterns. The verifier inspects facts about the artifact, not claims about it.
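
As an illustration, here is a minimal verifier sketch for a hypothetical rule banning a direct dependency on requests. It inspects only the parsed artifact fields from the schema above; the rule, function names, and regex are invented for this example, not part of the published verifier suite.

import json
import re

# Hypothetical rule for illustration: no direct dependency on, or import of, requests.
FORBIDDEN_IMPORT = re.compile(r"^\s*(import requests\b|from requests\b)", re.MULTILINE)

def verify(raw_response: str, is_control: bool) -> str:
    artifact = json.loads(raw_response)  # malformed JSON is a protocol failure handled upstream

    if artifact.get("refused"):
        # A refusal only counts as respected when the rule actually applied.
        return "false positive" if is_control else "respected"

    touched = artifact.get("files_to_modify", []) + artifact.get("files_to_create", [])
    code = [f.get("new_content") or f.get("content", "") for f in touched]
    added_deps = {d["name"] for d in artifact.get("new_dependencies", [])}

    # Facts about the artifact, not claims in rationale: forbidden dependency or import anywhere?
    violated = "requests" in added_deps or any(FORBIDDEN_IMPORT.search(c) for c in code)
    return "violated" if violated else "respected"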

05 — Suite composition

36 scenarios across six categories. Controls and ambiguous cases are first-class.

Category | Count | Purpose
Architectural violations | 8 | Core architectural rules: extend before rebuild, no parallel systems, layering
Scope and boundary | 6 | Module boundaries, API contracts, public surface stability
Anti-pattern violations | 6 | Singletons, god objects, hidden globals, banned approaches
Dependency and tooling | 4 | Forbidden imports, banned libraries, lockfile changes
Ambiguous / borderline | 4 | Multiple rules apply, partial conflicts, unclear scope
Control / non-applicable | 8 | Rules exist, prompts unrelated, correct behavior is no intervention

Controls are 22 percent of the suite. Anything below 20 percent makes the false positive rate non-credible.

06 — Source mix

A synthetic-only suite is not credible. v1.1 requires real-incident scenarios.

Source | Count | Notes
Synthetic canonical | 12 | Constructed to test specific rules cleanly
Real Mneme repo drift | 6 | Drawn from prompts that produced drift in development
Real CannabisDealsUS drift | 4 | Same source, different domain
Adversarial / edge | 6 | Prompts crafted to look innocent but hit a rule
Controls | 8 | Rule exists, prompt unrelated

Real-incident scenarios are 28 percent of the suite, satisfying the credibility floor.

07 — Difficulty calibration

Calibration is verified empirically by running the baseline against the suite before publication. If the actual baseline violation rate on a "medium" scenario is 90 percent, it is reclassified as hard. The labels describe observed difficulty, not assumed difficulty.
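
One way the reclassification could be mechanized is sketched below; the band boundaries are assumptions for illustration, not the benchmark's published calibration rules.

def observed_difficulty(baseline_violation_rate: float) -> str:
    # Illustrative bands only; real boundaries would be fixed before publication.
    if baseline_violation_rate >= 0.80:
        return "hard"
    if baseline_violation_rate >= 0.40:
        return "medium"
    return "easy"

# A scenario labeled "medium" that baselines at a 90 percent violation rate is relabeled "hard".
assert observed_difficulty(0.90) == "hard"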

08 — Verdict thresholds

All four are required. Pre-committed and will not move at decision time.

  1. Violation prevention rate ≥ 75%. On non-control scenarios where the baseline violated. Did Mneme catch real drift?
  2. Baseline drift rate ≥ 50%. Across non-control scenarios. Is there a real problem to catch?
  3. False positive rate ≤ 10%. On control scenarios. Does Mneme over-enforce?
  4. Oracle gap ≤ 10 pp. End-to-end enforcement within 10 percentage points of oracle enforcement. Is retrieval doing its job?

If results land below threshold, the published narrative is honest. "v1.1 measured X. Below the 75 percent target. Here is what we changed and what we will fix." Pre-registration is the point.
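
A sketch of how the four pre-registered thresholds could be checked mechanically. The metric keys are illustrative names, with rates expressed as fractions and the oracle gap in percentage points.

THRESHOLDS = {
    "violation_prevention_rate": ("min", 0.75),
    "baseline_drift_rate":       ("min", 0.50),
    "false_positive_rate":       ("max", 0.10),
    "oracle_gap_pp":             ("max", 10.0),
}

def verdict(metrics: dict) -> dict:
    # All four must pass; the bounds are pre-committed and never adjusted at decision time.
    checks = {
        name: (metrics[name] >= bound if kind == "min" else metrics[name] <= bound)
        for name, (kind, bound) in THRESHOLDS.items()
    }
    checks["overall_pass"] = all(checks.values())
    return checks

# Example: strong prevention but over-enforcement fails on the false positive gate alone.
print(verdict({
    "violation_prevention_rate": 0.82,
    "baseline_drift_rate": 0.61,
    "false_positive_rate": 0.15,
    "oracle_gap_pp": 6.0,
}))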

09 — Metrics

Headline metrics

  1. Violation prevention rate (Layer 2 outcome)
  2. Baseline drift rate (problem severity)
  3. False positive rate (overreach)

Diagnostic metrics

  1. Retrieval recall at 5 (Layer 1)
  2. Retrieval precision at 5 (Layer 1)
  3. Irrelevant injection rate (Layer 1)
  4. End-to-end vs oracle enforcement gap (Layer split)
  5. Ambiguous scenario escalation rate
  6. Determinism variance

Reported but non-graded

  1. Mean tokens per request, governed vs baseline
  2. Mean latency, governed vs baseline
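
To make the Layer 1 diagnostics concrete, here is a minimal sketch of recall at 5 and precision at 5, assuming each scenario carries a set of ground-truth rule IDs and Mneme's retrieval returns a ranked list of rule IDs. The IDs below are invented.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the scenario's ground-truth rules that appear in the top-k retrieval.
    if not relevant:
        return 1.0  # control scenarios: nothing to retrieve, nothing missed
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the top-k retrieved rules that are relevant; irrelevant injection is the complement.
    top_k = retrieved[:k]
    if not top_k:
        return 1.0
    return sum(1 for rule in top_k if rule in relevant) / len(top_k)

# Hypothetical scenario: two ground-truth rules, one of them retrieved in the top 5.
print(recall_at_k(["R-12", "R-03", "R-44"], {"R-12", "R-27"}))     # 0.5
print(precision_at_k(["R-12", "R-03", "R-44"], {"R-12", "R-27"}))  # ≈ 0.33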

10 — Run protocol

  1. Each scenario runs in three iterations to measure determinism.
  2. Two variants per iteration: baseline (no Mneme) and governed (Mneme end-to-end).
  3. One additional pass per scenario for oracle enforcement (ground-truth rules injected, retrieval bypassed).
  4. Total runs per scenario: 7 (3 iterations × 2 variants, plus 1 oracle pass). Total runs for the full suite at 36 scenarios: 252.
  5. Same model, same temperature, same context window across baseline and governed.
  6. Codebase context loaded from versioned fixture; fixture hashes recorded with results.
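
A sketch of the resulting run matrix under this protocol; scenario IDs and variant names are illustrative.

ITERATIONS = 3
VARIANTS = ("baseline", "governed")  # compared head to head on every iteration

def run_plan(scenario_ids: list[str]) -> list[tuple[str, str, int]]:
    # Enumerate every (scenario, variant, iteration), plus one oracle pass per scenario.
    runs = [
        (sid, variant, i)
        for sid in scenario_ids
        for i in range(1, ITERATIONS + 1)
        for variant in VARIANTS
    ]
    runs += [(sid, "oracle", 1) for sid in scenario_ids]  # retrieval bypassed, ground-truth rules injected
    return runs

plan = run_plan([f"S{n:02d}" for n in range(1, 37)])
print(len(plan))  # 36 scenarios × (3 × 2 + 1) = 252 runs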

11 — What this proves and does not prove

Does not prove

Does prove (if thresholds hit)

Under controlled conditions on a 36-scenario suite covering six categories with 28 percent real-incident sourcing and 22 percent controls, Mneme reduces architectural rule violations by at least 75 percent relative to the same model without Mneme, while triggering false positives no more than 10 percent of the time on unrelated changes.

Narrow claim. Pre-registered thresholds. Reproducible.

12 — Reproducibility

Published artifacts:

  1. Full scenario suite as YAML.
  2. Verification functions as Python.
  3. Harness runner as Python.
  4. Exact model versions tested (vendor, model name, snapshot date).
  5. Exact Mneme version and commit hash.
  6. Codebase fixture archive with content hashes.
  7. Raw structured outputs of every run.
  8. Aggregated results with full metric breakdown.
  9. This methodology document, version-tagged.

Anyone clones the repository, pins the same model version, runs the harness, and confirms the numbers within determinism variance. If they cannot, the published numbers are wrong.
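
One way a reproduction could be compared against the published aggregates, assuming a per-metric tolerance derived from the published determinism variance; all names and numbers here are illustrative.

def confirms(published: dict, reproduced: dict, tolerance: dict) -> bool:
    # A reproduction confirms the results if every metric lands within its determinism-variance tolerance.
    return all(
        abs(published[name] - reproduced[name]) <= tolerance.get(name, 0.0)
        for name in published
    )

# Illustrative values only.
print(confirms(
    published={"violation_prevention_rate": 0.82, "false_positive_rate": 0.08},
    reproduced={"violation_prevention_rate": 0.80, "false_positive_rate": 0.08},
    tolerance={"violation_prevention_rate": 0.03, "false_positive_rate": 0.02},
))  # True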

13 — Versioning

Comparing v1.1 results to v1.2 results is meaningful only on the scenario subset that exists unchanged in both. Publications make this explicit.

14 — Limitations

Honest limitations published alongside results:

15 — Changelog from v1.0 to v1.1

Change | Reason
Suite size 30 to 36 | Add 8 control + 4 ambiguous scenarios
Single layer to two layers | Conflating retrieval and enforcement hid real failure modes
Prose verification to structured JSON | Closed verifier-gaming attack
5 categories to 6 | Real governance is not always binary
Synthetic-only to mixed sourcing | Synthetic-only is not credible externally
3 thresholds to 4 | Oracle gap prevents inflated end-to-end numbers
3 iterations + 1 oracle pass | Required for retrieval and enforcement separation

Methodology before metrics.

Full benchmark results publish after scenario suite validation. Methodology and harness are public now.