01 — Purpose
The Mneme Governance Benchmark validates whether Mneme reduces architectural rule violations in AI-generated code. It answers one question:
When an AI coding agent is asked to make a change in a codebase with known architectural constraints, does the agent with Mneme produce code that respects those constraints more often than the same agent without Mneme, without over-enforcing on changes where the constraints do not apply?
The "without over-enforcing" clause is doing real work. A governance system that catches violations by refusing every change is not helpful, and the original v1.0 draft did not measure this rigorously enough. v1.1 corrects that.
02 — Scope
In scope. Pre-generation governance. The benchmark measures how often the model's generated output respects a given architectural rule when Mneme injects that rule versus when it does not, and it measures separately whether Mneme retrieved the right rule in the first place.
Out of scope. Post-generation observability, runtime enforcement, security scanning, performance optimization, and code style preferences. Those are separate measurements with separate tooling.
03 — Layered measurement
v1.0 measured a single end-to-end outcome. v1.1 separates this into two layers because conflating them hides real product issues.
Layer 1: retrieval
Did Mneme surface the right rules for this prompt? Each scenario specifies which rules from the project's decision store should be retrieved. Mneme's retrieval is run independently and scored against ground truth.
- Recall at K. Of the rules that should have been retrieved, what fraction made it into the top K. K is typically 5.
- Precision at K. Of the rules retrieved into the top K, what fraction were actually relevant.
- Irrelevant injection rate. Fraction of governed runs where Mneme injected at least one rule the scenario marks as irrelevant.
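These computations are simple once each scenario declares its ground-truth rule IDs. A minimal sketch, with illustrative names (retrieved, relevant, marked_irrelevant) rather than the harness's actual API:

```python
# A sketch of the Layer 1 metrics over ordered retrieval results.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Of the rules that should have been retrieved, the fraction in the top K.
    if not relevant:
        return 1.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Of the rules retrieved into the top K, the fraction actually relevant.
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(rule in relevant for rule in top_k) / len(top_k)

def injected_irrelevant(injected: list[str], marked_irrelevant: set[str]) -> bool:
    # A governed run counts toward the irrelevant injection rate if any
    # injected rule is one the scenario explicitly marks as irrelevant.
    return any(rule in marked_irrelevant for rule in injected)
```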
Layer 2: enforcement
Given that the right rule was retrieved, did the governed output respect it? Two sub-measurements:
- End-to-end enforcement. Run the governed pipeline end-to-end. Score the output.
- Oracle enforcement. Bypass retrieval. Inject ground-truth rules directly. Score the output.
The delta between end-to-end and oracle enforcement attributes failures to retrieval versus injection-and-judgment.
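Concretely, each governed failure can be attributed by comparing the two passes. A sketch with hypothetical labels, not part of the published harness:

```python
def attribute_failure(e2e_respected: bool, oracle_respected: bool) -> str:
    # Hypothetical attribution labels for a single scenario run pair.
    if e2e_respected:
        return "pass"
    if oracle_respected:
        return "retrieval"    # direct injection works, so retrieval missed the rule
    return "enforcement"      # even ground-truth injection failed: injection/judgment
```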
Why this matters. A user who hits a violation needs to know which layer broke. Without the layer split, a benchmark publishes a single number that hides which part of the system failed. With the layer split, retrieval issues and enforcement issues are diagnosed separately and fixed independently.
04 — Structured output protocol
v1.0 verified by parsing freeform model output. This is gameable: a model can write the right words in prose while violating the rule in code. v1.1 closes this attack surface by requiring the model to emit a structured artifact.
Response schema
The system prompt instructs the model to respond with JSON conforming to this schema. Verifiers operate on the parsed artifact, not on prose.
```json
{
  "files_to_modify": [
    { "path": "src/foo.py", "new_content": "..." }
  ],
  "files_to_create": [
    { "path": "src/bar.py", "content": "..." }
  ],
  "files_to_delete": [
    { "path": "src/old.py" }
  ],
  "new_dependencies": [
    { "name": "requests", "version_constraint": ">=2.31" }
  ],
  "rationale": "Free text. Not used for scoring.",
  "refused": false,
  "refusal_reason": null,
  "implementation_plan": [
    "Step 1: ...",
    "Step 2: ..."
  ]
}
```
Verifiers never read rationale or implementation_plan for scoring. Those fields exist for human auditing of borderline cases. The single exception is the verifier for ambiguous scenarios, which *is* allowed to inspect rationale, because the question being asked is whether the model surfaced ambiguity at all.
If the model refuses, it must set refused: true and provide refusal_reason. A refusal is a candidate respected outcome only if the rule actually applied; on a control scenario, a refusal is a false positive.
Why this closes gaming
A model cannot earn a respected verdict by writing the right words. It must produce code that, when parsed, contains the right files, the right imports, and no forbidden patterns. The verifier inspects facts about the artifact, not claims about it.
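For illustration, a minimal sketch of one such verifier checking a forbidden-import rule against the parsed artifact. The rule shape, the helper name, and the assumption that fixture sources are Python are all illustrative; the published verification functions may differ:

```python
import ast
import json

def verify_no_forbidden_imports(artifact_json: str, forbidden: set[str]) -> str:
    # Operates on facts about the artifact, never on prose claims.
    artifact = json.loads(artifact_json)
    if artifact.get("refused"):
        # Per the refusal rule above: a candidate respected outcome only
        # if the rule applied; a false positive on a control scenario.
        return "refused"
    files = artifact.get("files_to_modify", []) + artifact.get("files_to_create", [])
    for f in files:
        source = f.get("new_content") or f.get("content") or ""
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules = {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom):
                modules = {(node.module or "").split(".")[0]}
            else:
                continue
            if modules & forbidden:
                return "violated"
    # A banned library can also arrive via declared dependencies.
    for dep in artifact.get("new_dependencies", []):
        if dep.get("name") in forbidden:
            return "violated"
    return "respected"
```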
05 — Suite composition
36 scenarios across six categories. Controls and ambiguous cases are first-class.
| Category | Count | Purpose |
|---|---|---|
| Architectural violations | 8 | Core architectural rules: extend before rebuild, no parallel systems, layering |
| Scope and boundary | 6 | Module boundaries, API contracts, public surface stability |
| Anti-pattern violations | 6 | Singletons, god objects, hidden globals, banned approaches |
| Dependency and tooling | 4 | Forbidden imports, banned libraries, lockfile changes |
| Ambiguous / borderline | 4 | Multiple rules apply, partial conflicts, unclear scope |
| Control / non-applicable | 8 | Rules exist, prompts unrelated, correct behavior is no intervention |
Controls are 22 percent of the suite. Anything below 20 percent makes the false positive rate non-credible.
06 — Source mix
A synthetic-only suite is not credible. v1.1 requires real-incident scenarios.
| Source | Count | Notes |
|---|---|---|
| Synthetic canonical | 12 | Constructed to test specific rules cleanly |
| Real Mneme repo drift | 6 | Drawn from prompts that produced drift in development |
| Real CannabisDealsUS drift | 4 | Same sourcing method as above, different domain |
| Adversarial / edge | 6 | Prompts crafted to look innocent but hit a rule |
| Controls | 8 | Rule exists, prompt unrelated |
Real-incident scenarios are 28 percent of the suite, satisfying the credibility floor.
07 — Difficulty calibration
- Easy (30 percent of non-controls). Baseline LLM violates 20 to 40 percent of the time. Tests that Mneme does not regress easy cases.
- Medium (50 percent of non-controls). Baseline violates 40 to 70 percent. The meaningful scenarios.
- Hard (20 percent of non-controls). Baseline violates above 70 percent. Where Mneme provides the most lift.
Calibration is verified empirically by running the baseline against the suite before publication. If the actual baseline violation rate on a "medium" scenario is 90 percent, it is reclassified as hard. The labels describe observed difficulty, not assumed difficulty.
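As a sketch, reclassification amounts to binning the observed baseline violation rate. The bins come from the list above; the handling of rates below 20 percent is an assumption here, not something the methodology specifies:

```python
def difficulty_label(observed_baseline_violation_rate: float) -> str:
    # Labels describe observed difficulty, assigned after the baseline run.
    if observed_baseline_violation_rate > 0.70:
        return "hard"
    if observed_baseline_violation_rate >= 0.40:
        return "medium"
    return "easy"  # includes rates below 0.20 (assumption)
```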
08 — Verdict thresholds
All four thresholds must be met. They are pre-committed and will not move at decision time.
If results land below a threshold, the published narrative stays honest: "v1.1 measured X. Below the 75 percent target. Here is what we changed and what we will fix." Pre-registration is the point.
09 — Metrics
Headline metrics
- Violation prevention rate (Layer 2 outcome)
- Baseline drift rate (problem severity)
- False positive rate (overreach)
Diagnostic metrics
- Retrieval recall at 5 (Layer 1)
- Retrieval precision at 5 (Layer 1)
- Irrelevant injection rate (Layer 1)
- End-to-end vs oracle enforcement gap (Layer split)
- Ambiguous scenario escalation rate
- Determinism variance
Reported but non-graded
- Mean tokens per request, governed vs baseline
- Mean latency, governed vs baseline
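A hedged sketch of the two rate-based headline metrics. The relative-reduction reading of prevention matches Section 11 ("reduces architectural rule violations by at least 75 percent relative to the same model without Mneme"); the exact formulas ship with the harness:

```python
def violation_prevention_rate(baseline_rate: float, governed_rate: float) -> float:
    # baseline_rate is itself the second headline metric: baseline drift rate.
    if baseline_rate == 0:
        return 0.0  # nothing to prevent; this treatment is an assumption
    return 1 - governed_rate / baseline_rate

def false_positive_rate(interventions_on_controls: int, control_runs: int) -> float:
    # Overreach: governance intervening (refusing, or making rule-driven
    # changes) on a control scenario where no rule applies.
    return interventions_on_controls / control_runs
```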
10 — Run protocol
- Each scenario runs in three iterations to measure determinism.
- Two variants per iteration: baseline (no Mneme) and governed (Mneme end-to-end).
- One additional oracle enforcement pass per iteration (ground-truth rules injected, retrieval bypassed).
- Total runs per scenario: 9. Total runs for the full suite at 36 scenarios: 324.
- Same model, same temperature, same context window across baseline and governed.
- Codebase context loaded from versioned fixture; fixture hashes recorded with results.
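The resulting run matrix, sketched with placeholder scenario IDs and illustrative variant names:

```python
from itertools import product

SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 37)]  # 36 scenarios
VARIANTS = ("baseline", "governed", "oracle")            # per iteration
ITERATIONS = (1, 2, 3)                                   # determinism measurement

runs = [
    {"scenario": s, "variant": v, "iteration": i}
    for s, v, i in product(SCENARIOS, VARIANTS, ITERATIONS)
]
assert len(runs) == 324  # 36 scenarios x 9 runs each
```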
11 — What this proves and does not prove
Does not prove
- Production incident reduction. (Operational metric, design partner work.)
- Developer velocity gains. (Operational metric, design partner work.)
- Generalization to codebases outside the suite. (Phase 2.)
- Suite representativeness across all real-world architectural rules. (Suite expansion ongoing in v1.2 and beyond.)
Does prove (if thresholds hit)
Under controlled conditions on a 36-scenario suite covering six categories with 28 percent real-incident sourcing and 22 percent controls, Mneme reduces architectural rule violations by at least 75 percent relative to the same model without Mneme, while triggering false positives no more than 10 percent of the time on unrelated changes.
Narrow claim. Pre-registered thresholds. Reproducible.
12 — Reproducibility
Published artifacts:
- Full scenario suite as YAML.
- Verification functions as Python.
- Harness runner as Python.
- Exact model versions tested (vendor, model name, snapshot date).
- Exact Mneme version and commit hash.
- Codebase fixture archive with content hashes.
- Raw structured outputs of every run.
- Aggregated results with full metric breakdown.
- This methodology document, version-tagged.
Anyone clones the repository, pins the same model version, runs the harness, and confirms the numbers within determinism variance. If they cannot, the published numbers are wrong.
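As one example of the check a reproduction run performs first, a minimal sketch of fixture hashing, assuming the archive is unpacked to a directory and the recorded hashes map relative paths to sha256 hex digests:

```python
import hashlib
from pathlib import Path

def fixture_hashes(root: Path) -> dict[str, str]:
    # Content hash of every file in the unpacked fixture archive.
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Compare fixture_hashes(unpacked_archive) against the published hash
# manifest before trusting any metric comparison.
```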
13 — Versioning
- v1.0. Working draft. Superseded by v1.1.
- v1.1. Methodology hardening. Anti-gaming protocol, controls, ambiguous cases, retrieval and enforcement split. Publishable.
- v1.2 and beyond. Suite expansion. Public claims always cite the version measured.
Comparing v1.1 results to v1.2 results is meaningful only on the scenario subset that exists unchanged in both. Publications make this explicit.
14 — Limitations
Honest limitations published alongside results:
- Synthetic codebases are smaller than production codebases. May overstate baseline ability to track context.
- Real-incident scenarios are drawn from two codebases owned by the author. Domain coverage is limited.
- One model under test in v1.1. Cross-model generalization is a v1.2 question.
- Single-rule scenarios dominate. Multi-rule conflict is limited to four ambiguous cases.
- The author also built Mneme. This is a self-evaluation. Phase 2 design partner validation provides external evidence.
15 — Changelog from v1.0 to v1.1
| Change | Reason |
|---|---|
| Suite size 30 to 36 | Add 8 control + 4 ambiguous scenarios |
| Single layer to two layers | Conflating retrieval and enforcement hid real failure modes |
| Prose verification to structured JSON | Closed verifier-gaming attack |
| 5 categories to 6 | Real governance is not always binary |
| Synthetic-only to mixed sourcing | Synthetic-only is not credible externally |
| 3 thresholds to 4 | Oracle gap prevents inflated end-to-end numbers |
| 3 iterations + oracle pass per iteration | Required for the retrieval and enforcement separation |