01 — Purpose
The Mneme Governance Benchmark validates whether Mneme reduces architectural rule violations in AI-generated code. It answers one question:
When an AI coding agent is asked to make a change in a codebase with known architectural constraints, does the agent with Mneme produce code that respects those constraints more often than the same agent without Mneme, without over-enforcing on changes where the constraints do not apply?
The "without over-enforcing" clause is doing real work. A governance system that catches violations by refusing every change is not helpful, and the original v1.0 draft did not measure this rigorously enough. v1.1 corrects that.
02 — Scope
In scope. Pre-generation governance. The benchmark measures how often the model's generated output respects a given architectural rule when Mneme injects that rule versus when it does not, and it measures separately whether Mneme retrieved the right rule in the first place.
Out of scope. Post-generation observability, runtime enforcement, security scanning, performance optimization, and code style preferences. Those are separate measurements with separate tooling.
03 — Layered measurement
v1.0 measured a single end-to-end outcome. v1.1 separates this into two layers because conflating them hides real product issues.
Layer 1: retrieval
Did Mneme surface the right rules for this prompt? Each scenario specifies which rules from the project's decision store should be retrieved. Mneme's retrieval is run independently and scored against ground truth.
- Recall at K. Of the rules that should have been retrieved, what fraction made it into the top K. K is typically 5.
- Precision at K. Of the rules retrieved into the top K, what fraction were actually relevant.
- Irrelevant injection rate. Fraction of governed runs where Mneme injected at least one rule the scenario marks as irrelevant.
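These computations are simple once each scenario declares its ground-truth rule IDs. A minimal sketch, with illustrative names (retrieved, relevant, marked_irrelevant) rather than the harness's actual API:

```python
# A sketch of the Layer 1 metrics over ordered retrieval results.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Of the rules that should have been retrieved, the fraction in the top K.
    if not relevant:
        return 1.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Of the rules retrieved into the top K, the fraction actually relevant.
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(rule in relevant for rule in top_k) / len(top_k)

def injected_irrelevant(injected: list[str], marked_irrelevant: set[str]) -> bool:
    # A governed run counts toward the irrelevant injection rate if any
    # injected rule is one the scenario explicitly marks as irrelevant.
    return any(rule in marked_irrelevant for rule in injected)
```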
Layer 2: enforcement
Given that the right rule was retrieved, did the governed output respect it? Two sub-measurements:
- End-to-end enforcement. Run the governed pipeline end-to-end. Score the output.
- Oracle enforcement. Bypass retrieval. Inject ground-truth rules directly. Score the output.
The delta between end-to-end and oracle enforcement attributes failures to retrieval versus injection-and-judgment.
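Concretely, each governed failure can be attributed by comparing the two passes. A sketch with hypothetical labels, not part of the published harness:

```python
def attribute_failure(e2e_respected: bool, oracle_respected: bool) -> str:
    # Hypothetical attribution labels for a single scenario run pair.
    if e2e_respected:
        return "pass"
    if oracle_respected:
        return "retrieval"    # direct injection works, so retrieval missed the rule
    return "enforcement"      # even ground-truth injection failed: injection/judgment
```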
Why this matters. A user who hits a violation needs to know which layer broke. Without the layer split, a benchmark publishes a single number that hides which part of the system failed. With the layer split, retrieval issues and enforcement issues are diagnosed separately and fixed independently.
04 — Structured output protocol
v1.0 verified by parsing freeform model output. This is gameable: a model can write the right words in prose while violating the rule in code. v1.1 closes this attack surface by requiring the model to emit a structured artifact.
Response schema
The system prompt instructs the model to respond with JSON conforming to this schema. Verifiers operate on the parsed artifact, not on prose.
```json
{
  "files_to_modify": [
    { "path": "src/foo.py", "new_content": "..." }
  ],
  "files_to_create": [
    { "path": "src/bar.py", "content": "..." }
  ],
  "files_to_delete": [
    { "path": "src/old.py" }
  ],
  "new_dependencies": [
    { "name": "requests", "version_constraint": ">=2.31" }
  ],
  "rationale": "Free text. Not used for scoring.",
  "refused": false,
  "refusal_reason": null,
  "implementation_plan": [
    "Step 1: ...",
    "Step 2: ..."
  ]
}
```
Verifiers never read rationale or implementation_plan for scoring. Those fields exist for human auditing of borderline cases. The single exception is the verifier for ambiguous scenarios, which *is* allowed to inspect rationale, because the question being asked is whether the model surfaced ambiguity at all.
If the model refuses, it must set refused: true and provide refusal_reason. A refusal is a candidate respected outcome only if the rule actually applied; on a control scenario, a refusal is a false positive.
Why this closes gaming
A model cannot earn a respected verdict by writing the right words. It must produce code that, when parsed, contains the right files, the right imports, and no forbidden patterns. The verifier inspects facts about the artifact, not claims about it.
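For illustration, a minimal sketch of one such verifier checking a forbidden-import rule against the parsed artifact. The rule shape, the helper name, and the assumption that fixture sources are Python are all illustrative; the published verification functions may differ:

```python
import ast
import json

def verify_no_forbidden_imports(artifact_json: str, forbidden: set[str]) -> str:
    # Operates on facts about the artifact, never on prose claims.
    artifact = json.loads(artifact_json)
    if artifact.get("refused"):
        # Per the refusal rule above: a candidate respected outcome only
        # if the rule applied; a false positive on a control scenario.
        return "refused"
    files = artifact.get("files_to_modify", []) + artifact.get("files_to_create", [])
    for f in files:
        source = f.get("new_content") or f.get("content") or ""
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules = {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom):
                modules = {(node.module or "").split(".")[0]}
            else:
                continue
            if modules & forbidden:
                return "violated"
    # A banned library can also arrive via declared dependencies.
    for dep in artifact.get("new_dependencies", []):
        if dep.get("name") in forbidden:
            return "violated"
    return "respected"
```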
05 — Suite composition
36 scenarios across six categories. Controls and ambiguous cases are first-class.
| Category | Count | Purpose |
|---|---|---|
| Architectural violations | 8 | Core architectural rules: extend before rebuild, no parallel systems, layering |
| Scope and boundary | 6 | Module boundaries, API contracts, public surface stability |
| Anti-pattern violations | 6 | Singletons, god objects, hidden globals, banned approaches |
| Dependency and tooling | 4 | Forbidden imports, banned libraries, lockfile changes |
| Ambiguous / borderline | 4 | Multiple rules apply, partial conflicts, unclear scope |
| Control / non-applicable | 8 | Rules exist, prompts unrelated, correct behavior is no intervention |
Controls are 22 percent of the suite. Anything below 20 percent makes the false positive rate non-credible.
06 — Source mix
A synthetic-only suite is not credible. v1.1 requires real-incident scenarios.
| Source | Count | Notes |
|---|---|---|
| Synthetic canonical | 12 | Constructed to test specific rules cleanly |
| Real Mneme repo drift | 6 | Drawn from prompts that produced drift in development |
| Real CannabisDealsUS drift | 4 | Same sourcing method as above, different domain |
| Adversarial / edge | 6 | Prompts crafted to look innocent but hit a rule |
| Controls | 8 | Rule exists, prompt unrelated |
Real-incident scenarios are 28 percent of the suite, satisfying the credibility floor.
07 — Difficulty calibration
- Easy (30 percent of non-controls). Baseline LLM violates 20 to 40 percent of the time. Tests that Mneme does not regress easy cases.
- Medium (50 percent of non-controls). Baseline violates 40 to 70 percent. The meaningful scenarios.
- Hard (20 percent of non-controls). Baseline violates above 70 percent. Where Mneme provides the most lift.
Calibration is verified empirically by running the baseline against the suite before publication. If the actual baseline violation rate on a "medium" scenario is 90 percent, it is reclassified as hard. The labels describe observed difficulty, not assumed difficulty.
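As a sketch, reclassification amounts to binning the observed baseline violation rate. The bins come from the list above; the handling of rates below 20 percent is an assumption here, not something the methodology specifies:

```python
def difficulty_label(observed_baseline_violation_rate: float) -> str:
    # Labels describe observed difficulty, assigned after the baseline run.
    if observed_baseline_violation_rate > 0.70:
        return "hard"
    if observed_baseline_violation_rate >= 0.40:
        return "medium"
    return "easy"  # includes rates below 0.20 (assumption)
```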
08 — Verdict thresholds
All four thresholds must be met. They are pre-committed and will not move at decision time.
If results land below a threshold, the published narrative stays honest: "v1.1 measured X. Below the 75 percent target. Here is what we changed and what we will fix." Pre-registration is the point.
09 — Metrics
Headline metrics
- Violation prevention rate (Layer 2 outcome)
- Baseline drift rate (problem severity)
- False positive rate (overreach)
Diagnostic metrics
- Retrieval recall at 5 (Layer 1)
- Retrieval precision at 5 (Layer 1)
- Irrelevant injection rate (Layer 1)
- End-to-end vs oracle enforcement gap (Layer split)
- Ambiguous scenario escalation rate
- Determinism variance
Reported but non-graded
- Mean tokens per request, governed vs baseline
- Mean latency, governed vs baseline
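A hedged sketch of the two rate-based headline metrics. The relative-reduction reading of prevention matches Section 11 ("reduces architectural rule violations by at least 75 percent relative to the same model without Mneme"); the exact formulas ship with the harness:

```python
def violation_prevention_rate(baseline_rate: float, governed_rate: float) -> float:
    # baseline_rate is itself the second headline metric: baseline drift rate.
    if baseline_rate == 0:
        return 0.0  # nothing to prevent; this treatment is an assumption
    return 1 - governed_rate / baseline_rate

def false_positive_rate(interventions_on_controls: int, control_runs: int) -> float:
    # Overreach: governance intervening (refusing, or making rule-driven
    # changes) on a control scenario where no rule applies.
    return interventions_on_controls / control_runs
```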
10 — Run protocol
- Each scenario runs in three iterations to measure determinism.
- Two variants per iteration: baseline (no Mneme) and governed (Mneme end-to-end).
- One additional oracle enforcement pass per iteration (ground-truth rules injected, retrieval bypassed).
- Total runs per scenario: 9. Total runs for the full suite at 36 scenarios: 324.
- Same model, same temperature, same context window across baseline and governed.
- Codebase context loaded from versioned fixture; fixture hashes recorded with results.
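The resulting run matrix, sketched with placeholder scenario IDs and illustrative variant names:

```python
from itertools import product

SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 37)]  # 36 scenarios
VARIANTS = ("baseline", "governed", "oracle")            # per iteration
ITERATIONS = (1, 2, 3)                                   # determinism measurement

runs = [
    {"scenario": s, "variant": v, "iteration": i}
    for s, v, i in product(SCENARIOS, VARIANTS, ITERATIONS)
]
assert len(runs) == 324  # 36 scenarios x 9 runs each
```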
11 — What this proves and does not prove
Does not prove
- Production incident reduction. (Operational metric, design partner work.)
- Developer velocity gains. (Operational metric, design partner work.)
- Generalization to codebases outside the suite. (Phase 2.)
- Suite representativeness across all real-world architectural rules. (Suite expansion ongoing in v1.2 and beyond.)
Does prove (if thresholds hit)
Under controlled conditions on a 36-scenario suite covering six categories with 28 percent real-incident sourcing and 22 percent controls, Mneme reduces architectural rule violations by at least 75 percent relative to the same model without Mneme, while triggering false positives no more than 10 percent of the time on unrelated changes.
Narrow claim. Pre-registered thresholds. Reproducible.
12 — Reproducibility
Published artifacts:
- Full scenario suite as YAML.
- Verification functions as Python.
- Harness runner as Python.
- Exact model versions tested (vendor, model name, snapshot date).
- Exact Mneme version and commit hash.
- Codebase fixture archive with content hashes.
- Raw structured outputs of every run.
- Aggregated results with full metric breakdown.
- This methodology document, version-tagged.
Anyone clones the repository, pins the same model version, runs the harness, and confirms the numbers within determinism variance. If they cannot, the published numbers are wrong.
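As one example of the check a reproduction run performs first, a minimal sketch of fixture hashing, assuming the archive is unpacked to a directory and the recorded hashes map relative paths to sha256 hex digests:

```python
import hashlib
from pathlib import Path

def fixture_hashes(root: Path) -> dict[str, str]:
    # Content hash of every file in the unpacked fixture archive.
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Compare fixture_hashes(unpacked_archive) against the published hash
# manifest before trusting any metric comparison.
```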
13 — Versioning
- v1.0. Working draft. Superseded by v1.1.
- v1.1. Methodology hardening. Anti-gaming protocol, controls, ambiguous cases, retrieval and enforcement split. Publishable.
- v1.2 and beyond. Suite expansion. Public claims always cite the version measured.
Comparing v1.1 results to v1.2 results is meaningful only on the scenario subset that exists unchanged in both. Publications make this explicit.
14 — Limitations
Honest limitations published alongside results:
- Synthetic codebases are smaller than production codebases. May overstate baseline ability to track context.
- Real-incident scenarios are drawn from two codebases owned by the author. Domain coverage is limited.
- One model under test in v1.1. Cross-model generalization is a v1.2 question.
- Single-rule scenarios dominate. Multi-rule conflict is limited to four ambiguous cases.
- The author also built Mneme. This is a self-evaluation. Phase 2 design partner validation provides external evidence.
15 — Changelog from v1.0 to v1.1
| Change | Reason |
|---|---|
| Suite size 30 to 36 | Add 8 control + 4 ambiguous scenarios |
| Single layer to two layers | Conflating retrieval and enforcement hid real failure modes |
| Prose verification to structured JSON | Closed verifier-gaming attack |
| 5 categories to 6 | Real governance is not always binary |
| Synthetic-only to mixed sourcing | Synthetic-only is not credible externally |
| 3 thresholds to 4 | Oracle gap prevents inflated end-to-end numbers |
| 3 iterations + oracle pass per iteration | Required for the retrieval and enforcement separation |