Mneme's benchmark methodology is deterministic by design. It uses canned LLM responses, fixed retrieval, and rule-text matching so that every PASS is reproducible and every regression is visible. The benchmark is a regression and integrity instrument, not a generalization claim. recall@1 is reported but never optimized. Methodology and shipped suite are anchored to the Layer 1 freeze at commit e73ff7d.
Methodology Philosophy

Why the benchmark is intentionally constrained.

Most AI infrastructure benchmarks compete on score inflation. This page exists to be honest about what Mneme's benchmark is and is not. The discipline is the differentiator.

Philosophy

LLMs are nondeterministic. A benchmark whose numbers vary between runs cannot detect a regression in the part of the system that Mneme actually owns: retrieval and enforcement against governed decisions. So Mneme's benchmark deliberately collapses the nondeterminism. Canned responses replace live model calls. Retrieval is bag-of-tokens with stable sort and explicit tiebreak. Enforcement is rule-text matching against retrieved decisions. The same memory and the same query produce byte-identical retrieval order on every run.
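
As a concrete illustration of what that pipeline can look like, here is a minimal sketch of deterministic bag-of-tokens retrieval with a stable sort and an explicit tiebreak. The function and field names are hypothetical, not Mneme's actual API:

```python
from collections import Counter

def tokenize(text: str) -> Counter:
    # Bag of tokens: lowercase, split on whitespace, count occurrences.
    return Counter(text.lower().split())

def retrieve(query: str, decisions: list[dict], k: int = 5) -> list[dict]:
    """Score every governed decision against the query and return the top-k.

    Hypothetical shape: each decision is a dict with 'id' and 'text'.
    The sort key is (-score, id), so equal scores break ties on the
    explicit decision id. The same memory and the same query therefore
    produce the same order on every run.
    """
    q = tokenize(query)
    scored = []
    for d in decisions:
        overlap = sum((q & tokenize(d["text"])).values())  # shared token count
        scored.append((overlap, d))
    scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return [d for _, d in scored[:k]]
```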

The job of the benchmark is not to prove that Mneme is smart. It is to prove that every change to retrieval or enforcement is visible and reproducible — that a regression cannot land silently, that a PASS cannot be coincidence, and that the numbers reported externally cannot drift away from what the code actually does.

Deterministic governance

Determinism is not a stylistic choice. It is the contract.

These principles compound. Deterministic retrieval makes regressions visible. Visible regressions make benchmark integrity possible. Benchmark integrity makes governance claims defensible. Take any of them out and the chain breaks.

What the benchmark proves

Three things, narrowly:

- That a regression in retrieval or enforcement cannot land silently: any change to ranking, tiebreak, or rule matching shows up as a changed result.
- That a PASS cannot be coincidence: with canned responses and fixed retrieval, a passing scenario passes for the same reason on every run.
- That externally reported numbers cannot drift from what the code actually does: the same memory and query reproduce the same retrieval output byte for byte.
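
One way to make claims like these mechanically checkable (a hypothetical harness, not Mneme's actual test code) is to hash the ordered retrieval output and compare it against a committed baseline, so any drift fails loudly:

```python
import hashlib
import json

def retrieval_digest(results: list[dict]) -> str:
    # Serialize the ordered results deterministically and hash them.
    # Any change in order, ids, or scores changes the digest.
    payload = json.dumps(results, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def assert_no_regression(results: list[dict], baseline_digest: str) -> None:
    # Fails the run if the output has drifted from the committed baseline.
    digest = retrieval_digest(results)
    if digest != baseline_digest:
        raise AssertionError(
            f"retrieval output drifted: {digest} != baseline {baseline_digest}"
        )
```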

What it does not prove

Equally important. The benchmark deliberately does not claim:

- That the results generalize beyond the shipped suite: seven scenarios over an eleven-decision pool is a regression fixture, not a sample of the world.
- That Mneme is smart, or that any model behind it is: canned responses remove model quality from the measurement entirely.
- That the reported metrics are quality signals: recall@1 is a diagnostic, and precision@K and the irrelevant-injection rate are structurally pinned by the suite.

Why the benchmark is deterministic

Three concrete mechanisms:

- Canned LLM responses replace live model calls, so model nondeterminism never enters the measurement.
- Retrieval is bag-of-tokens scoring with a stable sort and an explicit tiebreak, so the same memory and the same query produce byte-identical retrieval order on every run.
- Enforcement is rule-text matching against the retrieved decisions, so a verdict depends only on the text in front of it.

Together, these make the benchmark a true regression instrument. If a number changes between runs, the cause is in the code, not the model.
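
The enforcement half can be sketched the same way. Assuming hypothetical field names, rule-text matching against retrieved decisions reduces to deterministic string checks, with no model call and no similarity threshold:

```python
def enforce(response_text: str, retrieved_decisions: list[dict]) -> dict:
    """Rule-text matching sketch (hypothetical field names, not Mneme's code).

    For each retrieved decision, check whether its rule text appears in the
    canned response. Plain string matching keeps the verdict deterministic:
    the same response and the same retrieved set always yield the same result.
    """
    matched, missed = [], []
    for d in retrieved_decisions:
        if d["rule_text"].lower() in response_text.lower():
            matched.append(d["id"])
        else:
            missed.append(d["id"])
    return {"matched": matched, "missed": missed, "pass": not missed}
```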

Why recall@1 is reported but not optimized

This is the part that matters. recall@1 is the sharpest tuning dial under fixed methodology. Promoting it into the suite headline would make tuning weights to move recall@1 a continuous temptation. With seven scenarios and an eleven-decision pool, any further tuning would fit the suite, not the world.

So recall@1 is tracked and reported transparently — and explicitly excluded from pass/fail and any external scorecard. It is anti-overfitting discipline made visible.
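
For reference, under this methodology recall@1 reduces to one deterministic check per scenario. A sketch with hypothetical field names:

```python
def recall_at_1(scenarios: list[dict]) -> float:
    """Fraction of scenarios whose top-ranked retrieved decision is expected.

    Hypothetical shape: each scenario carries 'retrieved' (ordered decision
    ids) and 'expected_decision_ids'. Because retrieval order is byte-identical
    across runs, this number can only move when the code changes.
    """
    hits = sum(
        1 for s in scenarios
        if s["retrieved"] and s["retrieved"][0] in s["expected_decision_ids"]
    )
    return hits / len(scenarios)
```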

Most benchmark cultures pull in the opposite direction. A score is published; the score is optimized; over time the methodology bends toward whatever moves the score. Mneme's freeze refuses that gravity by writing the refusal into the methodology itself: recall@1 is a diagnostic, not a target.

The same logic applies to precision@K and the irrelevant-injection rate. In the current shipped suite both are structurally pinned by the small expected-decision sets and the empty acceptable_decision_ids arrays. The freeze records that openly rather than letting the numbers float as if they were quality signals. They are placeholder telemetry for a future methodology, not evidence of governance quality today.
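
A sketch of those two metrics, again with hypothetical shapes, makes the pinning visible: with acceptable_decision_ids empty and expected sets of only a few decisions, the values are set by the suite's structure rather than by retrieval quality.

```python
def precision_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    # Share of the top-k retrieved decisions that are in the expected set.
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in expected) / k

def irrelevant_injection_rate(
    retrieved: list[str], expected: set[str], acceptable: set[str], k: int
) -> float:
    # Share of the top-k that are neither expected nor acceptable.
    # With acceptable empty and expected sets very small, these values are
    # largely fixed by suite composition, not by retrieval quality.
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d not in expected and d not in acceptable) / k
```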

What Mneme intentionally does not solve

A wedge is only meaningful if its boundary is explicit. The following are not on Mneme's roadmap. They are not "later" — they are not Mneme:

If a feature request maps onto this list, the answer is no — not "later," not "out of scope for now," but not Mneme.

Layer 1 framing

Mneme is currently in Layer 1 — local-repo, single-developer, project-scoped architectural governance. The mechanism is frozen at e73ff7d. The validation phase tests the mechanism in real repos with real users; it does not extend the mechanism. Layer 2 territory (multi-repo, team sync, org policy distribution) opens only after Layer 1 exit criteria are met.

This page is a public summary of the philosophy. The full architectural freeze artifact lives in the repo at docs/architecture/layer1-freeze-e73ff7d.md. The full methodology specification with metric definitions, suite composition, verdict thresholds, and reproducibility protocol lives at /benchmark/.

What this means for evaluators

If you are evaluating Mneme as a design partner, technical buyer, or contributor, the artifacts to look at are:

- The Layer 1 freeze artifact at docs/architecture/layer1-freeze-e73ff7d.md, which records shipped capabilities, charter discipline, deferred work, and the line between "deferred" and "not Mneme."
- The benchmark methodology at /benchmark/, which carries the metric definitions, suite composition, verdict thresholds, and reproducibility protocol.

Mneme is not competing on eval-score inflation, model-performance leaderboards, or coding-benchmark culture. It competes on architectural continuity, governance reliability, reproducibility, and deterministic enforcement. The benchmark methodology exists to make those claims falsifiable.

Read the full freeze artifact.

The repo-level freeze captures shipped capabilities, charter discipline, deferred work, and the bright line between "deferred" and "not Mneme."