Philosophy
LLMs are nondeterministic. A benchmark whose numbers vary between runs cannot detect a regression in the part of the system that Mneme actually owns: retrieval and enforcement against governed decisions. So Mneme's benchmark deliberately collapses the nondeterminism. Canned responses replace live model calls. Retrieval is bag-of-tokens with stable sort and explicit tiebreak. Enforcement is rule-text matching against retrieved decisions. The same memory and the same query produce byte-identical retrieval order on every run.
The job of the benchmark is not to prove that Mneme is smart. It is to prove that every change to retrieval or enforcement is visible and reproducible — that a regression cannot land silently, that a PASS cannot be coincidence, and that the numbers reported externally cannot drift away from what the code actually does.
Deterministic governance
Determinism is not a stylistic choice. It is the contract.
- Same inputs, same outputs. Memory plus query produce identical retrieval order, byte-for-byte, on every run.
- No vector magic. The retriever is keyword-overlap with documented field weights. There are no embeddings in Layer 1.
- No auto-learning. Mneme does not adjust weights or update its own configuration based on what it observes.
- No passive ingestion. Memory is edited deliberately and reviewed under the [memory] PR convention. Mneme does not watch the repo or learn from commits.
- Explicit recorded decisions only. Every enforced rule is a Decision in project_memory.json with an explicit id. No implicit policy.
These principles compound. Deterministic retrieval makes regressions visible. Visible regressions make benchmark integrity possible. Benchmark integrity makes governance claims defensible. Take any of them out and the chain breaks.
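For concreteness, here is a minimal sketch of what a recorded decision could look like. Only the explicit id requirement and the project_memory.json location come from the principles above; every other field name is an illustrative assumption, not Mneme's shipped schema.

```python
# Hypothetical Decision record as it might appear in project_memory.json.
# Only the explicit "id" comes from the principles above; the remaining
# field names are illustrative assumptions, not Mneme's shipped schema.
decision = {
    "id": "D-007",  # explicit id: no implicit policy
    "title": "Route handlers must not call the ORM directly",
    "rule_text": "Data access goes through the repository layer; "
                 "route handlers must not call the ORM directly.",
    "anti_pattern_terms": ["session.query", "db.execute"],
    "severity": "FAIL",  # FAIL blocks, WARN annotates
}
```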
What the benchmark proves
Three things, narrowly:
- Deterministic retrieval. Given a fixed memory and a fixed query, the same decisions surface in the same order every run, and the same decisions reach the enforcer.
- Governance enforcement. When a recorded decision is retrieved into the top-K, the enforcer detects violations of that decision in candidate output. Every PASS records which rule fired, which term triggered it, and which decision was responsible.
- Reproducible decision continuity. A reviewer can re-run the suite locally and reconstruct any verdict from the recorded artifacts.
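As a sketch of what "reconstruct any verdict" could mean in practice, a recorded PASS would need to carry at least the three traceability fields named above. The shape below is an assumption for illustration, not the shipped artifact format.

```python
# Hypothetical verdict artifact. The three traceability fields (which
# rule fired, which term triggered it, which decision was responsible)
# are named in the text above; the structure itself is an assumption.
verdict = {
    "scenario": "orm-in-route-handler",
    "verdict": "PASS",
    "decision_id": "D-007",
    "rule_fired": "Data access goes through the repository layer; ...",
    "triggered_term": "session.query",
    "retrieved_order": ["D-007", "D-002", "D-011"],  # top-K, stable order
}
```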
What it does not prove
Equally important. The benchmark deliberately does not claim:
- General intelligence. Mneme is not a reasoning system. The benchmark does not measure model quality; it measures whether recorded decisions are honored.
- Autonomous reasoning. No agent loops, no tool-use orchestration, no multi-step planning. Out of scope.
- Production readiness at enterprise scale. The shipped suite is small (seven scenarios at the Layer 1 freeze), and the decision pool is deliberately constrained. The benchmark is a regression instrument, not a distribution claim.
- Enterprise governance. Multi-team, multi-repo, and org-wide policy distribution are Layer 2 territory. Not measured here.
- Generalization to arbitrary codebases. Real-world coverage is the work of design-partner validation, not the benchmark.
Why the benchmark is deterministic
Three concrete mechanisms:
- Canned responses. Each scenario ships a with_mneme and a without_mneme output (text and structured JSON). The benchmark does not call a live model, so run-to-run model variance cannot leak into verdicts.
- Stable retrieval. The retriever uses Python's stable sort with an insertion-order tiebreak, pinned by a regression test. Tied scores resolve identically on every run.
- Reproducible enforcement. The enforcer matches anti-pattern terms against output by word boundary. Severity (FAIL vs. WARN) is a pure function of the rule text and the input. No probabilistic scoring.
Together, these make the benchmark a true regression instrument. If a number changes between runs, the cause is in the code, not the model.
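Here is a minimal sketch of the second and third mechanisms, assuming Python 3.10+. The function names are invented for illustration, and the scorer weighs a single field for brevity; the shipped retriever documents its actual field weights.

```python
import re

def retrieve(memory: list[dict], query: str, k: int = 3) -> list[dict]:
    """Bag-of-tokens overlap with a stable sort. Python's sort is stable,
    so tied scores keep their insertion order in memory: the explicit
    tiebreak described above."""
    q = set(query.lower().split())
    scored = [(len(q & set(d["rule_text"].lower().split())), d) for d in memory]
    scored.sort(key=lambda pair: -pair[0])  # stability resolves ties
    return [d for _, d in scored[:k]]

def enforce(decision: dict, output: str) -> str | None:
    """Word-boundary matching of anti-pattern terms against candidate
    output; returns the term that fired, or None. Severity is a pure
    function of the rule text, so no probabilistic scoring appears here."""
    for term in decision["anti_pattern_terms"]:
        if re.search(rf"\b{re.escape(term)}\b", output):
            return term
    return None
```

Run twice on the same memory and query, `retrieve` returns the same list in the same order; there is no randomness to seed and no model to drift.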
Why recall@1 is reported but not optimized
This is the part that matters. recall@1 is the sharpest tuning dial under a fixed methodology. Promoting it to the suite's headline number would make tuning retrieval weights to move recall@1 a standing temptation. With seven scenarios and an eleven-decision pool, any further tuning would fit the suite, not the world.
So recall@1 is tracked and reported transparently — and explicitly excluded from pass/fail and any external scorecard. It is anti-overfitting discipline made visible.
Most benchmark cultures pull in the opposite direction. A score is published; the score is optimized; over time the methodology bends toward whatever moves the score. Mneme's freeze refuses that gravity by writing the refusal into the methodology itself: recall@1 is a diagnostic, not a target.
The same logic applies to precision@K and the irrelevant-injection rate. In the current shipped suite both are structurally pinned by the small expected-decision sets and the empty acceptable_decision_ids arrays. The freeze records that openly rather than letting the numbers float as if they were quality signals. They are placeholder telemetry for a future methodology, not evidence of governance quality today.
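For reference, the diagnostics themselves are simple to compute from per-scenario records. A sketch, assuming each record carries the retrieved ranking, the expected decision ids, and the (currently empty) acceptable_decision_ids; the record field names are assumptions:

```python
def recall_at_1(records: list[dict]) -> float:
    """Fraction of scenarios whose top-ranked decision is expected.
    Tracked as a diagnostic; never part of pass/fail."""
    hits = sum(1 for r in records if r["retrieved"][0] in r["expected_ids"])
    return hits / len(records)

def precision_at_k(records: list[dict], k: int) -> float:
    """Fraction of the top-K retrieved decisions that are expected or
    acceptable. With small expected sets and empty acceptable_decision_ids,
    this is structurally pinned, as noted above."""
    relevant = total = 0
    for r in records:
        ok = set(r["expected_ids"]) | set(r.get("acceptable_decision_ids", []))
        top = r["retrieved"][:k]
        relevant += sum(1 for d in top if d in ok)
        total += len(top)
    return relevant / total
```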
What Mneme intentionally does not solve
A wedge is only meaningful if its boundary is explicit. The following are not on Mneme's roadmap. They are not "later" — they are not Mneme:
- Generalized agent memory. Mneme is not a vector store, not a conversational memory, not an "AI memory" product.
- Autonomous planning. No multi-step agent loops, no tool-use orchestration.
- Prompt optimization. Mneme does not rewrite prompts to be "better"; it blocks ones that violate governance.
- Long-term conversational memory. Not a chat history system.
- Enterprise workflow orchestration. Not a workflow engine.
- Deployment governance and runtime observability. Not an APM, not a release-pipeline policy tool.
- Code-generation quality scoring. Mneme does not rate the quality of generated code; it checks whether generation violated a recorded decision.
- Auto-fixing code. Mneme does not edit code. It blocks. The human or the model fixes.
If a feature request maps onto this list, the answer is no — not "later," not "out of scope for now," but not Mneme.
Layer 1 framing
Mneme is currently in Layer 1 — local-repo, single-developer, project-scoped architectural governance. The mechanism is frozen at e73ff7d. The validation phase tests the mechanism in real repos with real users; it does not extend the mechanism. Layer 2 territory (multi-repo, team sync, org policy distribution) opens only after Layer 1 exit criteria are met.
This page is a public summary of the philosophy. The full architectural freeze artifact lives in the repo at docs/architecture/layer1-freeze-e73ff7d.md. The full methodology specification with metric definitions, suite composition, verdict thresholds, and reproducibility protocol lives at /benchmark/.
What this means for evaluators
If you are evaluating Mneme as a design partner, technical buyer, or contributor, the artifacts to look at are:
- The discipline shown by the freeze — refusing to promote recall@1, pinning the tie-order, splitting symmetry into its own PR.
- The auditability of every PASS — which decision matched, which rule triggered, which term in the input fired it.
- The reproducibility of the suite — same memory plus same query yields byte-identical retrieval order; the sketch below shows one way to check it.
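That last property is mechanically checkable. A sketch, reusing the hypothetical `retrieve` function from the earlier sketch; the hashing step is an illustration, not part of the shipped suite:

```python
import hashlib
import json

def retrieval_fingerprint(memory: list[dict], query: str) -> str:
    """Serialize the retrieval order canonically and hash it. The same
    memory and query must produce the same digest on every run; any
    difference points at a code change, not model variance."""
    order = [d["id"] for d in retrieve(memory, query)]
    payload = json.dumps(order, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Two digests that differ between runs falsify the reproducibility claim; two that match make the retrieval order auditable byte for byte.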
Mneme is not competing on eval-score inflation, model-performance leaderboards, or coding-benchmark culture. It competes on architectural continuity, governance reliability, reproducibility, and deterministic enforcement. The benchmark methodology exists to make those claims falsifiable.