When an engineer asks "did the governance check pass?" they implicitly assume the question has a stable answer. Pass means pass — the same code, the same rules, the same result. That assumption is the foundation of every useful governance system, from unit tests to CI gates to security scanners. Without it, the question doesn't have a meaningful answer. It has a distribution of possible answers, varying by run conditions.
Deterministic enforcement is the name for the property that makes governance questions have stable answers. It is not a feature or an optimization goal. It is a structural requirement — the minimum bar below which "governance" is better described as "probabilistic suggestion."
What deterministic enforcement actually means
Determinism in the governance context has three components that are often treated as one but require separate treatment:
Retrieval determinism
Same query + same corpus = same ranked decisions. The retrieval layer must return the identical set of constraint records for the identical query, regardless of when the query is made, what model is deployed, or what other queries have been run recently. Retrieval determinism is the precondition for everything downstream — if the retrieval layer varies, all subsequent enforcement varies with it.
Context determinism
Same decisions + same formatting = same injected prompt. Once the constraint records are retrieved, the context builder must produce the identical prompt injection for identical inputs. This includes field ordering, whitespace, and serialization format. If the injected context varies across runs, the model is receiving different instructions for the same governance check.
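As an illustration, a context builder can make this property hold by construction: order the retrieved records by a stable key and serialize them with a canonical encoder. The build_context helper and record fields below are a minimal sketch under assumed field names, not the system's actual API.

```python
import json

def build_context(decisions: list[dict]) -> str:
    """Serialize retrieved constraint records into a prompt injection.

    Deterministic by construction: records are sorted by a stable key,
    keys are emitted in sorted order, and separators are fixed, so
    identical inputs always produce a byte-identical string.
    """
    ordered = sorted(decisions, key=lambda d: d["decision_id"])  # stable record order
    return "\n".join(
        json.dumps(d, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
        for d in ordered
    )
```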
Evaluation determinism
Same model output + same constraints = same verdict. The evaluator must produce the identical verdict for identical inputs — regardless of when the evaluation runs, which evaluator instance processes it, or what the evaluator's internal state is. Evaluation determinism means the verdict is a pure function of (output, constraint_records) with no hidden state.
Determinism must hold at every layer. A system that achieves retrieval determinism but not evaluation determinism still produces unpredictable verdicts. A system that achieves context determinism but not retrieval determinism is building deterministic context on top of an unstable retrieval foundation. All three components must be deterministic for the governance system to be auditable.
Why this problem exists in AI-native development
The problem is structural: the most natural approaches to AI-native governance introduce non-determinism at exactly the points where determinism is required.
Consider how most teams approach governance retrieval: they embed decision documentation in a vector store and retrieve semantically relevant passages at query time. Embedding-based retrieval is designed to be approximate — it finds passages that are semantically close to the query, using cosine similarity in high-dimensional space. That approximation is the feature. But it also means the retrieved set depends on the embedding model, the index, and the state of the corpus, not just on the query and the rules being enforced. That dependence is non-determinism at the retrieval layer.
The non-determinism compounds across time:
- Embedding models are updated and replaced. The same text produces different embeddings after a model update, changing similarity scores and passage rankings silently.
- Vector indexes are rebuilt. Index structure affects nearest-neighbor results at the margin — the same query might find a slightly different set of passages depending on when the index was built.
- New documents are added to the corpus. Adding a new ADR can change the relative similarity scores of existing documents, causing previously-retrieved passages to drop out of the top-K results.
None of these changes require any modification to the code being governed or the rules being enforced. They are infrastructure events. But they change the governance verdict, silently, for code that hasn't changed. A developer who passed a governance check on Monday might fail on Tuesday not because their code violated a rule, but because the embedding model was updated and a relevant passage dropped below the retrieval threshold.
A governance failure that wasn't caused by a code change cannot be meaningfully acted on. It is noise in the governance signal — indistinguishable from a real violation, but caused by retrieval drift. Teams that rely on probabilistic retrieval cannot distinguish genuine violations from retrieval variance in their governance history.
The common misread: confusing compliance with determinism
The most pervasive anti-pattern is building governance systems that rely on model-dependent compliance — where the governance check passes because the model "chose to follow" the injected constraints, not because the system enforced them.
Model-dependent compliance has three failure modes:
- RAG retrieval (probabilistic). Constraints reach the model through probabilistic retrieval. If the relevant constraint isn't retrieved, the model is never told about it and cannot comply. The governance check passes by default — a false pass, not a meaningful pass.
- Model-dependent evaluation. The evaluator asks the model whether the generated code complied. The model might say yes when it actually violated a constraint, or say no when it didn't. The verdict is a model inference, not a structural check.
- Prompt engineering as enforcement. The governance system nudges the model toward compliance through prompt language ("always use PostgreSQL") rather than checking structured output against structured constraints. Nudges are probabilistic — they work most of the time, fail at the margin, and fail silently.
All three approaches introduce non-determinism at critical points in the governance chain. A team that relies on any of them cannot audit their governance history meaningfully — they cannot distinguish "passed because code was compliant" from "passed because the model got lucky." The table below summarizes each approach, its source of non-determinism, and the deterministic alternative.
| Approach | Non-determinism source | Deterministic alternative |
|---|---|---|
| RAG retrieval | Embedding scores, index state | Fixed-weight keyword scoring |
| Model compliance | Temperature, context order | Structured output + evaluator |
| Prompt nudging | Model interpretation | Constraint record + binary verdict |
| Semantic matching | Embedding model updates | Exact field matching + scope filter |
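To make the right-hand column concrete, here is a minimal sketch of the "constraint record + binary verdict" alternative next to a prose nudge. The field names, the ADR id, and the verdict helper are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass

# Prompt nudging: a sentence the model may or may not honor.
NUDGE = "Always use PostgreSQL for persistent storage."

# Deterministic alternative: a structured constraint record that an evaluator
# can check with exact field matching, producing a binary verdict.
@dataclass(frozen=True)
class ConstraintRecord:
    decision_id: str
    field: str            # e.g. "database"
    required_value: str   # e.g. "postgresql"
    scope: str            # e.g. "services/billing"

def verdict(assertion: dict, constraint: ConstraintRecord) -> bool:
    """Exact match on the structured field; no interpretation of phrasing."""
    return assertion.get(constraint.field) == constraint.required_value

# verdict({"database": "postgresql"}, ConstraintRecord("ADR-012", "database", "postgresql", "services/billing"))  -> True
# verdict({"database": "mysql"},      ConstraintRecord("ADR-012", "database", "postgresql", "services/billing"))  -> False
```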
How this fits the AI SDLC
Deterministic enforcement is required at two layers of the governance stack:
Layer 1: Retrieval
At the retrieval layer, determinism means the DecisionRetriever uses fixed-weight keyword scoring over structured decision records — no embedding model, no vector store, no ML in the path. The same query tokenizes to the same token set, scores against the same fields with the same weights, and returns the same ranked decisions. The retrieval is a pure function of (query, corpus). Same inputs, same result, every time.
This is the architectural choice that makes Layer 2 meaningful. If Layer 1 varies, the constraints injected at Layer 2 vary, and the verdict at Layer 2 varies. Determinism at Layer 1 is the foundation.
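As a sketch only, fixed-weight keyword scoring over structured records might look like the following. The class name DecisionRetriever comes from the description above; the field weights, tokenizer, record shape, and tie-break rule are illustrative assumptions rather than the actual implementation.

```python
import re

# Illustrative field weights: fixed constants, never learned or updated by a model.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a pure function of the input string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class DecisionRetriever:
    def __init__(self, corpus: list[dict]):
        # Structured decision records, e.g. {"id": ..., "title": ..., "tags": ..., "body": ...}
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        query_tokens = tokenize(query)
        scored = []
        for record in self.corpus:
            score = sum(
                weight * len(query_tokens & tokenize(str(record.get(field, ""))))
                for field, weight in FIELD_WEIGHTS.items()
            )
            scored.append((score, record))
        # Tie-break on the record id so equal scores still rank deterministically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
        return [record for _, record in scored[:top_k]]
```

One design detail matters here: ranking determinism also requires a deterministic tie-break, which is why the sketch orders equal scores by record id instead of leaving their order to whatever sequence the corpus happened to be loaded in.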
Layer 2: Evaluation
At the evaluation layer, determinism means the Evaluator checks structured output against structured constraints — not natural language compliance, not model self-assessment. The model is required to produce structured assertions about which constraints it followed. The evaluator checks each assertion against the corresponding constraint record using exact matching on the structured fields. The verdict is a function of (assertions, constraint_records). Same inputs, same verdict.
The model's temperature affects how the assertions are phrased, but the evaluator doesn't check phrasing. It checks the structured fields. A model that writes compliant code but phrases its assertion awkwardly still passes. A model that writes non-compliant code but phrases its assertion confidently still fails. The evaluation is structural, not linguistic.
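A minimal sketch of that structural check, assuming the assertions arrive as a simple field-to-value mapping and the constraint records as dicts; the Evaluator interface shown here is an assumption, not the actual implementation.

```python
class Evaluator:
    """Pure function of (assertions, constraint_records): no model call, no hidden state."""

    def evaluate(self, assertions: dict[str, str], constraint_records: list[dict]) -> dict:
        violations = []
        for record in constraint_records:
            asserted = assertions.get(record["field"])      # what the model claims it did
            if asserted != record["required_value"]:        # exact match on the structured field
                violations.append({
                    "decision_id": record["decision_id"],
                    "expected": record["required_value"],
                    "actual": asserted,                     # None means the constraint was never asserted
                })
        return {"passed": not violations, "violations": violations}

# Same inputs, same verdict, every time:
# Evaluator().evaluate(
#     {"database": "postgresql"},
#     [{"decision_id": "ADR-012", "field": "database", "required_value": "postgresql"}],
# )
# -> {"passed": True, "violations": []}
```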
The non-deterministic LLM is sandwiched between two deterministic layers. Layer 1 deterministically retrieves constraints. Layer 2 deterministically evaluates outputs against them. The model's non-determinism is bounded — it affects what code is generated, not whether the generated code is evaluated correctly.
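Sketched as a pipeline with hypothetical collaborators, the sandwich looks roughly like this; only the llm step is non-deterministic, and its output format (code plus structured assertions) is an assumption.

```python
def governed_generation(query: str, retriever, build_context, llm, evaluator) -> dict:
    """Hypothetical pipeline: two deterministic layers bracket the non-deterministic model."""
    records = retriever.retrieve(query)                # Layer 1: pure function of (query, corpus)
    prompt = build_context(records)                    # deterministic serialization of the records
    code, assertions = llm.generate(prompt)            # the only non-deterministic step in the chain
    report = evaluator.evaluate(assertions, records)   # Layer 2: pure function of (assertions, records)
    return {"code": code, **report}                    # the verdict does not depend on how the model phrased anything
```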
The benchmark consequence
Deterministic enforcement has a concrete consequence for governance benchmarking: benchmark results are only comparable across versions if enforcement is deterministic.
If a benchmark run shows that pass_rate improved from 0.85 to 1.00 between versions, that improvement must mean the governance system improved — more scenarios where code was generated compliantly. It cannot mean the retrieval layer happened to return better decisions by chance, or that the model happened to be more compliant on this particular run due to temperature variance.
Non-deterministic governance systems cannot produce comparable benchmarks. Every run is a sample from a distribution, and comparing two samples from potentially different distributions doesn't tell you whether the governance system improved. Deterministic governance systems produce comparable benchmarks because the same scenarios, same corpus, and same evaluation logic guarantee that differences in outcomes reflect genuine differences in governance quality.
This is why Mneme's benchmark suite is a merge gate. The gate only has meaning because the enforcement is deterministic: a pass_rate regression is a genuine regression, not a sampling artifact.
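A rough sketch of how such a gate might compare a run against a recorded baseline; the result format (a JSON list of per-scenario verdicts) and the zero-tolerance threshold are assumptions about the setup, not Mneme's actual tooling.

```python
import json
import sys

def pass_rate(results: list[dict]) -> float:
    """Fraction of benchmark scenarios whose governance verdict was a pass."""
    return sum(1 for r in results if r["passed"]) / len(results)

def check_gate(current_path: str, baseline_path: str) -> int:
    with open(current_path) as f:
        current = pass_rate(json.load(f))
    with open(baseline_path) as f:
        baseline = pass_rate(json.load(f))
    # With deterministic enforcement there is no sampling noise to tolerate:
    # any drop below the recorded baseline is a genuine regression and blocks the merge.
    if current < baseline:
        print(f"pass_rate regressed: {current:.2f} < baseline {baseline:.2f}")
        return 1
    print(f"pass_rate {current:.2f} (baseline {baseline:.2f}); gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(check_gate(sys.argv[1], sys.argv[2]))
```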