When an engineer asks "did the governance check pass?" they implicitly assume the question has a stable answer. Pass means pass — the same code, the same rules, the same result. That assumption is the foundation of every useful governance system, from unit tests to CI gates to security scanners. Without it, the question doesn't have a meaningful answer. It has a distribution of possible answers, varying by run conditions.
Deterministic enforcement is the name for the property that makes governance questions have stable answers. It is not a feature or an optimization goal. It is a structural requirement — the minimum bar below which "governance" is better described as "probabilistic suggestion."
What deterministic enforcement actually means
Determinism in the governance context has three components that are often treated as one but require separate treatment:
Retrieval determinism
Same query + same corpus = same ranked decisions. The retrieval layer must return the identical set of constraint records for the identical query, regardless of when the query is made, what model is deployed, or what other queries have been run recently. Retrieval determinism is the precondition for everything downstream — if the retrieval layer varies, all subsequent enforcement varies with it.
Context determinism
Same decisions + same formatting = same injected prompt. Once the constraint records are retrieved, the context builder must produce the identical prompt injection for identical inputs. This includes field ordering, whitespace, and serialization format. If the injected context varies across runs, the model is receiving different instructions for the same governance check.
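As an illustration, a context builder can make this property hold by construction: order the retrieved records by a stable key and serialize them with a canonical encoder. The build_context helper and record fields below are a minimal sketch under assumed field names, not the system's actual API.

```python
import json

def build_context(decisions: list[dict]) -> str:
    """Serialize retrieved constraint records into a prompt injection.

    Deterministic by construction: records are sorted by a stable key,
    keys are emitted in sorted order, and separators are fixed, so
    identical inputs always produce a byte-identical string.
    """
    ordered = sorted(decisions, key=lambda d: d["decision_id"])  # stable record order
    return "\n".join(
        json.dumps(d, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
        for d in ordered
    )
```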
Evaluation determinism
Same model output + same constraints = same verdict. The evaluator must produce the identical verdict for identical inputs — regardless of when the evaluation runs, which evaluator instance processes it, or what the evaluator's internal state is. Evaluation determinism means the verdict is a pure function of (output, constraint_records) with no hidden state.
Determinism must hold at every layer. A system that achieves retrieval determinism but not evaluation determinism still produces unpredictable verdicts. A system that achieves context determinism but not retrieval determinism is building deterministic context on top of an unstable retrieval foundation. All three components must be deterministic for the governance system to be auditable.
Why this problem exists in AI-native development
The problem is structural: the most natural approaches to AI-native governance introduce non-determinism at exactly the points where determinism is required.
Consider how most teams approach governance retrieval: they embed decision documentation in a vector store and retrieve semantically relevant passages at query time. Embedding-based retrieval is designed to be approximate — it finds passages that are semantically close to the query, using cosine similarity in high-dimensional space. That approximation is the feature. But it also means the retrieved set depends on the embedding model, the index, and the state of the corpus, not just on the query and the rules being enforced. That dependence is non-determinism at the retrieval layer.
The non-determinism compounds across time:
- Embedding models are updated and replaced. The same text produces different embeddings after a model update, changing similarity scores and passage rankings silently.
- Vector indexes are rebuilt. Index structure affects nearest-neighbor results at the margin — the same query might find a slightly different set of passages depending on when the index was built.
- New documents are added to the corpus. Adding a new ADR can change the relative similarity scores of existing documents, causing previously-retrieved passages to drop out of the top-K results.
None of these changes require any modification to the code being governed or the rules being enforced. They are infrastructure events. But they change the governance verdict, silently, for code that hasn't changed. A developer who passed a governance check on Monday might fail on Tuesday not because their code violated a rule, but because the embedding model was updated and a relevant passage dropped below the retrieval threshold.
A governance failure that wasn't caused by a code change cannot be meaningfully acted on. It is noise in the governance signal — indistinguishable from a real violation, but caused by retrieval drift. Teams that rely on probabilistic retrieval cannot distinguish genuine violations from retrieval variance in their governance history.
The common misread: confusing compliance with determinism
The most pervasive anti-pattern is building governance systems that rely on model-dependent compliance — where the governance check passes because the model "chose to follow" the injected constraints, not because the system enforced them.
Model-dependent compliance has three failure modes:
- RAG retrieval (probabilistic). Constraints reach the model through probabilistic retrieval. If the relevant constraint isn't retrieved, the model is never told about it and cannot comply. The governance check passes by default — a false pass, not a meaningful pass.
- Model-dependent evaluation. The evaluator asks the model whether the generated code complied. The model might say yes when it actually violated a constraint, or say no when it didn't. The verdict is a model inference, not a structural check.
- Prompt engineering as enforcement. The governance system nudges the model toward compliance through prompt language ("always use PostgreSQL") rather than checking structured output against structured constraints. Nudges are probabilistic — they work most of the time, fail at the margin, and fail silently.
All three approaches introduce non-determinism at critical points in the governance chain. A team that relies on any of them cannot audit their governance history meaningfully — they cannot distinguish "passed because code was compliant" from "passed because the model got lucky." The table below summarizes each approach, its source of non-determinism, and the deterministic alternative.
| Approach | Non-determinism source | Deterministic alternative |
|---|---|---|
| RAG retrieval | Embedding scores, index state | Fixed-weight keyword scoring |
| Model compliance | Temperature, context order | Structured output + evaluator |
| Prompt nudging | Model interpretation | Constraint record + binary verdict |
| Semantic matching | Embedding model updates | Exact field matching + scope filter |
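To make the right-hand column concrete, here is a minimal sketch of the "constraint record + binary verdict" alternative next to a prose nudge. The field names, the ADR id, and the verdict helper are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass

# Prompt nudging: a sentence the model may or may not honor.
NUDGE = "Always use PostgreSQL for persistent storage."

# Deterministic alternative: a structured constraint record that an evaluator
# can check with exact field matching, producing a binary verdict.
@dataclass(frozen=True)
class ConstraintRecord:
    decision_id: str
    field: str            # e.g. "database"
    required_value: str   # e.g. "postgresql"
    scope: str            # e.g. "services/billing"

def verdict(assertion: dict, constraint: ConstraintRecord) -> bool:
    """Exact match on the structured field; no interpretation of phrasing."""
    return assertion.get(constraint.field) == constraint.required_value

# verdict({"database": "postgresql"}, ConstraintRecord("ADR-012", "database", "postgresql", "services/billing"))  -> True
# verdict({"database": "mysql"},      ConstraintRecord("ADR-012", "database", "postgresql", "services/billing"))  -> False
```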
How this fits the AI SDLC
Deterministic enforcement is required at two layers of the governance stack:
Layer 1: Retrieval
At the retrieval layer, determinism means the DecisionRetriever uses fixed-weight keyword scoring over structured decision records — no embedding model, no vector store, no ML in the path. The same query tokenizes to the same token set, scores against the same fields with the same weights, and returns the same ranked decisions. The retrieval is a pure function of (query, corpus). Same inputs, same result, every time.
This is the architectural choice that makes Layer 2 meaningful. If Layer 1 varies, the constraints injected at Layer 2 vary, and the verdict at Layer 2 varies. Determinism at Layer 1 is the foundation.
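As a sketch only, fixed-weight keyword scoring over structured records might look like the following. The class name DecisionRetriever comes from the description above; the field weights, tokenizer, record shape, and tie-break rule are illustrative assumptions rather than the actual implementation.

```python
import re

# Illustrative field weights: fixed constants, never learned or updated by a model.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a pure function of the input string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class DecisionRetriever:
    def __init__(self, corpus: list[dict]):
        # Structured decision records, e.g. {"id": ..., "title": ..., "tags": ..., "body": ...}
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        query_tokens = tokenize(query)
        scored = []
        for record in self.corpus:
            score = sum(
                weight * len(query_tokens & tokenize(str(record.get(field, ""))))
                for field, weight in FIELD_WEIGHTS.items()
            )
            scored.append((score, record))
        # Tie-break on the record id so equal scores still rank deterministically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
        return [record for _, record in scored[:top_k]]
```

One design detail matters here: ranking determinism also requires a deterministic tie-break, which is why the sketch orders equal scores by record id instead of leaving their order to whatever sequence the corpus happened to be loaded in.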
Layer 2: Evaluation
At the evaluation layer, determinism means the Evaluator checks structured output against structured constraints — not natural language compliance, not model self-assessment. The model is required to produce structured assertions about which constraints it followed. The evaluator checks each assertion against the corresponding constraint record using exact matching on the structured fields. The verdict is a function of (assertions, constraint_records). Same inputs, same verdict.
The model's temperature affects how the assertions are phrased, but the evaluator doesn't check phrasing. It checks the structured fields. A model that writes compliant code but phrases its assertion awkwardly still passes. A model that writes non-compliant code but phrases its assertion confidently still fails. The evaluation is structural, not linguistic.
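A minimal sketch of that structural check, assuming the assertions arrive as a simple field-to-value mapping and the constraint records as dicts; the Evaluator interface shown here is an assumption, not the actual implementation.

```python
class Evaluator:
    """Pure function of (assertions, constraint_records): no model call, no hidden state."""

    def evaluate(self, assertions: dict[str, str], constraint_records: list[dict]) -> dict:
        violations = []
        for record in constraint_records:
            asserted = assertions.get(record["field"])      # what the model claims it did
            if asserted != record["required_value"]:        # exact match on the structured field
                violations.append({
                    "decision_id": record["decision_id"],
                    "expected": record["required_value"],
                    "actual": asserted,                     # None means the constraint was never asserted
                })
        return {"passed": not violations, "violations": violations}

# Same inputs, same verdict, every time:
# Evaluator().evaluate(
#     {"database": "postgresql"},
#     [{"decision_id": "ADR-012", "field": "database", "required_value": "postgresql"}],
# )
# -> {"passed": True, "violations": []}
```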
The non-deterministic LLM is sandwiched between two deterministic layers. Layer 1 deterministically retrieves constraints. Layer 2 deterministically evaluates outputs against them. The model's non-determinism is bounded — it affects what code is generated, not whether the generated code is evaluated correctly.
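Sketched as a pipeline with hypothetical collaborators, the sandwich looks roughly like this; only the llm step is non-deterministic, and its output format (code plus structured assertions) is an assumption.

```python
def governed_generation(query: str, retriever, build_context, llm, evaluator) -> dict:
    """Hypothetical pipeline: two deterministic layers bracket the non-deterministic model."""
    records = retriever.retrieve(query)                # Layer 1: pure function of (query, corpus)
    prompt = build_context(records)                    # deterministic serialization of the records
    code, assertions = llm.generate(prompt)            # the only non-deterministic step in the chain
    report = evaluator.evaluate(assertions, records)   # Layer 2: pure function of (assertions, records)
    return {"code": code, **report}                    # the verdict does not depend on how the model phrased anything
```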
The benchmark consequence
Deterministic enforcement has a concrete consequence for governance benchmarking: benchmark results are only comparable across versions if enforcement is deterministic.
If a benchmark run shows that pass_rate improved from 0.85 to 1.00 between versions, that improvement must mean the governance system improved — more scenarios where code was generated compliantly. It cannot mean the retrieval layer happened to return better decisions by chance, or that the model happened to be more compliant on this particular run due to temperature variance.
Non-deterministic governance systems cannot produce comparable benchmarks. Every run is a sample from a distribution, and comparing two samples from potentially different distributions doesn't tell you whether the governance system improved. Deterministic governance systems produce comparable benchmarks because the same scenarios, same corpus, and same evaluation logic guarantee that differences in outcomes reflect genuine differences in governance quality.
This is why Mneme's benchmark suite is a merge gate. The gate only has meaning because the enforcement is deterministic: a pass_rate regression is a genuine regression, not a sampling artifact.
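A rough sketch of how such a gate might compare a run against a recorded baseline; the result format (a JSON list of per-scenario verdicts) and the zero-tolerance threshold are assumptions about the setup, not Mneme's actual tooling.

```python
import json
import sys

def pass_rate(results: list[dict]) -> float:
    """Fraction of benchmark scenarios whose governance verdict was a pass."""
    return sum(1 for r in results if r["passed"]) / len(results)

def check_gate(current_path: str, baseline_path: str) -> int:
    with open(current_path) as f:
        current = pass_rate(json.load(f))
    with open(baseline_path) as f:
        baseline = pass_rate(json.load(f))
    # With deterministic enforcement there is no sampling noise to tolerate:
    # any drop below the recorded baseline is a genuine regression and blocks the merge.
    if current < baseline:
        print(f"pass_rate regressed: {current:.2f} < baseline {baseline:.2f}")
        return 1
    print(f"pass_rate {current:.2f} (baseline {baseline:.2f}); gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(check_gate(sys.argv[1], sys.argv[2]))
```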