Architectural governance for AI coding agents has a retrieval problem at its core: when an AI agent is about to generate code, the governance system needs to surface the correct decisions from a potentially large corpus in real time, before generation completes. That retrieval must be fast, correct, and — critically — reproducible.

Mneme solves this with a deterministic keyword scorer over structured decision records. No probability, no approximation, no model inference in the retrieval path. The same query against the same memory file always returns the same ranked list.

The five-stage pipeline

Every governance check in Mneme runs through five sequential stages. Understanding these stages is the prerequisite for understanding both the retrieval mechanics and the evaluation layer that uses the retrieved results.

[Fig. 1 diagram: query (file path + task) → MemoryStore (loads + parses project_memory.json) → DecisionRetriever (keyword score + tag boost, top-K=3) → ContextBuilder (formats decisions for prompt) → LLMAdapter (injects context, calls model) → Evaluator (PASS / FAIL / WEAK_RETRIEVAL / MALFORMED) → verdict + trace. Layer 1 = retrieval, Layer 2 = verdict.]
Fig. 1 — Mneme's five-stage governance pipeline. Layer 1 (retrieval) and Layer 2 (verdict) are distinct evaluation surfaces with separate metrics.

Stage 1: MemoryStore — loading the decision corpus

Every Mneme run begins by loading project_memory.json into a MemoryStore. The store deserializes the JSON into typed records. The retrieval-eligible pool is the subset of records whose type is Decision — which includes native decision items and any items migrated to decision type (typically rule and anti_pattern records).

Items of type preference, fact, and example are not eligible for retrieval without an explicit migration step — this is intentional. Retrieval pool composition is a methodology decision, not a tuning parameter.

The store also enforces ID uniqueness at load time: if two records share an ID, the first-seen record wins and the duplicate is discarded. This determinism property is tested and enforced.
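
A minimal sketch of that load path, assuming a flat "items" array and an illustrative record shape (the field and function names here are assumptions, not Mneme's actual classes):

import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    id: str
    type: str                   # "decision", "preference", "fact", "example", ...
    title: str = ""
    constraint: str = ""
    tags: tuple = ()
    content: str = ""
    scope: str | None = None    # optional glob restricting applicable file paths

def load_memory(path: str) -> dict[str, Record]:
    # Deserialize project_memory.json; on duplicate IDs, the first-seen record wins.
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    records: dict[str, Record] = {}
    for item in raw.get("items", []):    # assumed top-level "items" key
        rec = Record(**{k: v for k, v in item.items() if k in Record.__dataclass_fields__})
        records.setdefault(rec.id, rec)  # duplicates are discarded deterministically
    return records

def retrieval_pool(records: dict[str, Record]) -> list[Record]:
    # Only records of type "decision" are retrieval-eligible.
    return [r for r in records.values() if r.type == "decision"]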

Stage 2: DecisionRetriever — scoring and ranking

The retriever is the core of Mneme's approach. It answers one question: given this query string, which decisions are most relevant? It answers that question deterministically, using a weighted keyword scorer.

Tokenization

The query is first tokenized into a set of lowercase terms. The tokenizer filters out stopwords, applies a minimum length floor, strips punctuation, and optionally applies stemming. The same tokenizer is applied to each field of each candidate decision record. Tokenization is deterministic: the same input string always produces the same token set.
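
A sketch of such a tokenizer, with an illustrative stopword set and length floor (the real lists and thresholds live in configuration, and stemming is omitted here):

import re

_STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with"}  # illustrative subset
_MIN_LEN = 3                                                             # assumed length floor

def tokenize(text: str) -> set[str]:
    # Lowercase, strip punctuation, drop stopwords and very short tokens.
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {w for w in words if len(w) >= _MIN_LEN and w not in _STOPWORDS}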

Field scoring

Each decision record has several scored fields. Token overlap between the query tokens and each field's token set is computed, then multiplied by that field's weight:

# _WEIGHTS — per-field scoring coefficients
title:      3.0   # highest — title is the authoritative description
tags:       2.5   # high — curated signals, explicit match intent
constraint: 1.5   # medium — the enforced rule text
content:    1.0   # baseline — rationale / context prose

A record's raw score is the sum of weighted overlap across all fields. Tag matching also receives an additive flat boost per matching tag (on top of the weight-scaled overlap), which rewards decisions that were explicitly curated for a domain.
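
Putting the weights and the tag boost together, a per-record scoring function might look like this, reusing the Record and tokenize sketches above (the boost value is an assumption; the weights mirror the table):

_WEIGHTS = {"title": 3.0, "tags": 2.5, "constraint": 1.5, "content": 1.0}
_TAG_BOOST = 0.5   # assumed flat bonus per matching tag

def score(query_tokens: set[str], record: Record) -> float:
    total = 0.0
    for field, weight in _WEIGHTS.items():
        text = " ".join(record.tags) if field == "tags" else getattr(record, field)
        total += weight * len(query_tokens & tokenize(text))
    # Additive flat boost for every curated tag that appears in the query.
    total += _TAG_BOOST * sum(1 for t in record.tags if tokenize(t) & query_tokens)
    return total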

Top-K selection

After all candidate decisions are scored, the results are sorted descending by score. The top K=3 are returned as the retrieval output. If two records have identical scores, tie-breaking is deterministic — same order in the memory file, same tie-break result.

Why K=3? Three decisions fit within a model's active attention without crowding the task prompt. Benchmarking showed that recall@3 covers all five governed scenarios at 100% with the current decision pool. K is pinned — changing it requires a methodology ADR because it affects benchmark comparability.

The retrieval output is a ranked list of decision records with their scores and field-level match evidence. This output feeds directly into the ContextBuilder.
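
In terms of the sketches above, top-K selection reduces to a stable sort over the scored pool; Python's sort is stable, so equal scores keep memory-file order, which is what makes the tie-break deterministic:

def top_k(query: str, decisions: list[Record], k: int = 3) -> list[tuple[float, Record]]:
    q = tokenize(query)
    scored = [(score(q, d), d) for d in decisions]        # pool is in memory-file order
    scored.sort(key=lambda pair: pair[0], reverse=True)   # stable sort: ties keep file order
    return scored[:k]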

[Fig. 2 diagram: query → token set → per-field overlap (title × 3.0, tags × 2.5 + flat boost, constraint × 1.5, content × 1.0) → raw score per record → sort → top K=3.]
Fig. 2 — DecisionRetriever scoring: token overlap against each field, weighted, summed to a raw score, then top-K selection.

Stage 3: ContextBuilder — formatting for injection

The ContextBuilder takes the ranked list from the retriever and converts it into a prompt-ready string. Each retrieved decision is serialized with its ID, title, constraint text, and tags. The formatting is consistent and schema-governed — the LLM sees the same structure for every governance check, which improves constraint adherence.

The ContextBuilder does not filter or re-rank the decisions it receives. Retrieval ordering is preserved. Any post-retrieval modification would introduce non-determinism into the injection path, which is explicitly not allowed.
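
A sketch of that formatting step; the exact layout of the injected block is an assumption, but the key properties are the ones stated above: fixed structure, retrieval order preserved, no filtering:

def build_context(retrieved: list[tuple[float, Record]]) -> str:
    blocks = []
    for _score, decision in retrieved:    # retrieval order is preserved, never re-ranked
        blocks.append(
            f"[{decision.id}] {decision.title}\n"
            f"Constraint: {decision.constraint}\n"
            f"Tags: {', '.join(decision.tags)}"
        )
    return "\n\n".join(blocks)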

Stage 4: LLMAdapter — context injection

The LLMAdapter wraps the underlying model call. It takes the formatted context from the ContextBuilder, combines it with the original task prompt, and calls the configured LLM. The governance context is injected as a system-level instruction — placed before the task description, with explicit language identifying the decisions as authoritative constraints.

The adapter is model-agnostic. Any model behind an OpenAI-compatible endpoint works without changes to the retrieval or evaluation stages.
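
A sketch of that injection against an OpenAI-compatible client; the system-prompt wording is illustrative, not Mneme's actual phrasing:

from openai import OpenAI

def governed_completion(client: OpenAI, model: str, context: str, task: str) -> str:
    # Governance context goes in ahead of the task, framed as authoritative constraints.
    system = (
        "The following architectural decisions are authoritative constraints. "
        "Any code you produce must comply with all of them.\n\n" + context
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content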

Stage 5: Evaluator — the verdict layer

The Evaluator inspects the model's output against the injected decisions and emits a structured verdict. This is Layer 2 evaluation — separate from the retrieval metrics computed at Layer 1. The five possible verdicts are:

Verdict          Meaning
PASS             Output respects all injected constraints
FAIL             Output violates one or more constraints
WEAK             Partial compliance or soft violation
WEAK_RETRIEVAL   Expected decision not in top-K — Layer 1 failure
MALFORMED        Output could not be parsed against the schema

WEAK_RETRIEVAL is the critical verdict for understanding retrieval quality. It fires when the evaluator expected a specific decision ID to be present in the injected context, but that ID was not in the top-K retrieved results. This means the model was never given the constraint it was supposed to enforce — a retrieval failure that causes an enforcement gap.

WEAK_RETRIEVAL is a Layer 1 failure surfaced at Layer 2. The evaluator sees that recall dropped — the expected decision ID wasn't retrieved — and downgrades the scenario from PASS. A zero WEAK_RETRIEVAL count across the benchmark suite is a merge-gate requirement.
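
In code terms, the check is a set-membership test against the retrieved IDs, run before any grading of the model output (a sketch; the real evaluator does more):

def check_retrieval(expected_id: str, retrieved: list[tuple[float, Record]]) -> str | None:
    # A Layer 1 failure surfaced as a Layer 2 verdict: the constraint never reached the model.
    retrieved_ids = {d.id for _, d in retrieved}
    return None if expected_id in retrieved_ids else "WEAK_RETRIEVAL"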

Layer 1 and Layer 2 metrics

The pipeline produces two distinct sets of metrics, each with different authority:

Metric                       Layer     Role
recall@K per scenario        Layer 1   Regression guard — drop below 1.0 blocks merge
recall@1 per scenario        Layer 1   Sharpest tuning signal — rank-1 precision
precision@K (suite mean)     Layer 1   Advisory — fixture-shape constrained at K=3
irrelevant_injection_rate    Layer 1   Advisory — mechanically 1.0 given current fixture shape
pass_rate                    Layer 2   Authoritative merge gate — must remain 1.00
WEAK_RETRIEVAL count         Layer 2   Authoritative — must remain 0

Recall@K is the primary regression guard because it directly measures whether the expected decision survived the K=3 cutoff. At the current benchmark suite shape (one expected ID per governed scenario), recall@K can only be 0.0 or 1.0 — it has no headroom upward and can only regress. That makes it an ideal invariant.
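
Concretely, with one expected ID per governed scenario, per-scenario recall@K collapses to a membership test, and the suite-level value is the mean over scenarios (a sketch under that fixture shape):

def recall_at_k(expected_id: str, retrieved_ids: list[str], k: int = 3) -> float:
    # Binary per scenario: 1.0 if the expected decision survived the cutoff, else 0.0.
    return 1.0 if expected_id in retrieved_ids[:k] else 0.0

def suite_recall(scenarios: list[tuple[str, list[str]]], k: int = 3) -> float:
    return sum(recall_at_k(e, r, k) for e, r in scenarios) / len(scenarios)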

Why no embeddings?

The decision to use deterministic keyword scoring instead of embedding-based retrieval is architectural, not just pragmatic. Governance enforcement requires three properties that embedding models cannot provide:

  1. Reproducibility. Governance audits require that the same query against the same corpus returns the same decisions. Embedding models are updated, fine-tuned, and replaced — any change silently changes retrieval behavior.
  2. Explainability. When a decision is retrieved, an engineer must be able to understand why. Keyword overlap is transparent: the title matched two of the query tokens at weight 3.0, and the tags matched one at weight 2.5. Cosine similarity in a 1536-dimensional space is not.
  3. No runtime dependencies. A governance system that requires an embedding API call in the critical path adds latency, failure modes, and cost to every AI-assisted edit. Mneme's retrieval is pure Python, runs locally, and has no network dependencies.

RAG retrieves knowledge. Mneme operationalizes decisions. Embedding-based retrieval is optimized for semantic relevance — finding passages that are topically close to a query. Governance retrieval needs to find the authoritative decision that applies to the current file and task, not the most semantically similar passage. These are different optimization targets.

Scope-aware retrieval

Each decision record can carry a scope field — a glob pattern that restricts which file paths the decision applies to. When a query is associated with a specific file path (e.g. a Claude Code hook fires when editing services/payments/handler.py), the retriever applies scope patterns as a pre-filter before scoring.

This scope matching is structural, not semantic. A decision scoped to services/payments/** fires only for files matching that pattern, regardless of whether the query text is semantically related to payments. This is the correct behavior: a payments-specific constraint should not fire when the model is editing an analytics pipeline, even if the query text happens to overlap.
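
A sketch of the pre-filter using fnmatch; note that fnmatch's '*' also crosses path separators, so '**' effectively behaves recursively here, while a stricter glob engine would treat the two differently:

from fnmatch import fnmatch

def scope_filter(decisions: list[Record], file_path: str | None) -> list[Record]:
    # Structural match: unscoped decisions always survive; scoped ones must match the path.
    if file_path is None:
        return decisions
    return [d for d in decisions if d.scope is None or fnmatch(file_path, d.scope)]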

Determinism guarantee

Determinism is a non-negotiable invariant: same query, same memory file, same ranked result. It is tested at the unit level (token scorer), the integration level (full pipeline), and the suite level (benchmark scenarios). Any change to tie-breaking behavior requires a pinned test before the change ships — the test documents the new contract and prevents silent drift.

This determinism is what makes Mneme's governance auditable. Every CI gate, every hook block, every FAIL verdict traces back to a specific decision record with a specific field match. The trace is reproducible.
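
A unit-level expression of that invariant, written against the sketches above (the query string and file name are illustrative):

def test_retrieval_is_deterministic():
    records = load_memory("project_memory.json")
    pool = retrieval_pool(records)
    query = "add a handler to services/payments/handler.py"
    baseline = [d.id for _, d in top_k(query, pool)]
    for _ in range(10):
        # Same query, same memory file, same ranked result, every time.
        assert [d.id for _, d in top_k(query, pool)] == baseline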