Architectural governance for AI coding agents has a retrieval problem at its core: when an AI agent is about to generate code, the governance system needs to surface the correct decisions from a potentially large corpus in real time, before generation completes. That retrieval must be fast, correct, and — critically — reproducible.

Mneme solves this with a deterministic keyword scorer over structured decision records. No probability, no approximation, no model inference in the retrieval path. The same query against the same memory file always returns the same ranked list.

The five-stage pipeline

Every governance check in Mneme runs through five sequential stages. Understanding these stages is the prerequisite for understanding both the retrieval mechanics and the evaluation layer that uses the retrieved results.

[Fig. 1 diagram: query (file path + task) → MemoryStore (loads + parses project_memory.json) → DecisionRetriever (keyword score + tag boost, top-K=3) → ContextBuilder (formats decisions for prompt) → LLMAdapter (injects context, calls model) → Evaluator (PASS / FAIL / WEAK_RETRIEVAL / MALFORMED) → verdict + trace. Layer 1 = retrieval, Layer 2 = verdict.]
Fig. 1 — Mneme's five-stage governance pipeline. Layer 1 (retrieval) and Layer 2 (verdict) are distinct evaluation surfaces with separate metrics.

Stage 1: MemoryStore — loading the decision corpus

Every Mneme run begins by loading project_memory.json into a MemoryStore. The store deserializes the JSON into typed records. The retrieval-eligible pool is the subset of records whose type is Decision — which includes native decision items and any items migrated to decision type (typically rule and anti_pattern records).

Items of type preference, fact, and example are not eligible for retrieval without an explicit migration step — this is intentional. Retrieval pool composition is a methodology decision, not a tuning parameter.

The store also enforces ID uniqueness at load time: if two records share an ID, the first-seen record wins and the duplicate is discarded. This determinism property is tested and enforced.
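
A minimal sketch of that load path, assuming a flat "items" array and an illustrative record shape (the field and function names here are assumptions, not Mneme's actual classes):

import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    id: str
    type: str                   # "decision", "preference", "fact", "example", ...
    title: str = ""
    constraint: str = ""
    tags: tuple = ()
    content: str = ""
    scope: str | None = None    # optional glob restricting applicable file paths

def load_memory(path: str) -> dict[str, Record]:
    # Deserialize project_memory.json; on duplicate IDs, the first-seen record wins.
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    records: dict[str, Record] = {}
    for item in raw.get("items", []):    # assumed top-level "items" key
        rec = Record(**{k: v for k, v in item.items() if k in Record.__dataclass_fields__})
        records.setdefault(rec.id, rec)  # duplicates are discarded deterministically
    return records

def retrieval_pool(records: dict[str, Record]) -> list[Record]:
    # Only records of type "decision" are retrieval-eligible.
    return [r for r in records.values() if r.type == "decision"]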

Stage 2: DecisionRetriever — scoring and ranking

The retriever is the core of Mneme's approach. It answers one question: given this query string, which decisions are most relevant? It answers that question deterministically, using a weighted keyword scorer.

Tokenization

The query is first tokenized into a set of lowercase terms. The tokenizer filters out stopwords, applies a minimum length floor, strips punctuation, and optionally applies stemming. The same tokenizer is applied to each field of each candidate decision record. Tokenization is deterministic: the same input string always produces the same token set.
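
A sketch of such a tokenizer, with an illustrative stopword set and length floor (the real lists and thresholds live in configuration, and stemming is omitted here):

import re

_STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with"}  # illustrative subset
_MIN_LEN = 3                                                             # assumed length floor

def tokenize(text: str) -> set[str]:
    # Lowercase, strip punctuation, drop stopwords and very short tokens.
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {w for w in words if len(w) >= _MIN_LEN and w not in _STOPWORDS}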

Field scoring

Each decision record has several scored fields. Token overlap between the query tokens and each field's token set is computed, then multiplied by that field's weight:

# _WEIGHTS — per-field scoring coefficients
title:      3.0   # highest — title is the authoritative description
tags:       2.5   # high — curated signals, explicit match intent
constraint: 1.5   # medium — the enforced rule text
content:    1.0   # baseline — rationale / context prose

A record's raw score is the sum of weighted overlap across all fields. Tag matching also receives an additive flat boost per matching tag (on top of the weight-scaled overlap), which rewards decisions that were explicitly curated for a domain.
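
Putting the weights and the tag boost together, a per-record scoring function might look like this, reusing the Record and tokenize sketches above (the boost value is an assumption; the weights mirror the table):

_WEIGHTS = {"title": 3.0, "tags": 2.5, "constraint": 1.5, "content": 1.0}
_TAG_BOOST = 0.5   # assumed flat bonus per matching tag

def score(query_tokens: set[str], record: Record) -> float:
    total = 0.0
    for field, weight in _WEIGHTS.items():
        text = " ".join(record.tags) if field == "tags" else getattr(record, field)
        total += weight * len(query_tokens & tokenize(text))
    # Additive flat boost for every curated tag that appears in the query.
    total += _TAG_BOOST * sum(1 for t in record.tags if tokenize(t) & query_tokens)
    return total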

Top-K selection

After all candidate decisions are scored, the results are sorted descending by score. The top K=3 are returned as the retrieval output. If two records have identical scores, tie-breaking is deterministic — same order in the memory file, same tie-break result.

Why K=3? Three decisions fit within a model's active attention without crowding the task prompt. Benchmarking showed that recall@3 covers all five governed scenarios at 100% with the current decision pool. K is pinned — changing it requires a methodology ADR because it affects benchmark comparability.

The retrieval output is a ranked list of decision records with their scores and field-level match evidence. This output feeds directly into the ContextBuilder.
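
In terms of the sketches above, top-K selection reduces to a stable sort over the scored pool; Python's sort is stable, so equal scores keep memory-file order, which is what makes the tie-break deterministic:

def top_k(query: str, decisions: list[Record], k: int = 3) -> list[tuple[float, Record]]:
    q = tokenize(query)
    scored = [(score(q, d), d) for d in decisions]        # pool is in memory-file order
    scored.sort(key=lambda pair: pair[0], reverse=True)   # stable sort: ties keep file order
    return scored[:k]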

[Fig. 2 diagram: query → token set → per-field overlap (title × 3.0, tags × 2.5 + flat boost, constraint × 1.5, content × 1.0) → raw score per record → sort → top K=3.]
Fig. 2 — DecisionRetriever scoring: token overlap against each field, weighted, summed to a raw score, then top-K selection.

Stage 3: ContextBuilder — formatting for injection

The ContextBuilder takes the ranked list from the retriever and converts it into a prompt-ready string. Each retrieved decision is serialized with its ID, title, constraint text, and tags. The formatting is consistent and schema-governed — the LLM sees the same structure for every governance check, which improves constraint adherence.

The ContextBuilder does not filter or re-rank the decisions it receives. Retrieval ordering is preserved. Any post-retrieval modification would introduce non-determinism into the injection path, which is explicitly not allowed.
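
A sketch of that formatting step; the exact layout of the injected block is an assumption, but the key properties are the ones stated above: fixed structure, retrieval order preserved, no filtering:

def build_context(retrieved: list[tuple[float, Record]]) -> str:
    blocks = []
    for _score, decision in retrieved:    # retrieval order is preserved, never re-ranked
        blocks.append(
            f"[{decision.id}] {decision.title}\n"
            f"Constraint: {decision.constraint}\n"
            f"Tags: {', '.join(decision.tags)}"
        )
    return "\n\n".join(blocks)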

Stage 4: LLMAdapter — context injection

The LLMAdapter wraps the underlying model call. It takes the formatted context from the ContextBuilder, combines it with the original task prompt, and calls the configured LLM. The governance context is injected as a system-level instruction — placed before the task description, with explicit language identifying the decisions as authoritative constraints.

The adapter is model-agnostic. Any model behind an OpenAI-compatible endpoint works without changes to the retrieval or evaluation stages.
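
A sketch of that injection against an OpenAI-compatible client; the system-prompt wording is illustrative, not Mneme's actual phrasing:

from openai import OpenAI

def governed_completion(client: OpenAI, model: str, context: str, task: str) -> str:
    # Governance context goes in ahead of the task, framed as authoritative constraints.
    system = (
        "The following architectural decisions are authoritative constraints. "
        "Any code you produce must comply with all of them.\n\n" + context
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content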

Stage 5: Evaluator — the verdict layer

The Evaluator inspects the model's output against the injected decisions and emits a structured verdict. This is Layer 2 evaluation — separate from the retrieval metrics computed at Layer 1. The five possible verdicts are:

Verdict          Meaning
PASS             Output respects all injected constraints
FAIL             Output violates one or more constraints
WEAK             Partial compliance or soft violation
WEAK_RETRIEVAL   Expected decision not in top-K — Layer 1 failure
MALFORMED        Output could not be parsed against the schema

WEAK_RETRIEVAL is the critical verdict for understanding retrieval quality. It fires when the evaluator expected a specific decision ID to be present in the injected context, but that ID was not in the top-K retrieved results. This means the model was never given the constraint it was supposed to enforce — a retrieval failure that causes an enforcement gap.

WEAK_RETRIEVAL is a Layer 1 failure surfaced at Layer 2. The evaluator sees that recall dropped — the expected decision ID wasn't retrieved — and downgrades the scenario from PASS. A zero WEAK_RETRIEVAL count across the benchmark suite is a merge-gate requirement.
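
In code terms, the check is a set-membership test against the retrieved IDs, run before any grading of the model output (a sketch; the real evaluator does more):

def check_retrieval(expected_id: str, retrieved: list[tuple[float, Record]]) -> str | None:
    # A Layer 1 failure surfaced as a Layer 2 verdict: the constraint never reached the model.
    retrieved_ids = {d.id for _, d in retrieved}
    return None if expected_id in retrieved_ids else "WEAK_RETRIEVAL"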

Layer 1 and Layer 2 metrics

The pipeline produces two distinct sets of metrics, each with different authority:

Metric                       Layer     Role
recall@K per scenario        Layer 1   Regression guard — drop below 1.0 blocks merge
recall@1 per scenario        Layer 1   Sharpest tuning signal — rank-1 precision
precision@K (suite mean)     Layer 1   Advisory — fixture-shape constrained at K=3
irrelevant_injection_rate    Layer 1   Advisory — mechanically 1.0 given current fixture shape
pass_rate                    Layer 2   Authoritative merge gate — must remain 1.00
WEAK_RETRIEVAL count         Layer 2   Authoritative — must remain 0

Recall@K is the primary regression guard because it directly measures whether the expected decision survived the K=3 cutoff. At the current benchmark suite shape (one expected ID per governed scenario), recall@K can only be 0.0 or 1.0 — it has no headroom upward and can only regress. That makes it an ideal invariant.
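
Concretely, with one expected ID per governed scenario, per-scenario recall@K collapses to a membership test, and the suite-level value is the mean over scenarios (a sketch under that fixture shape):

def recall_at_k(expected_id: str, retrieved_ids: list[str], k: int = 3) -> float:
    # Binary per scenario: 1.0 if the expected decision survived the cutoff, else 0.0.
    return 1.0 if expected_id in retrieved_ids[:k] else 0.0

def suite_recall(scenarios: list[tuple[str, list[str]]], k: int = 3) -> float:
    return sum(recall_at_k(e, r, k) for e, r in scenarios) / len(scenarios)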

Why no embeddings?

The decision to use deterministic keyword scoring instead of embedding-based retrieval is architectural, not just pragmatic. Governance enforcement requires three properties that embedding models cannot provide:

  1. Reproducibility. Governance audits require that the same query against the same corpus returns the same decisions. Embedding models are updated, fine-tuned, and replaced — any change silently changes retrieval behavior.
  2. Explainability. When a decision is retrieved, an engineer must be able to understand why. Keyword overlap is transparent: the title matched two of the query tokens at weight 3.0, and the tags matched one at weight 2.5. Cosine similarity in a 1536-dimensional space is not.
  3. No runtime dependencies. A governance system that requires an embedding API call in the critical path adds latency, failure modes, and cost to every AI-assisted edit. Mneme's retrieval is pure Python, runs locally, and has no network dependencies.

RAG retrieves knowledge. Mneme operationalizes decisions. Embedding-based retrieval is optimized for semantic relevance — finding passages that are topically close to a query. Governance retrieval needs to find the authoritative decision that applies to the current file and task, not the most semantically similar passage. These are different optimization targets.

Scope-aware retrieval

Each decision record can carry a scope field — a glob pattern that restricts which file paths the decision applies to. When a query is associated with a specific file path (e.g. a Claude Code hook fires when editing services/payments/handler.py), the retriever applies scope patterns as a pre-filter before scoring.

This scope matching is structural, not semantic. A decision scoped to services/payments/** fires only for files matching that pattern, regardless of whether the query text is semantically related to payments. This is the correct behavior: a payments-specific constraint should not fire when the model is editing an analytics pipeline, even if the query text happens to overlap.
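
A sketch of the pre-filter using fnmatch; note that fnmatch's '*' also crosses path separators, so '**' effectively behaves recursively here, while a stricter glob engine would treat the two differently:

from fnmatch import fnmatch

def scope_filter(decisions: list[Record], file_path: str | None) -> list[Record]:
    # Structural match: unscoped decisions always survive; scoped ones must match the path.
    if file_path is None:
        return decisions
    return [d for d in decisions if d.scope is None or fnmatch(file_path, d.scope)]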

Determinism guarantee

Determinism is a non-negotiable invariant: same query, same memory file, same ranked result. It is tested at the unit level (token scorer), the integration level (full pipeline), and the suite level (benchmark scenarios). Any change to tie-breaking behavior requires a pinned test before the change ships — the test documents the new contract and prevents silent drift.

This determinism is what makes Mneme's governance auditable. Every CI gate, every hook block, every FAIL verdict traces back to a specific decision record with a specific field match. The trace is reproducible.
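
A unit-level expression of that invariant, written against the sketches above (the query string and file name are illustrative):

def test_retrieval_is_deterministic():
    records = load_memory("project_memory.json")
    pool = retrieval_pool(records)
    query = "add a handler to services/payments/handler.py"
    baseline = [d.id for _, d in top_k(query, pool)]
    for _ in range(10):
        # Same query, same memory file, same ranked result, every time.
        assert [d.id for _, d in top_k(query, pool)] == baseline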