The AI tooling market currently has a strong gravitational pull toward observability. Traces. Dashboards. Drift surfaces. Evaluation pipelines. Agent behavior analytics. Each layer is independently useful — and the ecosystem has moved there first because instrumentation is easier to build than enforcement, and easier to sell than restraint.
But the move has produced a quiet category error. Teams investing heavily in agent observability are interpreting that investment as governance. It is not. Visibility into agent behavior is necessary, but it is not the layer that determines what the agent is allowed to do. Observability is downstream of governance.
The rise of AI observability
The reason the observability layer matured first is structural, not strategic. When AI coding agents began producing meaningful code volume, the immediate need was to understand what they were doing. Traces became the unit of debugging. Token streams became the unit of cost accounting. Drift dashboards became the unit of quality reporting. Evaluation harnesses became the unit of model selection.
This made sense. You cannot improve what you cannot see. Observability tooling — LangSmith, Braintrust, Arize, Phoenix, agent tracing platforms — gave teams a way to inspect the previously opaque behavior of LLM-based agents. The investment was justified, and the category will continue to grow.
What did not happen, in most teams, is the parallel investment in the layer that decides what the agent is allowed to produce in the first place. The result is a stack with a strong observation layer and a missing constraint layer — a microscope without a gate.
Observability is reactive by design
This is the structural argument the rest of the post rests on.
Observability begins after the behavior already occurred. — the core property
That is not a flaw of observability. It is the definition. To observe a system is to record what it did. The observation comes after the action; that ordering is irreducible. A telemetry pipeline can be fast, low-latency, real-time — none of those properties change the temporal ordering. The behavior happened, then it was observed.
For systems where the throughput is human-paced, that ordering is fine. The behavior happens slowly enough that the observation produces signal in time to act. A human engineer writes a function; review observes it; the team makes a decision. The loop closes within hours.
For systems where the throughput is agent-paced, the same ordering fails. An agent writes thousands of lines while the observation pipeline is still processing the first batch. The signal is real; the latency between signal and action is what changes. By the time the team has surfaced a pattern of violations from the dashboard, the codebase has accumulated dozens of additional instances of the same pattern.
Observability does not introduce reactivity. It exposes it. The reactivity was already there — observability is just what makes it legible. The question that follows is whether the layer in front of observability can act before the behavior occurs.
The scaling problem
The reactive ordering is tolerable when the violation rate is low. It is not tolerable when the violation rate is high — and that is exactly the regime agentic development creates.
Three forces compound:
- Generation rate — one engineer with an AI agent produces 10–100x the code of a non-AI baseline. Drift accumulates at the same multiplier.
- Session multiplicity — each cold-started session, each agent runtime, each developer's local workflow produces independent drift. The aggregate drift rate scales with the team's effective parallelism.
- Review capacity — the human layer that observability ultimately feeds remains linear with team size. It does not scale with the agent layer.
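The three rates can be made concrete with back-of-envelope arithmetic. A minimal sketch, with every number illustrative rather than measured: generation scaled by the agent multiplier, review capacity fixed, and the backlog as the difference.

```python
# Back-of-envelope model of drift backlog growth. All numbers are
# illustrative assumptions, not measurements.

ENGINEERS = 10
AGENT_MULTIPLIER = 20        # code volume vs. a non-AI baseline (10-100x range)
VIOLATIONS_PER_KLOC = 2      # architectural violations per 1,000 lines
BASELINE_KLOC_PER_WEEK = 1   # per engineer, without agents
REVIEW_CAPACITY = 40         # violations the human review layer resolves per week

generated_kloc = ENGINEERS * BASELINE_KLOC_PER_WEEK * AGENT_MULTIPLIER
incoming = generated_kloc * VIOLATIONS_PER_KLOC  # violations entering the queue weekly

backlog = 0
for week in range(1, 5):
    resolved = min(REVIEW_CAPACITY, backlog + incoming)
    backlog += incoming - resolved
    print(f"week {week}: +{incoming} in, {resolved} out, backlog={backlog}")
```

Under these assumptions the review layer clears 40 violations a week while 400 arrive, so the backlog grows by 360 per week regardless of how well it is visualized.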
When all three operate at once, the observability dashboard becomes a backlog generator. Each row on the drift surface represents a violation that has already been proposed, possibly merged, and is now waiting for someone to act on it. The dashboard grows faster than it shrinks. It becomes evidence of the problem rather than a tool for resolving it.
This is not a tooling failure. It is an architectural one. You cannot fix a queue-depth problem by improving the visualization of the queue. You fix it by reducing the rate at which the queue fills — which means acting upstream of the observation point.
Visibility without enforcement
The clearest way to state the structural gap is to imagine a team with perfect observability and no governance layer.
This team has, by hypothesis, complete visibility into agent behavior. Every tool call is traced. Every diff is scored against architectural rules. Every drift event surfaces on a dashboard within seconds. The team can name, count, and graph every category of violation in their codebase. They can correlate violations with agents, with sessions, with file paths, with developers. The observability stack is maximal.
That team can still ship architectural drift continuously. The agent proposes the violation; the observability layer records the violation; the violation merges or is reverted; the next session proposes the same violation again. The dashboard reports a stable drift rate — not because nothing is being done about it, but because the only thing being done is observing.
You can perfectly observe drift, and continuously ship drift, at the same time. — the failure mode
This is the failure mode that observability-as-governance produces. The team feels in control because they can see everything. The codebase, meanwhile, accumulates exactly the violations the dashboard is reporting. Visibility is not control. Observability without enforcement becomes operational archaeology — a precise catalog of what already happened, with no mechanism to change what happens next.
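The thought experiment fits in a few lines. A minimal sketch, with the session model and violation rate invented purely for illustration: the observer records every event instantly and completely, nothing gates the merge, and the two counts are identical by construction.

```python
import random

random.seed(0)

def agent_session():
    """Each session proposes 20 diffs; a fixed fraction violate a rule.
    The 30% rate is an arbitrary illustrative assumption."""
    return [{"violates": random.random() < 0.3} for _ in range(20)]

telemetry = []  # perfect observability: every event is recorded
shipped = []    # what actually lands in the codebase

for _ in range(50):  # fifty agent sessions
    for diff in agent_session():
        telemetry.append(diff)  # observed, with zero latency and zero loss
        shipped.append(diff)    # ...and merged anyway: no gate in the path

observed_violations = sum(d["violates"] for d in telemetry)
shipped_violations = sum(d["violates"] for d in shipped)
assert observed_violations == shipped_violations  # seeing it changed nothing
print(f"observed {observed_violations} violations, shipped {shipped_violations}")
```

The assertion is the point: with no enforcement step between observation and merge, the observed count and the shipped count cannot differ.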
Governance before generation
The corrective layer sits in front of the agent, not behind it. Architectural decisions, compiled into a typed corpus, are queried by the agent — or imposed on the agent's tool use — at the moment it is about to write code. The agent's proposal is shaped by the constraint before it becomes output that observability would have to surface.
This layer has specific properties that distinguish it from telemetry:
- Pre-generation — it fires at the agent's pre-tool-use hook or in the IDE rules layer, not at the merge gate or after.
- Deterministic — the same query against the same corpus returns the same verdict. The constraint is not a probabilistic suggestion that the agent may or may not follow.
- Propagated — every agent the team uses queries the same compiled corpus. Claude Code, Cursor, Copilot, and CI all reach the same verdict against the same decision.
- Citable — every enforcement event traces back to the authoring ADR. Engineers see which decision blocked them, written by whom, on which date.
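A sketch of what a check with these four properties might look like. Everything here is hypothetical: the `Decision` dataclass, the corpus contents, and the `pre_tool_use_check` function are illustrative stand-ins, not Mneme's API or any real agent hook interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    """One compiled architectural decision, traceable to its authoring ADR."""
    adr_id: str
    author: str
    date: str
    forbidden_import: str  # simplified constraint: "code may not import X"

# The compiled corpus: queried identically by every agent and by CI,
# so Claude Code, Cursor, Copilot, and the pipeline reach the same verdict.
CORPUS = [
    Decision("ADR-014", "j.doe", "2024-03-02", "legacy.billing"),
]

def pre_tool_use_check(proposed_code: str) -> Optional[Decision]:
    """Deterministic verdict: same code + same corpus => same result.

    Fires before the agent's write lands, at the pre-tool-use hook.
    Returns the blocking Decision (citable back to its ADR), or None.
    """
    for decision in CORPUS:
        if f"import {decision.forbidden_import}" in proposed_code:
            return decision
    return None

blocked = pre_tool_use_check("import legacy.billing\n")
if blocked:
    print(f"blocked by {blocked.adr_id} ({blocked.author}, {blocked.date})")
```

The enforcement event carries its own citation, so the engineer sees which decision blocked them rather than an opaque refusal.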
Done correctly, the governance layer reduces the rate at which the observability layer has to surface violations. Drift telemetry becomes meaningful again — it surfaces the edge cases that the constraint layer did not catch, rather than reporting an endless stream of preventable proposals. The two layers work in series, not in substitution.
Observability is operational memory. Governance is operational control. A team with both has a feedback loop. A team with only observability has a record. A team with only governance has a gate but no learning. The point is not to choose — it is to put them in the right order.
The four layers of the AI coding stack
The cleanest way to position this is as a layered model. Each layer has a distinct function. None of them are substitutes for each other.
| Layer | Function |
|---|---|
| Governance | Prevents or constrains violations before generation |
| Verification | Confirms architectural invariants hold against proposed output |
| Observability | Detects and explains violations that reached the codebase |
| Review | Human remediation and escalation for the edge cases |
Read top to bottom, each layer handles the residual of the layer above it. Governance prevents the bulk of violations mechanically; verification confirms that the proposals which do get generated still satisfy the invariants; observability surfaces what slipped through; review escalates the small minority of those that need human judgment.
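The residual cascade can be sketched as filters in series. The catch rates below are illustrative assumptions, not measurements; the point is that the volume reaching each layer is set by the layers above it.

```python
# Layers in series: each catches a fraction of what reaches it.
# Catch rates are illustrative assumptions only.
LAYERS = [
    ("governance",    0.90),  # prevented or constrained before generation
    ("verification",  0.70),  # confirmed against proposed output
    ("observability", 0.80),  # detected after reaching the codebase
    ("review",        0.95),  # human judgment on the remainder
]

violations = 1000.0  # attempted violations per period
for name, catch_rate in LAYERS:
    caught = violations * catch_rate
    violations -= caught
    print(f"{name:14s} handles {caught:6.1f}, passes {violations:6.1f} downstream")
```

Deleting the top row and re-running the loop multiplies the volume hitting every remaining layer by 10x under these rates, which is the cost a missing governance layer imposes on everything below it.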
Read bottom to top, the failure pattern of most current AI coding stacks becomes obvious: they have invested in observability and review and treated those as governance. The top two layers are missing. The lower layers can do their jobs, but the team is paying the full cost of every violation that the missing upstream layers should have prevented.
The future stack
The same layering recast as a forward-looking architecture clarifies how the categories will likely split as the ecosystem matures.
| Layer | Role |
|---|---|
| Generation | Produce code (Claude Code, Cursor, Copilot, JetBrains AI) |
| Governance | Constrain behavior at the moment of intent (Mneme) |
| Observability | Explain outcomes — traces, drift, evaluations (LangSmith, Braintrust, Arize) |
| Review | Escalate edge cases (CodeRabbit, human reviewers) |
The teams currently building the observability layer are doing important work. The category is real. The argument is not that observability tools are unnecessary — it is that they are most valuable when paired with a governance layer that prevents the noise they would otherwise have to surface. The observability signal becomes sharper when most of the routine violations have already been suppressed upstream. The drift dashboard becomes a list of the genuinely interesting cases, not a backlog of preventable patterns.
The complete worldview
Mneme's position is that AI coding requires a governance layer built with the same engineering rigor the observability layer already has. The conceptual framework is consistent across four category boundaries:
- Memory is not governance — recall is not constraint. Remembering a decision does not enforce it.
- Prompt engineering is not governance — nudging the agent is not constraining the agent. Prompts shape one generation; governance shapes every generation.
- Review is not governance — catching the violation is not preventing it. Review samples output; governance shapes input.
- Observability is not governance — seeing the violation is not stopping it. Visibility is downstream of control.
Each is a category boundary the ecosystem will eventually settle. The team that wires all four layers in the right order — generation, governance, observability, review — runs a stack where each layer does exactly the work it is suited for and nothing else. That is the configuration that scales.
Everything else is, in the long run, operational archaeology.