The AI tooling market currently has a strong gravitational pull toward observability. Traces. Dashboards. Drift surfaces. Evaluation pipelines. Agent behavior analytics. Each layer is independently useful — and the ecosystem has moved there first because instrumentation is easier to build than enforcement, and easier to sell than restraint.
But the move has produced a quiet category error. Teams investing heavily in agent observability are interpreting that investment as governance. It is not. Visibility into agent behavior is necessary, but it is not the layer that determines what the agent is allowed to do. Observability is downstream of governance.
The rise of AI observability
The reason the observability layer matured first is structural, not strategic. When AI coding agents began producing meaningful code volume, the immediate need was to understand what they were doing. Traces became the unit of debugging. Token streams became the unit of cost accounting. Drift dashboards became the unit of quality reporting. Evaluation harnesses became the unit of model selection.
This made sense. You cannot improve what you cannot see. Observability tooling — LangSmith, Braintrust, Arize, Phoenix, agent tracing platforms — gave teams a way to inspect the previously opaque behavior of LLM-based agents. The investment was justified, and the category will continue to grow.
What did not happen, in most teams, is the parallel investment in the layer that decides what the agent is allowed to produce in the first place. The result is a stack with a strong observation layer and a missing constraint layer — a microscope without a gate.
Observability is reactive by design
This is the structural argument the rest of the post rests on.
Observability begins after the behavior already occurred. — the core property
That is not a flaw of observability. It is the definition. To observe a system is to record what it did. The observation comes after the action; that ordering is irreducible. A telemetry pipeline can be fast, low-latency, real-time — none of those properties change the temporal ordering. The behavior happened, then it was observed.
For systems where the throughput is human-paced, that ordering is fine. The behavior happens slowly enough that the observation produces signal in time to act. A human engineer writes a function; review observes it; the team makes a decision. The loop closes within hours.
For systems where the throughput is agent-paced, the same ordering fails. An agent writes thousands of lines while the observation pipeline is still processing the first batch. The signal is real; the latency between signal and action is what changes. By the time the team has surfaced a pattern of violations from the dashboard, the codebase has accumulated dozens of additional instances of the same pattern.
Observability does not introduce reactivity. It exposes it. The reactivity was already there — observability is just what makes it legible. The question that follows is whether the layer in front of observability can act before the behavior occurs.
The scaling problem
The reactive ordering is tolerable when the violation rate is low. It is not tolerable when the violation rate is high — and that is exactly the regime agentic development creates.
Three forces compound:
- Generation rate — one engineer with an AI agent produces 10–100x the code of a non-AI baseline. Drift accumulates at the same multiplier.
- Session multiplicity — each cold-started session, each agent runtime, each developer's local workflow produces independent drift. The aggregate drift rate scales with the team's effective parallelism.
- Review capacity — the human layer that observability ultimately feeds remains linear with team size. It does not scale with the agent layer.
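The three rates can be made concrete with back-of-envelope arithmetic. A minimal sketch, with every number illustrative rather than measured: generation scaled by the agent multiplier, review capacity fixed, and the backlog as the difference.

```python
# Back-of-envelope model of drift backlog growth. All numbers are
# illustrative assumptions, not measurements.

ENGINEERS = 10
AGENT_MULTIPLIER = 20        # code volume vs. a non-AI baseline (10-100x range)
VIOLATIONS_PER_KLOC = 2      # architectural violations per 1,000 lines
BASELINE_KLOC_PER_WEEK = 1   # per engineer, without agents
REVIEW_CAPACITY = 40         # violations the human review layer resolves per week

generated_kloc = ENGINEERS * BASELINE_KLOC_PER_WEEK * AGENT_MULTIPLIER
incoming = generated_kloc * VIOLATIONS_PER_KLOC  # violations entering the queue weekly

backlog = 0
for week in range(1, 5):
    resolved = min(REVIEW_CAPACITY, backlog + incoming)
    backlog += incoming - resolved
    print(f"week {week}: +{incoming} in, {resolved} out, backlog={backlog}")
```

Under these assumptions the review layer clears 40 violations a week while 400 arrive, so the backlog grows by 360 per week regardless of how well it is visualized.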
When all three operate at once, the observability dashboard becomes a backlog generator. Each row on the drift surface represents a violation that has already been proposed, possibly merged, and is now waiting for someone to act on it. The dashboard grows faster than it shrinks. It becomes evidence of the problem rather than a tool for resolving it.
This is not a tooling failure. It is an architectural one. You cannot fix a queue-depth problem by improving the visualization of the queue. You fix it by reducing the rate at which the queue fills — which means acting upstream of the observation point.
Visibility without enforcement
The clearest way to state the structural gap is to imagine a team with perfect observability and no governance layer.
This team has, by hypothesis, complete visibility into agent behavior. Every tool call is traced. Every diff is scored against architectural rules. Every drift event surfaces on a dashboard within seconds. The team can name, count, and graph every category of violation in their codebase. They can correlate violations with agents, with sessions, with file paths, with developers. The observability stack is maximal.
That team can still ship architectural drift continuously. The agent proposes the violation; the observability layer records the violation; the violation merges or is reverted; the next session proposes the same violation again. The dashboard reports a stable drift rate — not because nothing is being done about it, but because the only thing being done is observing.
You can perfectly observe drift, and continuously ship drift, at the same time. — the failure mode
This is the failure mode that observability-as-governance produces. The team feels in control because they can see everything. The codebase, meanwhile, accumulates exactly the violations the dashboard is reporting. Visibility is not control. Observability without enforcement becomes operational archaeology — a precise catalog of what already happened, with no mechanism to change what happens next.
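The thought experiment fits in a few lines. A minimal sketch, with the session model and violation rate invented purely for illustration: the observer records every event instantly and completely, nothing gates the merge, and the two counts are identical by construction.

```python
import random

random.seed(0)

def agent_session():
    """Each session proposes 20 diffs; a fixed fraction violate a rule.
    The 30% rate is an arbitrary illustrative assumption."""
    return [{"violates": random.random() < 0.3} for _ in range(20)]

telemetry = []  # perfect observability: every event is recorded
shipped = []    # what actually lands in the codebase

for _ in range(50):  # fifty agent sessions
    for diff in agent_session():
        telemetry.append(diff)  # observed, with zero latency and zero loss
        shipped.append(diff)    # ...and merged anyway: no gate in the path

observed_violations = sum(d["violates"] for d in telemetry)
shipped_violations = sum(d["violates"] for d in shipped)
assert observed_violations == shipped_violations  # seeing it changed nothing
print(f"observed {observed_violations} violations, shipped {shipped_violations}")
```

The assertion is the point: with no enforcement step between observation and merge, the observed count and the shipped count cannot differ.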
Governance before generation
The corrective layer sits in front of the agent, not behind it. Architectural decisions, compiled into a typed corpus, are queried by the agent — or imposed on the agent's tool use — at the moment it is about to write code. The agent's proposal is shaped by the constraint before it becomes output that observability would have to surface.
This layer has specific properties that distinguish it from telemetry:
- Pre-generation — it fires at the agent's pre-tool-use hook or in the IDE rules layer, not at the merge gate or after.
- Deterministic — the same query against the same corpus returns the same verdict. The constraint is not a probabilistic suggestion that the agent may or may not follow.
- Propagated — every agent the team uses queries the same compiled corpus. Claude Code, Cursor, Copilot, and CI all reach the same verdict against the same decision.
- Citable — every enforcement event traces back to the authoring ADR. Engineers see which decision blocked them, written by whom, on which date.
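A sketch of what a check with these four properties might look like. Everything here is hypothetical: the `Decision` dataclass, the corpus contents, and the `pre_tool_use_check` function are illustrative stand-ins, not Mneme's API or any real agent hook interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    """One compiled architectural decision, traceable to its authoring ADR."""
    adr_id: str
    author: str
    date: str
    forbidden_import: str  # simplified constraint: "code may not import X"

# The compiled corpus: queried identically by every agent and by CI,
# so Claude Code, Cursor, Copilot, and the pipeline reach the same verdict.
CORPUS = [
    Decision("ADR-014", "j.doe", "2024-03-02", "legacy.billing"),
]

def pre_tool_use_check(proposed_code: str) -> Optional[Decision]:
    """Deterministic verdict: same code + same corpus => same result.

    Fires before the agent's write lands, at the pre-tool-use hook.
    Returns the blocking Decision (citable back to its ADR), or None.
    """
    for decision in CORPUS:
        if f"import {decision.forbidden_import}" in proposed_code:
            return decision
    return None

blocked = pre_tool_use_check("import legacy.billing\n")
if blocked:
    print(f"blocked by {blocked.adr_id} ({blocked.author}, {blocked.date})")
```

The enforcement event carries its own citation, so the engineer sees which decision blocked them rather than an opaque refusal.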
Done correctly, the governance layer reduces the rate at which the observability layer has to surface violations. Drift telemetry becomes meaningful again — it surfaces the edge cases that the constraint layer did not catch, rather than reporting an endless stream of preventable proposals. The two layers work in series, not in substitution.
Observability is operational memory. Governance is operational control. A team with both has a feedback loop. A team with only observability has a record. A team with only governance has a gate but no learning. The point is not to choose — it is to put them in the right order.
The four layers of the AI coding stack
The cleanest way to position this is as a layered model. Each layer has a distinct function. None of them are substitutes for each other.
| Layer | Function |
|---|---|
| Governance | Prevents or constrains violations before generation |
| Verification | Confirms architectural invariants hold against proposed output |
| Observability | Detects and explains violations that reached the codebase |
| Review | Human remediation and escalation for the edge cases |
Read top to bottom, each layer handles the residual of the layer above it. Governance prevents the bulk of violations mechanically; verification confirms that the proposals which do get generated still satisfy the invariants; observability surfaces what slipped through; review escalates the small minority of those that need human judgment.
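The residual cascade can be sketched as filters in series. The catch rates below are illustrative assumptions, not measurements; the point is that the volume reaching each layer is set by the layers above it.

```python
# Layers in series: each catches a fraction of what reaches it.
# Catch rates are illustrative assumptions only.
LAYERS = [
    ("governance",    0.90),  # prevented or constrained before generation
    ("verification",  0.70),  # confirmed against proposed output
    ("observability", 0.80),  # detected after reaching the codebase
    ("review",        0.95),  # human judgment on the remainder
]

violations = 1000.0  # attempted violations per period
for name, catch_rate in LAYERS:
    caught = violations * catch_rate
    violations -= caught
    print(f"{name:14s} handles {caught:6.1f}, passes {violations:6.1f} downstream")
```

Deleting the top row and re-running the loop multiplies the volume hitting every remaining layer by 10x under these rates, which is the cost a missing governance layer imposes on everything below it.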
Read bottom to top, the failure pattern of most current AI coding stacks becomes obvious: they have invested in observability and review and treated those as governance. The top two layers are missing. The lower layers can do their jobs, but the team is paying the full cost of every violation that the missing upstream layers should have prevented.
The future stack
The same layering recast as a forward-looking architecture clarifies how the categories will likely split as the ecosystem matures.
| Layer | Role |
|---|---|
| Generation | Produce code (Claude Code, Cursor, Copilot, JetBrains AI) |
| Governance | Constrain behavior at the moment of intent (Mneme) |
| Observability | Explain outcomes — traces, drift, evaluations (LangSmith, Braintrust, Arize) |
| Review | Escalate edge cases (CodeRabbit, human reviewers) |
The teams currently building the observability layer are doing important work. The category is real. The argument is not that observability tools are unnecessary — it is that they are most valuable when paired with a governance layer that prevents the noise they would otherwise have to surface. The observability signal becomes sharper when most of the routine violations have already been suppressed upstream. The drift dashboard becomes a list of the genuinely interesting cases, not a backlog of preventable patterns.
The complete worldview
Mneme's position is that AI coding requires a governance layer built with the same engineering rigor the observability layer already has. The conceptual framework is consistent across four category boundaries:
- Memory is not governance — recall is not constraint. Remembering a decision does not enforce it.
- Prompt engineering is not governance — nudging the agent is not constraining the agent. Prompts shape one generation; governance shapes every generation.
- Review is not governance — catching the violation is not preventing it. Review samples output; governance shapes input.
- Observability is not governance — seeing the violation is not stopping it. Visibility is downstream of control.
Each is a category boundary the ecosystem will eventually settle. The team that wires all four layers in the right order — generation, governance, observability, review — runs a stack where each layer does exactly the work it is suited for and nothing else. That is the configuration that scales.
Everything else is, in the long run, operational archaeology.