The KPI confusion
Many teams now measure autonomous development with numbers that were already controversial before AI existed. Tokens consumed. Pull requests opened. Lines of code generated. Agent tasks completed. Percentage of a backlog closed by a coding agent. These are the dashboards going up in engineering reviews this quarter, and they share one property: none of them proves the engineering system actually got better.
This is an old confusion wearing new clothes. The industry spent a decade learning that activity is not productivity — that counting commits and lines rewards motion, not outcomes. Agentic tooling did not resolve that lesson. It amplified it. When a machine can open forty pull requests in an afternoon, activity metrics stop being merely unhelpful and start being actively misleading, because the volume looks like progress.
AI has made engineering activity easier to measure while making engineering outcomes harder to reason about. The instrumentation got cheaper at exactly the moment the thing worth instrumenting got more complicated.
The right response is not to invent a fifth activity counter. It is to be precise about what each layer of measurement can and cannot tell you — and to notice where the existing layers go silent.
Why DORA still matters
DORA (DevOps Research and Assessment) is the research program behind the annual State of DevOps reports and the 2018 book Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim; the program is now part of Google Cloud. It got the hardest part right: it refused to measure individual developers and instead measured the delivery system they work inside.
The four key DORA metrics are well known:
- Deployment frequency — how often the team ships to production (throughput).
- Lead time for changes — time from commit to running in production (throughput).
- Change failure rate — the share of deployments that cause a failure (stability).
- Time to restore service — how fast the team recovers when one does (stability).
DORA added a fifth metric, reliability, in 2021, and has since refined its own terminology: time to restore service is now framed as failed deployment recovery time. The classic four remain the set most engineering leaders recognize.
The pairing is the genius of the framework. Throughput and stability are tracked together so that teams are not rewarded for trading one against the other. DORA’s repeatedly validated finding is that speed and stability are not a tradeoff: elite performers score well on both at once, while low performers score poorly on both. Measuring them as a pair guards against shipping faster by quietly accepting more breakage.
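To make the pairing concrete, here is a minimal sketch of computing the four keys together from deployment records. The `Deployment` record shape, the field names, and the use of medians are assumptions made for illustration; they are not DORA's official definitions or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_four_keys(deployments: list[Deployment], window_days: int) -> dict:
    """Compute throughput and stability side by side for one time window."""
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        # Throughput
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        # Stability
        "change_failure_rate": len(failures) / len(deployments) if deployments else None,
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data only: four deployments over a two-day window, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
report = dora_four_keys(deploys, window_days=2)
print(report)
```

Returning all four values from one function mirrors the framework's design choice: a throughput number is never reported without its stability counterpart.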
None of this is wrong, and the argument here is not against DORA. The point is narrower and structural. DORA measures the behavior of a delivery system that assumes humans remain the primary coordination layer — the people who decide what to build, review what gets built, and hold the architecture in their heads. That assumption is exactly what agentic development erodes.
What changes under agentic development
Agentic development is software engineering in which autonomous AI agents, not humans, generate most of the change. The metrics that matter most for it are the ones that still carry meaning after that shift.
When agents become the execution layer, three things shift at once. Each one moves load onto a part of the system that DORA does not observe.
Human review stops scaling linearly
Generation throughput and review throughput were always coupled by the same constraint: a human had to read the change. Agents break that coupling. Generation scales with compute; review scales with attention, and attention does not get cheaper. As the volume of machine-authored change rises, review becomes the binding constraint, and the temptation is to relax it — to approve faster, sample instead of read, trust the model. What slips through is not usually a broken build. It is a change that compiles, passes tests, and quietly violates an architectural decision the reviewer no longer has time to check. This is how intent debt accumulates: the gap between what the system is supposed to preserve and what its agents are actually constrained to follow.
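The decoupling can be shown with simple arithmetic. The sketch below assumes constant daily rates, and the numbers are purely illustrative; `review_backlog` is a hypothetical helper, not a measurement of any real team.

```python
def review_backlog(generated_per_day: int, reviewed_per_day: int, days: int) -> list[int]:
    """Unreviewed changes at the end of each day, assuming constant daily rates."""
    backlog = 0
    history = []
    for _ in range(days):
        # The backlog grows by whatever generation exceeds review capacity.
        backlog = max(0, backlog + generated_per_day - reviewed_per_day)
        history.append(backlog)
    return history


# Illustrative numbers: agents open 40 changes a day, reviewers can read 15.
print(review_backlog(40, 15, 5))  # [25, 50, 75, 100, 125]

# Doubling compute doubles generation but leaves reviewer attention unchanged.
print(review_backlog(80, 15, 5))  # [65, 130, 195, 260, 325]
```

Under these assumptions the queue never drains, which is the pressure that makes approving faster or sampling instead of reading so tempting.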
The bottleneck shifts from implementation to validation
For most of software’s history, writing the code was the expensive step. Agentic tooling inverts that. The cost of implementation approaches zero, and the cost migrates to validation: proving that a generated change is correct not just locally but against the system’s real constraints. Without a way to express those constraints so a machine can check them, validation falls back onto humans, and the cheap step floods the expensive one. Verification contracts, machine-checkable statements of what a change must satisfy, are the form validation has to take when the volume of change outruns the people reviewing it.
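One possible shape for a verification contract is a named set of predicates evaluated against a description of the change. This is a sketch under stated assumptions: the `Change` and `VerificationContract` types, the check names, and the file paths are all hypothetical, and a real contract would need a richer view of the change than this.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Change:
    """Minimal description of a proposed change (hypothetical shape)."""
    files_touched: tuple[str, ...]
    added_dependencies: tuple[str, ...] = ()
    tests_added: int = 0


@dataclass
class VerificationContract:
    """A named set of machine-checkable conditions a change must satisfy."""
    name: str
    checks: dict[str, Callable[[Change], bool]] = field(default_factory=dict)

    def verify(self, change: Change) -> list[str]:
        """Return the labels of failed checks; an empty list means the change conforms."""
        return [label for label, check in self.checks.items() if not check(change)]


# Hypothetical contract for one module; the rules are examples, not recommendations.
contract = VerificationContract(
    name="payments-module",
    checks={
        "no new third-party dependencies": lambda c: not c.added_dependencies,
        "schema files are not edited directly": lambda c: not any(
            f.startswith("db/schema/") for f in c.files_touched),
        "behavioral changes come with tests": lambda c: c.tests_added > 0,
    },
)

agent_change = Change(
    files_touched=("payments/refund.py", "db/schema/refunds.sql"),
    tests_added=2,
)
failed = contract.verify(agent_change)
print(failed)  # ['schema files are not edited directly']
```

The change in this example would compile and could pass its tests; the contract fails it on a constraint that neither the compiler nor the test suite knows about.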
Local correctness diverges from system correctness
An agent optimizes for the task in front of it. The function works, the test passes, the ticket closes. But a change that is locally correct can be globally corrosive: it introduces a second way to do something the codebase already does one way, reaches across a boundary it should respect, or contradicts a decision made three quarters ago in an architecture decision record (ADR) no one re-read. Repeated across many agents and many PRs, these locally reasonable choices compound. That compounding is architectural drift: the divergence of generated code from the architectural decisions it was supposed to honor, as distinct from model or data drift. And because each violation makes the next one look normal, drift propagates. Governance propagation is the dynamic by which an unenforced decision decays a little further with every change built on top of it.
The hidden failure mode: DORA metrics can improve while architecture degrades
Here is the failure mode that should worry anyone running an agentic delivery pipeline. Every DORA metric can move in the right direction while the system gets structurally worse.
Deployment frequency rises, because agents ship constantly. Lead time for changes falls, because implementation is no longer the bottleneck. Change failure rate stays acceptable, because the changes pass their tests. Time to restore service holds, because individual fixes are fast. The dashboard is green. Leadership sees an elite-performing delivery system.
Underneath, something else is happening. Duplication spreads because no agent knows what the others already built. Abstractions fragment into near-identical variants. ADR compliance decays one reasonable-looking exception at a time. Invariants that used to hold — this layer never calls that one, this data is always validated here — start drifting. Enforcement that exists in one part of the repo is silently absent in another. None of this registers as a failed deployment. All of it is the architecture coming apart at a speed proportional to how productive the delivery metrics say you are.
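An invariant such as "this layer never calls that one" can be checked mechanically. The following is a minimal sketch using Python's standard `ast` module to inspect static imports; the layer names and the `FORBIDDEN_IMPORTS` rule set are hypothetical, and the check does not see dynamic imports or calls routed through other modules.

```python
import ast

# Hypothetical rule set: modules in each layer must not import the listed packages.
FORBIDDEN_IMPORTS = {
    "domain": {"infrastructure", "api"},
    "api": {"infrastructure"},
}


def imported_packages(source: str) -> set[str]:
    """Collect the top-level package of every absolute import statement in `source`."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages


def layer_violations(layer: str, source: str) -> set[str]:
    """Return the forbidden packages that a module in `layer` imports."""
    return imported_packages(source) & FORBIDDEN_IMPORTS.get(layer, set())


# A generated module that is valid Python but reaches across a boundary.
generated = """
from domain.orders import Order
from infrastructure.db import session

def save(order: Order) -> None:
    session.add(order)
"""
print(layer_violations("domain", generated))  # {'infrastructure'}
```

Nothing in this module would register as a failed deployment, which is why the violation has to be caught by a check of this kind rather than by delivery metrics.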
DORA is not blind by accident. It was designed to measure delivery-system behavior, and on its own terms it is doing exactly that. The degradation is happening one layer down, in a dimension DORA never claimed to cover: whether the system stayed within its own architectural decisions. That layer has no metrics. That is the gap.
Without governance, delivery metrics stay green while architectural integrity declines
A three-layer model for AI software engineering metrics
The clean way to think about this is as a stack. Each layer answers a different question, and a healthy reading of one tells you nothing about the layer below it.
Layer 1 — Activity. Tokens, lines, commits, agent runs, tasks closed. These measure motion. In isolation they have almost no value, and under agentic generation their capacity to mislead grows, because activity can rise without delivery improving at all.
Layer 2 — Delivery. DORA’s four keys plus broader throughput. This measures what the delivery system produces: working software, shipped at a cadence, with bounded instability. This is real and necessary, and it is where most mature teams stop.
Layer 3 — Governance. The missing layer. In plain terms, governance metrics are codebase-level measures of whether AI-generated change stayed bound to the architectural decisions a system is supposed to preserve. It is the only layer that can see the failure mode in the previous section.
| Layer | What it measures | Failure mode it exposes |
|---|---|---|
| Activity | Raw motion: tokens, lines, commits, agent runs, tasks closed | Busywork mistaken for progress; volume rising with no improvement in delivery |
| Delivery (DORA) | Delivery-system behavior: deployment frequency, lead time, change failure rate, time to restore | Speed bought at the cost of stability, or vice versa |
| Governance | Whether generated change stayed bound to recorded architectural decisions and invariants | Architecture degrading while activity and delivery metrics stay green |
The governance layer is not a brand-new idea so much as a consolidation of measurement practice that already exists in pieces — architecture fitness functions, policy-violation tracking, override analysis — reorganized around a typed corpus of architectural decisions and recast for the speed of machine generation. Concretely, the kinds of signals it would surface include:
- Architectural drift rate — how fast generated code is diverging from recorded decisions (distinct from model drift).
- Governance violation density — constraint violations per unit of generated change.
- ADR conflict frequency — how often new changes contradict an existing decision.
- Policy override frequency — how often a human or agent bypasses an enforced constraint.
- Invariant stability — whether the properties that must always hold are still holding.
- Remediation-loop count — how many correction cycles a change needs before it conforms.
- Decision freshness lag — how stale the enforced decision corpus is relative to the system it governs.
- Provenance completeness — how much of the change set carries a traceable record of what was checked.
- Enforcement coverage — what fraction of the codebase the constraints actually reach.
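A few of the signals listed above can be derived from per-change enforcement records. This is a sketch only: the `ChangeRecord` shape is hypothetical, and the normalizations (violations per 1,000 changed lines, coverage as a fraction of paths) are assumptions chosen for illustration rather than established definitions.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    """Outcome of running enforcement against one generated change (hypothetical shape)."""
    lines_changed: int
    violations: int         # constraint violations found on the first check
    overridden: bool        # was an enforced constraint bypassed?
    remediation_loops: int  # correction cycles before the change conformed


def governance_signals(records: list[ChangeRecord],
                       governed_paths: int, total_paths: int) -> dict[str, float]:
    """Summarize a non-empty batch of change records into governance-layer signals."""
    total_lines = sum(r.lines_changed for r in records)
    total_violations = sum(r.violations for r in records)
    return {
        # Violations per 1,000 changed lines (assumed unit of generated change).
        "violation_density_per_kloc": (
            1000 * total_violations / total_lines if total_lines else 0.0),
        "policy_override_frequency": sum(r.overridden for r in records) / len(records),
        "mean_remediation_loops": sum(r.remediation_loops for r in records) / len(records),
        # Fraction of paths in the repo that at least one constraint reaches.
        "enforcement_coverage": governed_paths / total_paths,
    }


# Illustrative data only.
records = [
    ChangeRecord(lines_changed=400, violations=2, overridden=False, remediation_loops=1),
    ChangeRecord(lines_changed=600, violations=0, overridden=False, remediation_loops=0),
    ChangeRecord(lines_changed=1000, violations=4, overridden=True, remediation_loops=3),
    ChangeRecord(lines_changed=500, violations=0, overridden=False, remediation_loops=0),
]
signals = governance_signals(records, governed_paths=30, total_paths=40)
print(signals)
```

Each value is computed from what enforcement actually observed, so the same records always produce the same numbers.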
Activity and delivery metrics cannot see the governance layer
Governance changes the optimization target
The reason this matters is not measurement for its own sake. It is what each layer tells the system to optimize for.
With only activity and delivery metrics in view, the implicit target of an agentic pipeline is to maximize generation speed: ship more, faster, with acceptable failure rates. That target is exactly what produces the failure mode of a green dashboard over a degrading architecture, because nothing in it accounts for structural cost.
Add a governance layer and the target changes shape. It becomes to maximize sustainable autonomous throughput without destabilizing the system, that is, speed bounded by the constraint that the architecture stays coherent. Just as DORA paired throughput with stability so neither could be gamed, a governance layer pairs generation velocity with structural integrity.
Without governance, the target is “generate faster.” With governance, the target is “generate as fast as the system can absorb without coming apart.” That is the difference between throughput and sustainable throughput.
The analogy
This pattern has played out twice before, and both times the response was the same: when the rate of change rose, the industry added a feedback loop to keep speed from outrunning safety.
CI/CD systematized a fast, automated feedback loop for integration and release. Every commit is built, tested, and validated on its way to production, so the signal about whether a change is safe arrives in minutes instead of weeks. That loop sits before and at deploy time.
Observability added a different loop, after deploy. By making a system’s internal state inferable from its outputs, it let teams operate and debug distributed systems they could no longer reason about by inspection — the foundation that reliability practices like SLOs and incident response are built on top of.
Agentic engineering raises the rate and the opacity of change again, and the historical pattern suggests the same kind of response: a third loop. A governance feedback loop checks generated change against durable intent — the recorded decisions and constraints the system must preserve — before and around the act of generation. It does not replace the delivery loop or the observability loop. It is complementary to both, operating on a dimension neither was built to see.
The enterprise angle
For enterprises, this stops being an abstraction quickly. Enterprises do not actually want maximum AI output. Maximum output is a liability if you cannot account for it. What they want is narrower and harder: predictable delivery, controlled autonomy, explainable change, bounded behavior, and stability that holds as execution velocity climbs.
Those are governance properties, not generation properties. A faster model does not deliver any of them. They come from the system around the model: the layer that decides what an agent is allowed to do, checks that it did only that, and can show its work afterward. As autonomy increases, that layer stops being a nice-to-have process and becomes load-bearing. This is why governance becomes infrastructure: just as CI and observability became infrastructure once their respective loops became indispensable, governance becomes infrastructure once machines are doing most of the writing.
What this means for measurement
It is worth being honest about scope, because the term “governance metrics” is already overloaded. In AI governance it means compliance and ethics KPIs — bias, fairness, regulatory adherence. That is not what this is. What this essay describes is narrower: architectural governance telemetry, the codebase-level question of whether generated code stayed bound to recorded architectural decisions.
And it is telemetry, not a dashboard product. The defensible first surfaces are conceptual and deterministic, not a new analytics suite. Three of them are tractable: governance-violation telemetry (what was checked and what failed); architectural-drift indicators (where generated change is pulling away from recorded decisions); and enforcement provenance (a traceable record of which constraints ran against which change and what they decided). All three are repo-native and deterministic, derived from the same enforcement that already runs at hooks and CI rather than from a sampled estimate.
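As an illustration of enforcement provenance, the sketch below builds a deterministic record of which constraints ran against a change and what they decided. The field names, the constraint identifiers such as `ADR-007-layering`, and the allow/block decision rule are all hypothetical. The record deliberately omits timestamps so that identical inputs always yield an identical record.

```python
import hashlib
import json


def provenance_record(change_id: str, diff_text: str,
                      results: dict[str, bool]) -> dict:
    """Build a deterministic record of which constraints ran against a change."""
    body = {
        "change_id": change_id,
        # The content hash ties the record to the exact diff that was checked.
        "diff_sha256": hashlib.sha256(diff_text.encode("utf-8")).hexdigest(),
        "constraints": [
            {"id": cid, "passed": passed} for cid, passed in sorted(results.items())
        ],
        "decision": "allow" if all(results.values()) else "block",
    }
    # Hash a canonical serialization so the record id depends only on its content.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["record_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


# Hypothetical change and constraint identifiers.
record = provenance_record(
    change_id="change-001",
    diff_text="+from infrastructure.db import session\n",
    results={"ADR-007-layering": False, "no-new-dependencies": True},
)
print(json.dumps(record, indent=2))
```

Because the record is derived from the enforcement run itself rather than from a sampled estimate, it can be regenerated and compared at any later point.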
It is worth saying plainly what this is not. It is not a productivity dashboard. It is not a way to rank or compare individual developers — the same misuse DORA explicitly warns against. It is not an AI-ROI score. It measures one thing: whether the system stayed inside its own decisions while the machines worked.
The structural arc
The progression is the point. Activity metrics measure motion and were always weak signal. Delivery metrics — DORA — measure the behavior of the delivery system and remain necessary. Governance metrics measure whether the system stayed architecturally coherent while autonomy rose, and that is the layer agentic development makes unavoidable.
DORA was the right answer to its era’s question: are we shipping well? It is not the wrong answer now. It is an incomplete one, because it answers a question that assumed humans held the architecture together. When agents hold the keyboard, a new question becomes primary: not “are we shipping?” but “are we staying who we decided to be while we ship?”
DORA is necessary but insufficient for agentic development. Delivery metrics tell you the system is moving fast and not breaking visibly. Only governance metrics tell you it is still the system you designed.