DORA Metrics Are Necessary But Insufficient For Agentic Development

DORA metrics measure how a delivery system behaves, and they still matter. But they were designed for a world where humans remained the primary coordination layer. Under agentic development, where machines generate most of the change, DORA can stay green while the architecture underneath quietly degrades. The missing layer is governance — metrics that measure whether generated code stays bound to the decisions the system is supposed to preserve.

By Theo Valmis · May 2026

The KPI confusion

Most teams now measure autonomous development with numbers that were already controversial before AI existed. Tokens consumed. Pull requests opened. Lines of code generated. Agent tasks completed. Percentage of a backlog closed by a coding agent. These are the dashboards going up in engineering reviews this quarter, and they share one property: none of them prove the engineering system actually got better.

This is an old confusion wearing new clothes. The industry spent a decade learning that activity is not productivity, and that counting commits and lines rewards motion, not outcomes. Agentic tooling did not make that lesson obsolete. It made it more urgent. When a machine can open forty pull requests in an afternoon, activity metrics stop being merely unhelpful and start being actively misleading, because the volume looks like progress.

AI has made engineering activity easier to measure while making engineering outcomes harder to reason about. The instrumentation got cheaper at exactly the moment the thing worth instrumenting got more complicated.

The right response is not to invent yet another activity counter. It is to be precise about what each layer of measurement can and cannot tell you, and to notice where the existing layers go silent.

Why DORA still matters

DORA (DevOps Research and Assessment) is the research program, now part of Google Cloud, behind the annual State of DevOps reports and the 2018 book Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim. It got the hardest part right: it refused to measure individual developers and instead measured the delivery system they work inside.

The four key DORA metrics are well known:

Deployment frequency — how often the team ships to production (throughput).
Lead time for changes — time from commit to running in production (throughput).
Change failure rate — the share of deployments that cause a failure (stability).
Time to restore service — how fast the team recovers when one does (stability).
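To make the definitions concrete, here is a minimal sketch of how the four keys could be computed from a log of deployment records. The `Deployment` record, its field names, and the `dora_summary` helper are assumptions made for illustration, not part of any DORA tooling, and using the median as the summary statistic is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it was running in production
    failed: bool = False                    # did this deployment cause a failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments, window_days):
    """Compute the four key metrics over a window of deployment records."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        # Throughput
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Stability
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data: four deployments in a two-day window, one of which failed.
t0 = datetime(2026, 5, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
summary = dora_summary(deployments, window_days=2)
```

For this sample the summary works out to two deployments per day, a five-hour median lead time, a 25% change failure rate, and a one-hour median time to restore. Nothing in the record describes what the change did to the architecture, which is the limitation the rest of this essay is concerned with.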

DORA added a fifth metric, reliability, in 2021 and has since refined its own terminology: time to restore service is now framed as failed deployment recovery time. The classic four remain the set most engineering leaders recognize.

The pairing is the genius of the framework. Throughput and stability are tracked together so that teams are not rewarded for trading one against the other. DORA’s repeatedly validated finding is that speed and stability are not a tradeoff: elite performers score well on both at once, while low performers score poorly on both. Measuring them as a pair guards against shipping faster by quietly accepting more breakage.

None of this is wrong, and the argument here is not against DORA. The point is narrower and structural. DORA measures the behavior of a delivery system that assumes humans remain the primary coordination layer — the people who decide what to build, review what gets built, and hold the architecture in their heads. That assumption is exactly what agentic development erodes.

What changes under agentic development

Agentic development is software engineering in which autonomous AI agents, not humans, generate most of the change — and the agentic development metrics that matter most are the ones that survive that shift.

When agents become the execution layer, three things shift at once. Each one moves load onto a part of the system that DORA does not observe.

Human review stops scaling linearly

Generation throughput and review throughput were always coupled by the same constraint: a human had to read the change. Agents break that coupling. Generation scales with compute; review scales with attention, and attention does not get cheaper. As the volume of machine-authored change rises, review becomes the binding constraint, and the temptation is to relax it — to approve faster, sample instead of read, trust the model. What slips through is not usually a broken build. It is a change that compiles, passes tests, and quietly violates an architectural decision the reviewer no longer has time to check. This is how intent debt accumulates: the gap between what the system is supposed to preserve and what its agents are actually constrained to follow.

The bottleneck shifts from implementation to validation

For most of software’s history, writing the code was the expensive step. Agentic tooling inverts that. Implementation approaches free; the cost migrates to validation — proving that a generated change is correct not just locally but against the system’s real constraints. Without a way to express those constraints so a machine can check them, validation falls back onto humans, and the cheap step floods the expensive one. Verification contracts — machine-checkable statements of what a change must satisfy — are the form validation has to take when the volume of change outruns the people reviewing it.
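A verification contract of the kind described above can be sketched as a list of named, machine-checkable clauses. Everything here is hypothetical and chosen only to illustrate the shape: the `Change` summary, the `Clause` type, the `verify` helper, and the three example clauses are assumptions, not an existing tool or standard.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical summary of one generated change."""
    files_touched: Tuple[str, ...]
    tests_passed: bool
    new_dependencies: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Clause:
    """One machine-checkable statement a change must satisfy."""
    name: str
    check: Callable[[Change], bool]


def verify(change, contract):
    """Return the names of the clauses the change fails; empty means it conforms."""
    return [clause.name for clause in contract if not clause.check(change)]


# An illustrative contract: the clauses are assumptions, not recommendations.
contract = [
    Clause("tests pass", lambda c: c.tests_passed),
    Clause("no new dependencies", lambda c: not c.new_dependencies),
    Clause("billing module untouched",
           lambda c: not any(f.startswith("billing/") for f in c.files_touched)),
]

change = Change(files_touched=("api/routes.py", "billing/ledger.py"), tests_passed=True)
failed = verify(change, contract)  # names of the clauses this change violates
```

The example change passes its tests and still fails the contract, because it touches a module the contract says it must leave alone. That is the case a test suite alone does not catch.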

Local correctness diverges from system correctness

An agent optimizes for the task in front of it. The function works, the test passes, the ticket closes. But a change that is locally correct can be globally corrosive: it introduces a second way to do something the codebase already does one way, reaches across a boundary it should respect, or contradicts a decision made three quarters ago in an architecture decision record (ADR) no one re-read. Repeated across many agents and many PRs, these locally reasonable choices compound. That compounding is architectural drift (distinct from model or data drift): the divergence of generated code from the architectural decisions it was supposed to honor. And because each violation makes the next one look normal, drift propagates: governance propagation is the dynamic by which an unenforced decision decays a little further with every change built on top of it.
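One way a boundary like this could be made checkable is to inspect what a generated module imports. The sketch below uses Python's standard `ast` module. The `FORBIDDEN` rule table and the package names in it are hypothetical, and treating imports as a proxy for "reaching across a boundary" is a simplifying assumption that will miss dynamic imports and indirect calls.

```python
import ast

# Hypothetical layering rule: code in the key package must not import
# from any of the packages listed in the value.
FORBIDDEN = {"domain": {"web", "infrastructure"}}


def imported_packages(source):
    """Return the top-level package names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped
            found.add(node.module.split(".")[0])
    return found


def layering_violations(package, source):
    """List the forbidden packages that source, living in `package`, imports."""
    return sorted(imported_packages(source) & FORBIDDEN.get(package, set()))


# A generated module that works locally but reaches into the web layer.
generated = (
    "from web.handlers import render\n"
    "import json\n"
    "\n"
    "def total(xs):\n"
    "    return sum(xs)\n"
)
violations = layering_violations("domain", generated)
```

The generated function is correct and would pass its own tests. The check flags it anyway, because the module sits in `domain` and imports from `web`.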

The hidden failure mode: DORA metrics can improve while architecture degrades

Here is the failure mode that should worry anyone running an agentic delivery pipeline. Every DORA metric can move in the right direction while the system gets structurally worse.

Deployment frequency rises, because agents ship constantly. Lead time for changes falls, because implementation is no longer the bottleneck. Change failure rate stays acceptable, because the changes pass their tests. Time to restore service holds, because individual fixes are fast. The dashboard is green. Leadership sees an elite-performing delivery system.

Underneath, something else is happening. Duplication spreads because no agent knows what the others already built. Abstractions fragment into near-identical variants. ADR compliance decays one reasonable-looking exception at a time. Invariants that used to hold — this layer never calls that one, this data is always validated here — start drifting. Enforcement that exists in one part of the repo is silently absent in another. None of this registers as a failed deployment. All of it is the architecture coming apart at a speed proportional to how productive the delivery metrics say you are.

DORA is not blind by accident. It was designed to measure delivery-system behavior, and on its own terms it is doing exactly that. The degradation is happening one layer down, in a dimension DORA never claimed to cover: whether the system stayed within its own architectural decisions. That layer has no metrics. That is the gap.

[Figure: Without governance, delivery metrics stay green while architectural integrity declines.]

A three-layer model for AI software engineering metrics

The clean way to think about this is as a stack. Each layer answers a different question, and a healthy reading of one tells you nothing about the layer below it.

Layer 1 — Activity. Tokens, lines, commits, agent runs, tasks closed. These measure motion. In isolation they have almost no value, and under agentic generation their capacity to mislead grows, because activity can rise without delivery improving at all.

Layer 2 — Delivery. DORA’s four keys plus broader throughput. This measures what the delivery system produces: working software, shipped at a cadence, with bounded instability. This is real and necessary, and it is where most mature teams stop.

Layer 3 — Governance. The missing layer. In plain terms, governance metrics are codebase-level measures of whether AI-generated change stayed bound to the architectural decisions a system is supposed to preserve. It is the only layer that can see the failure mode in the previous section.

Layer	What it measures	Failure mode it exposes
Activity	Raw motion: tokens, lines, commits, agent runs, tasks closed	Busywork mistaken for progress; volume rising with no improvement in delivery
Delivery (DORA)	Delivery-system behavior: deployment frequency, lead time, change failure rate, time to restore	Speed bought at the cost of stability, or vice versa
Governance	Whether generated change stayed bound to recorded architectural decisions and invariants	Architecture degrading while activity and delivery metrics stay green

The governance layer is not a brand-new idea so much as a consolidation of measurement practice that already exists in pieces — architecture fitness functions, policy-violation tracking, override analysis — reorganized around a typed corpus of architectural decisions and recast for the speed of machine generation. Concretely, the kinds of signals it would surface include:

Architectural drift rate — how fast generated code is diverging from recorded decisions (distinct from model drift).
Governance violation density — constraint violations per unit of generated change.
ADR conflict frequency — how often new changes contradict an existing decision.
Policy override frequency — how often a human or agent bypasses an enforced constraint.
Invariant stability — whether the properties that must always hold are still holding.
Remediation-loop count — how many correction cycles a change needs before it conforms.
Decision freshness lag — how stale the enforced decision corpus is relative to the system it governs.
Provenance completeness — how much of the change set carries a traceable record of what was checked.
Enforcement coverage — what fraction of the codebase the constraints actually reach.
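Three of the signals listed above can be sketched as simple ratios over enforcement records. The `CheckResult` record, the per-1,000-lines normalization, and the path-prefix notion of coverage are all assumptions made for illustration. The essay does not prescribe a formula for any of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical record of one constraint evaluated against one change."""
    change_id: str
    constraint: str
    passed: bool
    overridden: bool = False  # a human or agent bypassed the failing check


def violation_density(results, lines_changed):
    """Constraint violations per 1,000 changed lines (assumed unit of change)."""
    if lines_changed <= 0:
        raise ValueError("lines_changed must be positive")
    violations = sum(1 for r in results if not r.passed)
    return 1000 * violations / lines_changed


def override_frequency(results):
    """Share of failing checks that were bypassed rather than fixed."""
    failing = [r for r in results if not r.passed]
    if not failing:
        return 0.0
    return sum(1 for r in failing if r.overridden) / len(failing)


def enforcement_coverage(all_paths, governed_prefixes):
    """Fraction of files that fall under at least one enforced path prefix."""
    if not all_paths:
        return 0.0
    prefixes = tuple(governed_prefixes)
    return sum(1 for p in all_paths if p.startswith(prefixes)) / len(all_paths)


# Illustrative data only.
results = [
    CheckResult("pr-1", "no-cross-layer-import", passed=True),
    CheckResult("pr-2", "no-cross-layer-import", passed=False),
    CheckResult("pr-3", "single-http-client", passed=False, overridden=True),
    CheckResult("pr-4", "single-http-client", passed=True),
]
density = violation_density(results, lines_changed=4000)
overrides = override_frequency(results)
coverage = enforcement_coverage(
    ["src/domain/a.py", "src/web/b.py", "scripts/c.py", "tools/d.py"],
    governed_prefixes=["src/"],
)
```

On this sample the density is 0.5 violations per 1,000 changed lines, half of the failing checks were overridden, and the constraints reach half of the files. None of those three numbers would appear on a delivery dashboard.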

[Figure: Activity and delivery metrics cannot see the governance layer.]

Governance changes the optimization target

The reason this matters is not measurement for its own sake. It is what each layer tells the system to optimize for.

With only activity and delivery metrics in view, the implicit target of an agentic pipeline is “maximize generation speed.” Ship more, faster, with acceptable failure rates. That target is exactly what produces the failure mode of a green dashboard over a degrading architecture, because nothing in it accounts for structural cost.

Add a governance layer and the target changes shape. It becomes “maximize sustainable autonomous throughput without destabilizing the system”: speed bounded by the constraint that the architecture stays coherent. Just as DORA paired throughput with stability so neither could be gamed, a governance layer pairs generation velocity with structural integrity.

Without governance, the target is “generate faster.” With governance, the target is “generate as fast as the system can absorb without coming apart.” That is the difference between throughput and sustainable throughput.

The analogy

This pattern has played out twice before, and both times the response was the same: when the rate of change rose, the industry added a feedback loop to keep speed from outrunning safety.

CI/CD systematized a fast, automated feedback loop for integration and release. Every commit is built, tested, and validated on its way to production, so the signal about whether a change is safe arrives in minutes instead of weeks. That loop sits before and at deploy time.

Observability added a different loop, after deploy. By making a system’s internal state inferable from its outputs, it let teams operate and debug distributed systems they could no longer reason about by inspection — the foundation that reliability practices like SLOs and incident response are built on top of.

Agentic engineering raises the rate and the opacity of change again, and the historical pattern suggests the same kind of response: a third loop. A governance feedback loop checks generated change against durable intent — the recorded decisions and constraints the system must preserve — before and around the act of generation. It does not replace the delivery loop or the observability loop. It is complementary to both, operating on a dimension neither was built to see.

The enterprise angle

For enterprises, this stops being an abstraction quickly. Enterprises do not actually want maximum AI output. Maximum output is a liability if you cannot account for it. What they want is narrower and harder: predictable delivery, controlled autonomy, explainable change, bounded behavior, and stability that holds as execution velocity climbs.

Those are governance properties, not generation properties. A faster model does not deliver any of them. They come from the system around the model — the layer that decides what an agent is allowed to do, checks that it did only that, and can show its work afterward. As autonomy increases, that layer stops being a nice-to-have process and becomes load-bearing. This is why governance becomes infrastructure: the same way CI and observability became infrastructure once their respective loops became indispensable, governance becomes infrastructure once machines are doing most of the writing.

What this means for measurement

It is worth being honest about scope, because the term “governance metrics” is already overloaded. In AI governance it means compliance and ethics KPIs — bias, fairness, regulatory adherence. That is not what this is. What this essay describes is narrower: architectural governance telemetry, the codebase-level question of whether generated code stayed bound to recorded architectural decisions.

And it is telemetry, not a dashboard product. The defensible first surfaces are conceptual and deterministic, not a new analytics suite, and three of them are tractable now: governance-violation telemetry (what was checked and what failed); architectural-drift indicators (where generated change is pulling away from recorded decisions); and enforcement provenance (a traceable record of which constraints ran against which change and what they decided). All three are repo-native and deterministic, derived from the same enforcement that already runs at hooks and CI rather than from a sampled estimate.
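As a sketch of what an enforcement provenance record might look like, the function below builds a deterministic record from a change and the decisions its constraints produced. The record layout, the field names, and the use of SHA-256 over a canonical JSON encoding are assumptions for illustration. The record deliberately carries no timestamp, so the same inputs always yield the same record.

```python
import hashlib
import json


def provenance_record(change_id, diff_text, decisions):
    """Build a deterministic record of which constraints ran and what they decided.

    `decisions` maps a constraint name to "pass" or "fail" (hypothetical format).
    """
    body = {
        "change_id": change_id,
        # Fingerprint of the exact change the constraints were run against.
        "diff_sha256": hashlib.sha256(diff_text.encode("utf-8")).hexdigest(),
        "decisions": dict(sorted(decisions.items())),
    }
    # Canonical encoding: sorted keys and fixed separators, so the hash does
    # not depend on the order in which the decisions were supplied.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["record_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


record = provenance_record(
    "pr-42",
    "diff --git a/domain/order.py b/domain/order.py\n",
    {"single-http-client": "pass", "no-cross-layer-import": "fail"},
)
```

Because the record is derived only from the change and the enforcement results, it can be regenerated and compared later, which is what separates it from a sampled estimate.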

It is worth saying plainly what this is not. It is not a productivity dashboard. It is not a way to rank or compare individual developers — the same misuse DORA explicitly warns against. It is not an AI-ROI score. It measures one thing: whether the system stayed inside its own decisions while the machines worked.

The structural arc

The progression is the point. Activity metrics measure motion and were always weak signal. Delivery metrics — DORA — measure the behavior of the delivery system and remain necessary. Governance metrics measure whether the system stayed architecturally coherent while autonomy rose, and that is the layer agentic development makes unavoidable.

DORA was the right answer to its era’s question: are we shipping well? It is not the wrong answer now. It is an incomplete one, because it answers a question that assumed humans held the architecture together. When agents hold the keyboard, a new question becomes primary — not are we shipping, but are we staying who we decided to be while we ship.

DORA is necessary but insufficient for agentic development. Delivery metrics tell you the system is moving fast and not breaking visibly. Only governance metrics tell you it is still the system you designed.

Frequently asked questions

Do DORA metrics still matter for AI and agentic development?

Yes. DORA metrics still accurately measure delivery-system behavior, pairing throughput (deployment frequency, lead time for changes) with stability (change failure rate, time to restore service). They remain necessary. The limitation is that they were designed assuming humans are the primary coordination layer, so under agentic development they cannot tell you whether generated code stayed bound to your architectural decisions. They are necessary but insufficient, not wrong.

What are governance metrics in software engineering?

In this context, governance metrics are architectural governance telemetry: codebase-level measures of whether generated change stayed bound to the architectural decisions and invariants a system is supposed to preserve. Examples include architectural drift rate, governance violation density, ADR conflict frequency, and enforcement coverage. They are distinct from AI-governance compliance KPIs like bias or fairness, and they form the third layer of the stack, beneath activity metrics and delivery (DORA) metrics.

Can DORA metrics improve while the architecture degrades?

Yes, and this is the core failure mode under agentic development. Deployment frequency can rise, lead time can fall, and failure and recovery metrics can stay acceptable while duplication spreads, abstractions fragment, ADR compliance decays, and invariants drift underneath. None of that registers as a failed deployment, so the DORA dashboard stays green. The degradation happens one layer down, in a dimension DORA never claimed to measure.

Is governance telemetry just another engineering analytics dashboard?

No. It is deliberately scoped as deterministic, repo-native telemetry derived from the same enforcement that runs at hooks and CI, not a sampled analytics product. The defensible first surfaces are conceptual: governance-violation telemetry, architectural-drift indicators, and enforcement provenance. It is explicitly not a productivity dashboard, not a tool for ranking individual developers, and not an AI-ROI score.

What metrics should you use to measure agentic and autonomous software engineering?

Use a three-layer stack. Activity metrics (tokens, lines, commits, agent runs, tasks closed) measure motion and are weak signal on their own. Delivery metrics, the four DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service), measure delivery-system behavior and remain necessary. Governance metrics are the third layer: architectural governance telemetry such as architectural drift rate, governance violation density, ADR conflict frequency, and enforcement coverage, which measure whether AI-generated change stayed bound to the system’s recorded architectural decisions. Activity and delivery metrics cannot see architectural degradation; governance metrics can.