What the Study Found

A 2025 study in Frontiers in Computer Science on team structure and collaboration in DevOps reached a result that cuts against the usual prescription. Adopting DevOps practices was not what separated high performers from the rest. How teams were structured, and how heavily they collaborated, mattered more.

The research examined four ways of organizing the people who build and run software:

  • Siloed teams — development and operations kept apart, work thrown over a wall between them.
  • Separate but collaborative teams — distinct teams that retain their own focus while collaborating intensively across the boundary.
  • Platform-team-supported teams — product teams backed by a dedicated platform team that provides shared tooling, paved roads, and standards.
  • Fully integrated teams — development and operations merged into a single unit with no internal boundary.

The counterintuitive finding: full integration was not always best. The structures that combined specialization with heavy collaboration performed strongly, and the platform-supported model gave teams a shared foundation without forcing everyone into one undifferentiated blob. The lesson is not to merge everything. It is to keep specialized units, then connect them deliberately.

An Agent Team Taxonomy

The same four formations map cleanly onto how organizations are wiring up AI coding agents. The question the study answered for human teams is the question the industry is now answering, mostly by guessing, for agents: how much should you separate, how much should you merge, and what holds the pieces together.

| DevOps formation | Agent equivalent | How it behaves |
| --- | --- | --- |
| Siloed teams | Type 1: Siloed Agents | Coding, review, and testing agents wired together in a pipeline but sharing little context; each re-discovers what the others already knew. |
| Separate but collaborative | Type 2: Collaborative Agents | Specialized agents that share memory, read the same architectural decision records, and apply the same standards across handoffs. |
| Platform-team-supported | Type 3: Agents + Governance Platform | Agents backed by a knowledge, decision-record, and policy-enforcement layer (the platform-team analogue for machines). |
| Fully integrated | Type 4: Monolithic Agent | One large super-agent that holds every responsibility internally, with no boundary to collaborate across. |

The current market mirrors this split almost exactly. One camp bets on a single, more capable agent that does everything — the Type 4 monolith. Another builds multi-agent systems of specialists that pass work between them — Type 1 or, if they invest in shared context, Type 2. A third wraps agents in orchestration and policy layers — Type 3. Nobody has proven which formation wins. If the DevOps study generalizes, the safe bet is not the biggest agent. It is specialized agents that collaborate heavily, sitting on a shared platform that keeps them coherent.

Old Metrics Don’t Capture Agentic Engineering

The instinct, once you pick a formation, is to measure it with the metrics you already trust. For software delivery that means DORA: lead time for changes, deployment frequency, mean time to restore, and change-failure rate. Those four remain useful. They tell you whether work ships quickly and survives contact with production, and that is worth knowing whether the work came from humans or agents.

But DORA was built to measure human teams shipping to production. It says nothing about the questions that decide whether a system of agents is healthy. Are agents reusing the architectural decisions the organization already made, or re-deciding them per task? Is shared context actually reaching the agent at the moment it writes code? When an agent makes an autonomous choice, is that choice consistent with the choices other agents are making elsewhere in the same codebase? A team can post excellent DORA numbers while every agent quietly builds its own incompatible version of the system. Speed measured, integrity unmeasured.
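
The cross-agent consistency question can be made concrete with a small check. The sketch below assumes each autonomous choice is logged as an (agent, topic, choice) tuple; that log format, the agent names, and the function name `conflicting_choices` are hypothetical and do not come from the study or from any particular tool.

```python
from collections import defaultdict


def conflicting_choices(choices):
    """Return topics on which agents made more than one distinct choice.

    `choices` is an iterable of (agent, topic, choice) tuples. This log
    format is an assumption made for illustration.
    """
    by_topic = defaultdict(lambda: defaultdict(list))
    for agent, topic, choice in choices:
        by_topic[topic][choice].append(agent)
    # A topic is in conflict when it has two or more distinct choices.
    return {
        topic: dict(options)
        for topic, options in by_topic.items()
        if len(options) > 1
    }


# Hypothetical decision log from three agents working in one codebase.
decision_log = [
    ("coding-agent", "http_client", "requests"),
    ("review-agent", "http_client", "httpx"),
    ("coding-agent", "date_format", "ISO-8601"),
    ("testing-agent", "date_format", "ISO-8601"),
]

conflicts = conflicting_choices(decision_log)
for topic, options in conflicts.items():
    print(topic, options)
```

Here only `http_client` is reported, because two agents picked different libraries for the same concern. Nothing in the four DORA metrics would surface that disagreement.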

That gap is why DORA alone is insufficient for agentic development: the outcomes are visible, but the multi-agent collaboration, shared context, and autonomous decisions that produce them are not.

Proposed Agentic Engineering Performance Metrics

If DORA measures delivery, agentic engineering needs a parallel set that measures whether agents are building one coherent system and reusing what the organization knows. Each of the following has a name, a formula, and a reason it matters.

| Metric | Formula | Why it matters |
| --- | --- | --- |
| Architectural Compliance Rate | compliant changes / total changes | Whether agents build one coherent system or many incompatible ones. |
| Architectural Drift Rate | violations per 100 PRs | The pace at which the system diverges from its intended design. |
| Context Utilization Rate | relevant decisions referenced / available | Whether the knowledge that exists is actually reaching the agent. |
| Governance Intervention Rate | blocked-or-corrected / total agent changes | How often enforcement has to step in; a high rate means agents lack the right context up front. |
| Agent Coordination Efficiency | successful handoffs / total handoffs | Whether collaborative agents pass work cleanly or drop it at the boundary. |
| Decision Reuse Rate | reused decisions / total decisions applied | Whether prior decisions compound or get re-litigated per task. |
| Context Retrieval Precision | relevant retrieved / total retrieved | Whether the shared layer surfaces the right constraints, not noise. |
| Agent Rework Rate | reverted-or-redone changes / total changes | How much agent output has to be undone, which is the cost of bad context. |
| Institutional Knowledge Coverage | decisions encoded / decisions that should be | How much of what the org knows is machine-readable and enforceable. |
| Engineering Governance Effectiveness Score (EGES) | composite of the above | The DORA equivalent for agentic engineering: a single index of architectural integrity and knowledge continuity. |
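
As a minimal sketch of how several of these formulas could be computed from raw counts, the example below takes one reporting window of agent activity and derives five of the rates. The class name, field names, and sample numbers are all hypothetical. Returning 0.0 for an empty denominator is an assumption made here, not something the formulas specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentActivityCounts:
    """Raw counts for one reporting window (all field names are hypothetical)."""
    total_changes: int
    compliant_changes: int
    violations: int
    pull_requests: int
    blocked_or_corrected: int
    handoffs: int
    successful_handoffs: int
    reverted_or_redone: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(c: AgentActivityCounts) -> dict:
    """Apply the table's formulas to one window of counts."""
    return {
        "architectural_compliance_rate": ratio(c.compliant_changes, c.total_changes),
        # Drift is expressed per 100 PRs, so it is not bounded by 1.0.
        "architectural_drift_rate": 100 * ratio(c.violations, c.pull_requests),
        "governance_intervention_rate": ratio(c.blocked_or_corrected, c.total_changes),
        "agent_coordination_efficiency": ratio(c.successful_handoffs, c.handoffs),
        "agent_rework_rate": ratio(c.reverted_or_redone, c.total_changes),
    }


# Illustrative numbers only; they are not taken from any real system.
window = AgentActivityCounts(
    total_changes=200, compliant_changes=170, violations=12, pull_requests=150,
    blocked_or_corrected=30, handoffs=80, successful_handoffs=72, reverted_or_redone=10,
)
metrics = compute_metrics(window)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The context-related metrics (Context Utilization Rate, Context Retrieval Precision) follow the same ratio pattern but need a relevance judgment per retrieved decision, which is why they are left out of this count-based sketch.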

The composite matters most. Individual rates can be gamed or read in isolation; a single Engineering Governance Effectiveness Score forces the same trade-off DORA forced on delivery, but for integrity. It answers one question a leader can act on: is this organization getting better or worse at keeping its agents coherent over time.
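
The article does not specify how the composite is calculated, so the following is only one possible construction: a weighted average of ratio metrics on a 0 to 100 scale, with "lower is better" rates inverted before weighting. The weights, the choice of which metrics to include, and the function name `eges` are assumptions made for illustration.

```python
# Which direction counts as "good" for each ratio metric (an assumption of this sketch).
HIGHER_IS_BETTER = {
    "architectural_compliance_rate",
    "agent_coordination_efficiency",
    "decision_reuse_rate",
}
LOWER_IS_BETTER = {
    "governance_intervention_rate",
    "agent_rework_rate",
}


def eges(metrics: dict, weights: dict) -> float:
    """Weighted average of ratio metrics, scaled to 0..100.

    Only metrics expressed as a 0..1 ratio are supported; a per-100-PR
    rate such as drift would need its own normalization first.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, weight in weights.items():
        value = min(max(metrics[name], 0.0), 1.0)  # clamp to 0..1
        if name in LOWER_IS_BETTER:
            value = 1.0 - value  # invert so that higher always means healthier
        elif name not in HIGHER_IS_BETTER:
            raise KeyError(f"unknown metric direction: {name}")
        score += weight * value
    return 100 * score / total_weight


# Hypothetical equal weights and two hypothetical reporting windows.
weights = {"architectural_compliance_rate": 1.0, "agent_rework_rate": 1.0}
last_quarter = {"architectural_compliance_rate": 0.70, "agent_rework_rate": 0.30}
this_quarter = {"architectural_compliance_rate": 0.80, "agent_rework_rate": 0.20}

trend = eges(this_quarter, weights) - eges(last_quarter, weights)
print(f"EGES change: {trend:+.1f} points")
```

A positive difference between windows is the "getting better" signal described above. The absolute value of the score depends entirely on the chosen weights, so it is more meaningful as a trend than as a benchmark.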

Formation → Governance → Performance

The DevOps study modeled a relationship from formation to performance: how you structure teams predicts how well they deliver. Applied to agents, that model is missing its most important term. Formation alone does not produce coherent software; a perfectly chosen agent formation with no shared constraints will still drift. The relationship is Formation → Governance → Performance, and governance is the mediating variable the original study did not need to name because, for human teams, hierarchy and collaboration supplied it informally.

Agents have no informal layer. A senior engineer carries the architecture in their head and vetoes the duplicate service; an agent does not, unless the decision is written down and enforced. Without that layer, multi-agent systems drift — architectural drift accumulates at machine speed, and the continuity across agents that the collaborative formation depends on never materializes. With it, the formation pays off: specialized agents collaborate against a shared, enforced set of decisions, and the organization optimizes for architectural integrity and knowledge continuity rather than raw speed.
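
A written-down, enforced decision can be as simple as a record plus a check that runs before an agent's change is accepted. The sketch below assumes decisions are expressed as forbidden imports scoped to a path prefix. That rule shape, the record fields, and the example ADR are hypothetical; real governance layers will express constraints differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One architectural decision in enforceable form (hypothetical schema)."""
    decision_id: str
    summary: str
    applies_to_prefix: str
    forbidden_imports: frozenset


@dataclass(frozen=True)
class ProposedChange:
    """The part of an agent's change that this check looks at."""
    path: str
    imports: frozenset


def find_violations(change: ProposedChange, decisions: list) -> list:
    """Return one message per decision the change breaks."""
    found = []
    for decision in decisions:
        if not change.path.startswith(decision.applies_to_prefix):
            continue  # the decision does not cover this part of the codebase
        for module in sorted(change.imports & decision.forbidden_imports):
            found.append(
                f"{decision.decision_id}: '{module}' is not allowed under "
                f"{decision.applies_to_prefix} ({decision.summary})"
            )
    return found


# Hypothetical decision: services go through a shared HTTP wrapper.
decisions = [
    DecisionRecord(
        decision_id="ADR-007",
        summary="use the shared HTTP wrapper",
        applies_to_prefix="services/",
        forbidden_imports=frozenset({"requests"}),
    )
]

change = ProposedChange(
    path="services/billing/handler.py",
    imports=frozenset({"json", "requests"}),
)
violations = find_violations(change, decisions)
print(violations)
```

A non-empty result is what would count as a blocked or corrected change. The check plays the role of the senior engineer's veto, but only for decisions that have actually been encoded.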

This is why governance before generation and decision continuity are not add-ons to an agent strategy. They are the variable that converts a good formation into good performance. The metrics above exist to make that variable visible — to show, in numbers, whether the governance layer is doing its job.

Formation is necessary, not sufficient. The DevOps study found that structure beats practice. For agents, structure alone achieves little without a governance layer underneath it. Pick the collaborative formation, then measure whether shared decisions actually reach and bind every agent.

What Engineering Leaders Should Take From This

The temptation in 2026 is to win the agent race by buying the biggest model and pointing it at the codebase. The DevOps evidence points elsewhere: the strongest structure was specialized units collaborating heavily, supported by a shared platform. Translated to AI engineering, the strongest formation is collaborative agents on a governance layer — Type 2 and Type 3 from the taxonomy above, not the Type 4 monolith.

Getting there takes two moves. First, encode your architectural decisions as machine-readable constraints so agents can reuse them instead of re-deciding. Second, instrument the formation with the metrics that capture what DORA misses, anchored by the Engineering Governance Effectiveness Score, so you can tell whether the governance layer is holding. You can see how a governance layer enforces decisions at generation time in the live demo — the same mechanism that turns these metrics from a wish list into readings off a running system.
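
As a rough illustration of the first move, the sketch below stores decisions as JSON and separates the records an agent could actually be bound by from those that are incomplete or no longer in force. The schema (`id`, `status`, `rule`), the sample records, and the function name are hypothetical. The coverage figure treats the listed records as the full set that should be encoded, which is a simplifying assumption.

```python
import json

# Fields a record needs before it can be enforced (hypothetical schema).
REQUIRED_FIELDS = ("id", "status", "rule")

# Hypothetical decision records; ADR-002 was never given a machine-readable rule.
RAW_DECISIONS = """
[
  {"id": "ADR-001", "status": "accepted", "rule": "services must not call the database directly"},
  {"id": "ADR-002", "status": "accepted"},
  {"id": "ADR-003", "status": "superseded", "rule": "use REST between services"}
]
"""


def load_enforceable(raw: str) -> list:
    """Return the records that are complete and currently accepted."""
    records = json.loads(raw)
    return [
        record
        for record in records
        if all(record.get(field) for field in REQUIRED_FIELDS)
        and record["status"] == "accepted"
    ]


all_records = json.loads(RAW_DECISIONS)
enforceable = load_enforceable(RAW_DECISIONS)

# Share of listed decisions that an agent could actually be held to.
coverage = len(enforceable) / len(all_records)
print([record["id"] for record in enforceable], f"coverage={coverage:.2f}")
```

Records that fail the check are the ones agents will end up re-deciding, so this list doubles as a backlog for the encoding work.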