A model, on its own, does one thing: it turns an input into an output. It does not call tools, recover from a failed step, remember what it did ten minutes ago, or carry a task across a long-running job. Everything between that single output and a system that actually works in production is built by hand.

That “everything between” now has a name. Harness engineering is an emerging term in the AI engineering community, gaining traction through 2026, for the discipline of designing the systems, constraints, feedback loops, and observability that wrap around an AI model to make it reliable in production. The harness is the execution layer between a model and the system it powers.

What is harness engineering?

A harness is the layer around the model that runs the agent loop. The model decides; the harness makes the decision actionable, durable, and repeatable. Where prompt engineering optimized a single call, harness engineering builds the runtime that turns that call into a working system.

In practice, a harness typically coordinates:

  • Tool calls — routing and dispatching the model’s requested actions to real functions, APIs, and side effects, then feeding results back into the loop.
  • Retries and error recovery — retrying transient failures, backing off on rate limits, and avoiding double-applying side effects when a step partially succeeds.
  • Context and state — loading the right working set, compacting or evicting stale history, and persisting memory and retrieval so the agent does not start from zero each turn.
  • Routing across steps — deciding what runs next, switching models where appropriate, and handing off to sub-agents.
  • Multi-step execution — driving the plan-act-check loop across many iterations, often as long-running or background runs that persist state between steps.

Not every harness includes all of these, and the boundaries are not settled. Some sources scope the “harness” narrowly as the execution loop and treat system prompts and tool descriptions as scaffolding — a related but distinct term — while broader usage folds both together. The term is real and convergent enough to use without scare quotes; it is not yet canonical, and different frameworks use the same word slightly differently. The through-line across all of them is consistent: the harness is where execution happens.

The model is the smallest part of the system. A model answers a single call. A harness is what makes that call survive contact with tools, failures, long horizons, and state.

AGENT HARNESS execution layer Model decides Tool calls dispatch Retries recovery Context state & memory Routing next step Multi-step long-horizon

The harness is the execution layer that runs the loop around the model

Why prompt engineering stopped being enough

Prompt engineering was the right discipline for a world where the model was the product and a single, well-phrased call was the whole interaction. Optimize the input, get a better output. For a chat assistant, that framing held.

Production AI broke it. Real systems are multi-step, stateful, and tool-using. They call APIs, write files, run for minutes or hours, and pick up where a previous session left off. A perfect prompt does nothing for retry logic, nothing for context that overflows the window, nothing for a tool that returns garbage on the third call. Those are runtime problems, and runtime problems live in the harness, not the prompt.

This is the shift from optimizing inputs to engineering systems. It is the subject of the companion piece Prompt Engineering Was About Inputs. Harness Engineering Is About Systems., which traces the discipline as it moved from phrasing to architecture. The distinction also matters for governance: a prompt is guidance, not enforcement, which is why prompt engineering is not governance — a well-phrased instruction cannot guarantee a constraint holds.

The rise of execution runtimes and agent harnesses

As the work moved into the harness, the harness stopped being something every team rebuilt from scratch. Managed agent runtimes, orchestration frameworks, and background-execution platforms now own the loop: they handle tool dispatch, state persistence, retries, routing, and long-running execution as a product surface rather than glue code.

The shape is consistent across the ecosystem. The model is wrapped in an agent execution runtime, and that runtime — not the model call — becomes the boundary teams build against. A community vocabulary has formed around it: the widely-repeated framing that an agent equals a model plus a harness, a growing body of engineering writing on what a harness should and should not do, and shared lists of harness patterns and tooling.

The details differ by framework, and the term is still being defined in public. But the direction is not in dispute. Orchestration is becoming a layer you adopt, not a layer you reinvent — and that is exactly what happens to a capability on its way to becoming infrastructure.

Harness engineering is becoming infrastructure

This trajectory is familiar, because the industry has watched it twice before.

CI/CD started as a novel practice and became taken-for-granted infrastructure. At its core it is an automated integration-and-delivery feedback loop: every change is built, tested, and validated on its way to production, so the signal about whether it is safe arrives in minutes instead of weeks. Few teams now treat having a pipeline as a decision.

Observability followed the same path. It is the ability to infer a system’s internal state from its external outputs — logs, metrics, traces — a discipline rooted in control theory. It is a foundation for reliability, not reliability itself: it gives teams the visibility they use to preempt and resolve failures. It, too, went from a differentiator to a baseline expectation.

Harness engineering may be on a similar trajectory. The AI community draws this parallel itself, borrowing CI/CD and SRE framing to describe the harness as the feedback loop around an agent. Infrastructure waves tend to run in the same order: first capability (can we make it work at all), then scale (can we run it everywhere, reliably), then governance (can we constrain what it is allowed to do). Adoption is still early in 2026, so the honest framing is that harness engineering is following a path CI/CD and observability already walked — not that it has arrived.

Capability, then scale, then governance. Every infrastructure wave runs in that order. Harnesses are deep into capability and scale. The next layer up is the one they do not provide.

The stack: generation, orchestration, governance

It helps to see harness engineering as one layer in an execution stack rather than a standalone idea. Three layers, each answering a different question, each unable to answer the one above it.

  • Generation — the model. It produces candidate actions and text. It answers: what could come next?
  • Orchestration — the harness. It runs the loop, dispatches tools, retries, holds state, and drives multi-step work. It answers: how does the action actually execute and persist?
  • Governance — the layer above. It decides whether a generated change is allowed to stand against the system’s architectural decisions and constraints. It answers: should this have happened at all?

The harness is the orchestration layer. It is where execution is coordinated — not where intent is generated, and not where architectural integrity is enforced. This is an execution stack, a description of where work happens, not a metrics model or a maturity scale.

Governance should this change stand? Orchestration · the harness how does it execute and persist? Generation · the model what could come next?

Generation produces, orchestration executes, governance decides what is allowed

Reading the stack this way makes the boundary obvious. A better model improves generation. A better harness improves orchestration. Neither, by construction, decides whether the result conforms to the decisions the system was built to preserve. That is a separate concern, and it sits one layer up.

What harnesses do not do

A harness coordinates execution. It does not preserve architectural integrity. Those are different problems, and conflating them is the most common mistake in agent design.

Retries make a failed step succeed; they do not check whether the successful step respected a boundary it should not have crossed. Memory and context management keep an agent coherent across a session; they do not encode which architectural decisions are non-negotiable. Tool routing dispatches the right action; it does not verify that the action conforms to a repo standard, a security constraint, or an operational invariant. The harness is built to make the agent act reliably, which is not the same as making it act within the lines.

So the failure mode is structural. A harness can run flawlessly — every tool call dispatched, every retry clean, every session resumed — while the change it produces quietly contradicts an ADR, reintroduces a pattern the codebase already abandoned, or reaches across a boundary that was supposed to hold. That divergence is architectural drift, and it accumulates one locally-reasonable step at a time. The harness has no opinion about it, because preventing drift was never its job.

This gap is the subject of the companion piece Harness Engineering Still Needs Governance, which works through exactly what harnesses solve, where they go silent, and why observability of execution is not the same as enforcement of intent.

A harness makes an agent act reliably. It does not make an agent act within the lines. Reliable execution of the wrong change is still the wrong change.

The next layer: governance

If capability and scale come before governance in every infrastructure wave, then governance is not an add-on to harness engineering. It is the inevitable layer above it — the one that becomes load-bearing precisely when harnesses succeed and autonomous work scales.

Governance infrastructure sits above orchestration and answers the question the harness cannot: should this change stand, given what the system already decided about itself? That requires architectural decisions expressed as machine-evaluable constraints and checked deterministically, before and around generation — not inferred after the fact from execution traces. The broader case for this as a distinct category is made in The Next AI Infrastructure Category Is Governance.

There is a verification dimension to this as well. Coordinating execution is not the same as verifying that what was executed is correct against durable intent, which is why the missing layer in harness engineering is also a verification layer — the argument in the companion piece The Missing Layer in Harness Engineering Is Verification.

The stack, then, is the whole picture. Generation produces. Orchestration executes. Governance decides whether the execution was allowed to stand. Harness engineering is the discipline of building the middle layer well. It is real, it is becoming infrastructure, and it is necessary — and it is not the top of the stack.

Harness engineering is the execution layer. Governance is the layer that decides what execution is allowed to mean. Build the harness; then build the layer above it.