For a few years, the most valuable skill in applied AI was knowing how to phrase a request. Teams competed on prompt templates. The model was the product, and the input was the surface you tuned.

That framing has quietly stopped describing what people build. Production AI is no longer a single interaction. It is a loop: the model proposes, a tool runs, state changes, the model proposes again, something retries, a sub-agent takes over, the run continues across hours and sessions. Once that is the shape of the work, the input is no longer the thing you engineer. The system is.

Harness engineering is the emerging name for the discipline of engineering that system. The term has been gaining traction in the AI engineering community through 2026, and it covers the design of the systems, constraints, feedback loops, and observability that wrap around an AI agent to make it reliable in production. The vocabulary is real and converging across independent sources, but it is not yet canonical: different frameworks still draw the boundaries differently. What is clear is the direction: the center of gravity moved from the input to the runtime. In short, prompt engineering optimizes a single input to the model; harness engineering designs the agent runtime system that wraps the model.

Why prompt engineering exploded

Prompt engineering exploded because it worked. When the entire interaction was one call — one input, one output — the input was the only lever you had, and pulling it produced real, measurable gains.

A better prompt could change a great deal in a single call:

  • Format and structure of the output
  • Reasoning style and how much the model showed its work
  • Tone, verbosity, and adherence to a house format
  • How reliably the model stayed on the requested task

This was not a fad. Within the boundary of a single call, the input is genuinely the highest-leverage variable, and it remains so. The discipline did not become wrong. It became insufficient for a class of system that single-call thinking never anticipated.

Why it became insufficient

The limit is structural, not a matter of prompt quality. Prompt engineering optimizes one input to one call. Production AI is no longer one call.

Modern agentic systems are stateful, multi-step, and tool-using. They carry context across many turns, dispatch tool calls, mutate external systems, recover from partial failures, and run long enough that the early input is a small and fading fraction of everything the model has since seen.

In that setting, most failures are not input failures. They are system failures: a tool returned something unexpected and nothing recovered, state drifted between steps, a retry double-applied a side effect, a hand-off between sub-agents lost the thread. A better prompt cannot fix a system problem. You cannot phrase your way out of a missing retry policy or an absent state model. The lever moved.
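The double-applied side effect is the easiest of these failures to make concrete. The sketch below is a minimal illustration, not a reference implementation: `FlakyLedger`, `call_with_retry`, and the `idempotency_key` parameter are hypothetical names invented for this example. It assumes an external system that applies a charge and then loses the acknowledgement once, so the caller has to retry.

```python
import time


class FlakyLedger:
    """Hypothetical external system: applies the charge, then loses the first ack."""

    def __init__(self):
        self.balance = 0
        self.applied_keys = set()
        self._fail_next_ack = True

    def charge(self, amount, idempotency_key):
        # Apply the side effect at most once per key, however often we are called.
        if idempotency_key not in self.applied_keys:
            self.applied_keys.add(idempotency_key)
            self.balance += amount
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise TimeoutError("ack lost after the side effect was applied")
        return self.balance


def call_with_retry(fn, attempts=3, base_delay=0.01):
    """Retry fn on TimeoutError with exponential backoff (assumed policy)."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


ledger = FlakyLedger()
result = call_with_retry(lambda: ledger.charge(10, idempotency_key="step-7"))
```

The first attempt applies the charge and then fails; the retry carries the same key, so the ledger skips the second application and the balance ends at 10 rather than 20. No wording in the prompt could have produced that behavior, because it lives in the retry policy and the state model.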

The unit of engineering changed. Prompt engineering optimizes an input to a call. Harness engineering designs the system the calls run inside. When production AI stopped being a single interaction, the second became the discipline that determines whether the first even matters.

Prompt engineering vs harness engineering: from inputs to systems

The new surface is not a longer prompt. It is a set of system concerns the input never touched. A harness is the layer around the model that runs the agent loop, and in practice it typically coordinates several of these:

  • Stateful workflows — carrying and compacting context, memory, and working state across many steps and sessions rather than within one call.
  • Tool execution — routing and dispatching tool calls, validating schemas, sequencing where order matters, and handling tools that fail or return surprises.
  • Runtime memory — persistence, retrieval, and the notes and progress logs that let one run pick up where the last one stopped.
  • Retry systems — idempotent retries, backoff, and the bookkeeping that keeps a recovered step from double-applying its effects.
  • Multi-agent coordination — routing, sub-agent hand-offs, and the orchestration that drives multi-step, long-horizon execution, sometimes including durable background runs.

Not every harness includes all of these, and the field has not settled on one boundary — some scope the term narrowly to the execution loop, others fold in scaffolding like system prompts and tool descriptions. But the through-line is consistent: these are properties of a system, not qualities of an input. None of them can be expressed as a better sentence at the top of the context window.
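To make the loop around the model concrete, here is a deliberately small sketch of a harness, under stated assumptions: `stub_model` stands in for a real model call, and `RunState`, `TOOLS`, and `run_harness` are hypothetical names chosen for illustration. It shows only the core shape (propose, dispatch, mutate state, repeat until done) and omits memory, retries, and multi-agent routing.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Working state the harness carries across steps."""
    total: int = 0
    steps: list = field(default_factory=list)
    done: bool = False


def stub_model(state):
    """Stand-in for a model call: proposes the next action from current state."""
    if state.total < 3:
        return {"tool": "add", "args": {"n": 1}}
    return {"tool": "finish", "args": {}}


def tool_add(state, n):
    state.total += n


# Tool registry: the harness, not the model, decides what is callable.
TOOLS = {"add": tool_add}


def run_harness(model, state, max_steps=10):
    """Run the agent loop until the model finishes or the step budget runs out."""
    for _ in range(max_steps):
        action = model(state)
        name = action["tool"]
        if name == "finish":
            state.done = True
            break
        tool = TOOLS.get(name)
        if tool is None:
            # Unknown tool: record it and let the model propose again.
            state.steps.append(("rejected", name))
            continue
        tool(state, **action["args"])
        state.steps.append(("ran", name))
    return state


final = run_harness(stub_model, RunState())
```

Everything interesting here sits outside the model call: the step budget, the tool registry, the handling of an unknown tool, and the state that persists between proposals.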

[Figure: two panels. Prompt engineering (optimize the input): Input, Model, Output. Harness engineering (design the system, the loop around the model): Model, Tools, State, Retry / route, loop until done.]

Prompt engineering tunes the input; harness engineering designs the system loop

This is why harness engineering reads as a structural evolution rather than a trend. It follows a path the industry has walked before. Continuous integration and delivery, an automated integration-and-delivery feedback loop, went from a novel practice to taken-for-granted infrastructure. Observability, the way teams infer a system's internal state from its external outputs, went from optional to expected. Harness engineering could become infrastructure on a similar trajectory. The parallel is one the AI community itself draws. It is not yet an accomplished fact; adoption is early. But the shape is familiar.

Systems reliability is not architectural reliability

Here is where the analogy needs a sharp edge. Observability is a foundation for reliability, not reliability itself — it gives teams the visibility to preempt and resolve failures, but seeing a system is not the same as constraining it. The harness inherits the same boundary.

A well-built harness can make a system reliable in the operational sense. It runs. It retries. It recovers. It completes the task and produces a trace of what it did. That is a real and hard-won property, and it is exactly what harness engineering is for.

It is also not the same property as architectural reliability. A run can succeed on every operational measure and still produce code that violates the architecture — reaches across a boundary it was never allowed to cross, introduces a dependency the team ruled out, drifts away from a decision the organization actually made. The harness made the system reliable. It said nothing about whether the output was correct against the architecture, because correctness against architecture was never one of its inputs.
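A small sketch shows how the two properties come apart. It assumes, purely for illustration, that the team has ruled out one dependency (the module name `billing.internal` is hypothetical) and that the agent's output is Python source. The check uses the standard library `ast` module to list imports; `imported_modules` and `architecture_violations` are names invented for this example.

```python
import ast

# Hypothetical ruled-out dependency; a real list would come from the team's decisions.
FORBIDDEN_IMPORTS = {"billing.internal"}


def imported_modules(source):
    """Return the set of module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def architecture_violations(source, forbidden=FORBIDDEN_IMPORTS):
    """List imports that match a forbidden module or one of its submodules."""
    return sorted(
        m for m in imported_modules(source)
        if any(m == f or m.startswith(f + ".") for f in forbidden)
    )


# The run completed and produced valid code: operationally, a success.
generated = "from billing.internal import ledger\nimport json\n"
run_succeeded = True
violations = architecture_violations(generated)
```

Here `run_succeeded` is true and `violations` is non-empty at the same time. The first fact belongs to the harness; the second is invisible to it unless someone supplies the rule as an input.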

A reliable system is not a governed one. The harness guarantees the run completes. It does not guarantee the run respected the rules. Those are different reliabilities, and only one of them is about whether the code should have been written that way at all.

This is the seam where the harness ends and another layer begins. For the full case that the harness is the runtime and not the rulebook, see the pillar, What Is Harness Engineering. For why a reliable run still needs an independent check on its output, see the companion piece, Harness Engineering Needs a Verification Layer.

The governance implication

Pull the two reliabilities apart and the missing layer comes into focus. A system that is reliable but not governed still lacks the thing that decides what the agent is allowed to build — not what it can do, but what it must not.

The harness coordinates execution. It does not encode architectural intent, resolve which decision wins when two constraints overlap, or refuse an output that contradicts a ratified decision. Those are governance properties, and they live at a different layer — one that operates before and around generation rather than alongside it. This is the work of architectural governance: turning the decisions a team has already made into constraints the system is held to. And the only point at which a constraint can prevent a violation rather than merely report one is governance before generation — at the moment of intent, not after the run has already shipped the drift.
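One way to picture governance before generation is a check that runs on the stated intent, before any code is produced. The sketch below is an assumption-heavy illustration, not a description of any particular product: the `Decision` record, the `rank` tie-breaker, the path scopes, and the decision identifiers are all hypothetical. It resolves which decision wins when two overlap, then refuses an intent that contradicts the winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical record of a ratified architectural decision."""
    id: str
    scope: str        # path prefix the decision applies to
    dependency: str   # dependency the decision rules on
    allowed: bool
    rank: int         # assumed precedence: higher rank wins on overlap


def governing_decision(decisions, path, dependency):
    """Pick the highest-ranked decision that covers this path and dependency."""
    matches = [
        d for d in decisions
        if path.startswith(d.scope) and d.dependency == dependency
    ]
    return max(matches, key=lambda d: d.rank, default=None)


def check_intent(decisions, path, dependency):
    """Return (permitted, deciding_id) for an intent, before anything is generated."""
    decision = governing_decision(decisions, path, dependency)
    if decision is None:
        return (True, None)
    return (decision.allowed, decision.id)


DECISIONS = [
    Decision("D-1", "services/", "orm", allowed=True, rank=1),
    Decision("D-2", "services/payments/", "orm", allowed=False, rank=2),
]

blocked = check_intent(DECISIONS, "services/payments/api.py", "orm")
permitted = check_intent(DECISIONS, "services/search/api.py", "orm")
```

Both decisions cover the payments path, and the narrower, higher-ranked one wins, so the intent is refused before the run starts. The same dependency is permitted under the search path, where only the broader decision applies.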

Prompt engineering optimized the input. Harness engineering designs the system. Neither was built to answer the question that outlasts both: when the model proposes something the architecture forbids, what stops it? That is not an input problem and not a runtime-reliability problem. It is a governance problem, and it is the next layer up.

The discipline keeps climbing the stack. Inputs, then systems, then the constraints those systems must hold to. Each layer is necessary and none is sufficient alone. The harness makes the agent reliable. Governance makes it accountable to the architecture.