The interface is changing
Claude Code’s /goal command lets a developer set a completion condition and then keep working across turns until a smaller model decides the condition has been met. The shift it points at is bigger than the feature itself.
The unit of work is no longer the prompt. It is the objective. The developer defines what done looks like and the agent decides which intermediate moves to make. Review collapses from every turn to every loop.
Programming is becoming search
Andrej Karpathy’s AutoResearch is the cleanest example of this pattern outside of editors. The agent proposes a change, edits code, runs a short experiment, measures the result, keeps or reverts, and repeats. It is not a chatbot. It is a search loop with code as the move set and a metric as the fitness function.
The developer is no longer directly writing every candidate solution. The developer defines the search space, the success condition, and the evaluation loop. That is a different job from the one prompt engineering describes.
The general shape of the new loop:
- Objective. A measurable goal or completion condition.
- Candidate change. Agent edits code, config, or schema.
- Execution. Run tests, benchmark, or experiment.
- Measurement. Did the metric improve?
- Decision. Keep, revert, or retry, and loop.
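The five steps above can be sketched as a greedy keep-or-revert loop. This is a minimal illustration, not how /goal or AutoResearch is implemented: the names `search_loop`, `propose`, and `measure` are hypothetical, the "code" being edited is stood in for by a list of numbers, and the metric is assumed to be lower-is-better.

```python
import random
from typing import Callable

Config = list[float]

def search_loop(
    candidate: Config,
    propose: Callable[[Config], Config],
    measure: Callable[[Config], float],
    target: float,
    max_iters: int = 1000,
) -> tuple[Config, float]:
    """Greedy search: keep a change only if the metric improves."""
    best_score = measure(candidate)
    for _ in range(max_iters):
        if best_score <= target:       # objective: completion condition met
            break
        trial = propose(candidate)     # candidate change
        score = measure(trial)         # execution and measurement
        if score < best_score:         # decision: keep the change
            candidate, best_score = trial, score
        # otherwise the trial is discarded (revert) and the loop retries
    return candidate, best_score

# Toy stand-ins (assumptions): a two-value config and a distance metric.
rng = random.Random(0)

def propose(cfg: Config) -> Config:
    out = list(cfg)
    i = rng.randrange(len(out))
    out[i] += rng.uniform(-0.5, 0.5)   # small random edit to one value
    return out

def measure(cfg: Config) -> float:
    return sum((x - 3.0) ** 2 for x in cfg)

final, score = search_loop([0.0, 0.0], propose, measure, target=0.01)
```

In a real loop, `propose` would be the agent editing code and `measure` would run tests or a benchmark. The structure stays the same either way: the developer supplies the metric and the stopping condition, not the individual moves.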
Once coding becomes search, the agent stops being a junior engineer that needs instruction and starts being an optimizer that needs constraints. That distinction matters.
Shopify shows this is not just research
Shopify’s engineering team generalized AutoResearch beyond model training and used it to improve more than 40 metrics across the company. That is the proof point that turns the pattern from an interesting research toy into mainstream engineering practice.
Once a team can give an agent a metric and a loop, “make this faster,” “reduce memory usage,” “improve conversion,” or “fix all failing tests” becomes an executable instruction. That is powerful. It is also dangerous if the metric is the only thing the agent is optimizing for.
Goal-driven agents optimize for what they can measure. Architectural intent is usually not measurable unless it has been made explicit, retrievable, and enforceable.
Metrics are not architecture
This is where the model breaks down without an additional layer. A loop that scores itself only against tests and metrics can pass while quietly violating things the team cares about:
- A test suite can pass while architectural boundaries are violated.
- A benchmark can improve while the agent introduces an unwanted dependency.
- A performance metric can improve while maintainability degrades.
- A goal can be completed while the implementation contradicts an ADR.
None of these are bugs in the loop. They are limits of what the loop knows to check. The verifier confirms that the objective was reached. It does not confirm that the agent stayed inside the architecture while reaching it.
The enterprise layer: this is not just a developer-tool shift
IBM’s 2026 CEO study reports that 76% of surveyed organizations now have a Chief AI Officer, up from 26% in 2025, and that 64% of CEOs say they are comfortable making major strategic decisions based on AI-generated input. The broader framing in the report is that AI is forcing companies to redesign decision-making and authority structures.
Software engineering is experiencing the same operating-model shift earlier and more visibly. Coding agents are becoming delegated decision-makers inside the SDLC. The governance question is the same at both levels: when an autonomous system can pursue a goal on its own, what constraints must it obey while pursuing it, and how are those constraints enforced rather than hoped for?
The missing layer: governance before generation
The full stack of a goal-driven agent is not just “goal plus model.” It has five distinguishable layers, and most current tooling addresses only the first four.
Without governance, the loop only knows whether it got closer to the goal. It does not know whether it crossed a boundary the organization cares about. Tests and metrics describe outcomes. Governance describes the parts of the system that are not in scope for the agent to change in the first place.
A governed objective loop is an agent workflow where a goal, metric, and verifier drive autonomous execution, while architectural constraints define what the agent is allowed to change along the way.
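One way to picture that definition is a loop in which a candidate must clear the constraints before its metric is even considered. The sketch below is an assumption-laden illustration, not Mneme's API: `Change`, `forbid_paths`, and `governed_loop` are hypothetical names, and each candidate carries a pre-measured metric (lower is better) to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass(frozen=True)
class Change:
    description: str
    touched_paths: tuple[str, ...]
    metric: float  # measured result with the change applied (lower is better)

# A constraint returns a violation message, or None if the change is allowed.
Constraint = Callable[[Change], Optional[str]]

def forbid_paths(prefix: str, reason: str) -> Constraint:
    """Scope boundary: the agent may not modify files under `prefix`."""
    def check(change: Change) -> Optional[str]:
        for path in change.touched_paths:
            if path.startswith(prefix):
                return f"{reason}: {path}"
        return None
    return check

def governed_loop(
    candidates: Iterable[Change],
    baseline: float,
    constraints: list[Constraint],
) -> tuple[Optional[Change], list[tuple[Change, list[str]]]]:
    best, best_metric = None, baseline
    rejected = []
    for change in candidates:
        violations = [m for c in constraints if (m := c(change)) is not None]
        if violations:                  # governance: out of bounds, never kept
            rejected.append((change, violations))
            continue
        if change.metric < best_metric:  # verifier: did the metric improve?
            best, best_metric = change, change.metric
    return best, rejected

candidates = [
    Change("inline cache in cart", ("checkout/cart.py",), 0.8),
    Change("skip validation layer", ("payments/validate.py",), 0.5),
]
rules = [forbid_paths("payments/", "payments module is out of scope")]
best, rejected = governed_loop(candidates, baseline=1.0, constraints=rules)
```

The second candidate has the better metric, yet it is the first that is kept. The constraint check runs before the comparison, so an out-of-scope change cannot win on score alone.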
The governed execution stack
Drawn as a pipeline, the layers look like this. The agent loop and the verifier do not change; governance is the additional checkpoint between the loop’s output and the surfaces it would otherwise write to unchallenged.
[Figure: the governed execution stack, drawn as a pipeline. Surviving label: "/goal, AutoResearch, managed agents: the harness driving the loop."]
The same stack reads three different ways depending on which question you are asking. Mneme sits on the right-hand side of each axis.
The orchestration and execution layers above Mneme are the exploration / optimization / execution side. They are where most current investment is going, and they are improving fast. Mneme’s scope is the invariant / governance / verification side — the layer that decides whether what was explored is allowed to land.
Where Mneme fits
Mneme is the architectural constraint layer for governed objective loops. It turns ADRs, dependency rules, scope boundaries, and governance decisions into retrievable, machine-evaluable constraints that an agent receives before it generates a candidate change, and that CI can check after the change lands.
That means a goal-driven agent does not only receive the task. It receives the architectural context for the task: which abstractions it must go through, which dependencies are out of bounds, which ADRs the area it is touching is already pinned to.
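To make "machine-evaluable constraint" concrete, here is a rough sketch of a dependency rule checked against Python source with the standard `ast` module. It assumes a rule of the form "only code under one directory may import a given package". The function name, its parameters, and the rule shape are hypothetical illustrations, not Mneme's actual constraint format.

```python
import ast

def forbidden_imports(
    source: str,
    module_path: str,
    forbidden: str,
    allowed_prefix: str,
) -> list[str]:
    """Report imports of `forbidden` made from outside `allowed_prefix`."""
    if module_path.startswith(allowed_prefix):
        return []  # the owning module is allowed to import the package
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or any of its submodules.
            if name == forbidden or name.startswith(forbidden + "."):
                violations.append(
                    f"{module_path}:{node.lineno}: direct import of "
                    f"'{name}' outside {allowed_prefix}"
                )
    return violations

checkout_source = "import stripe\n\ndef charge(amount):\n    return amount\n"
outside = forbidden_imports(checkout_source, "checkout/service.py", "stripe", "payments/")
inside = forbidden_imports(checkout_source, "payments/gateway.py", "stripe", "payments/")
```

The same rule can be handed to the agent as context before it writes code and run in CI afterward. The check only parses source text, so it returns the same answer every time for the same diff.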
Concrete examples of the same goal, governed differently:
The loop still does the search. The verifier still decides whether the goal was met. Governance constrains the search space so the goal cannot be reached by violating something the team has already decided.
Worked example: the goal succeeds, the architecture fails
The failure mode that motivates governance is not the agent missing the goal. It is the agent hitting the goal in a way that quietly contradicts a decision the team has already made. A concrete walkthrough:
- [...] CheckoutService, bypassing the Payments module to skip a layer of validation.
- [...] checkout/ and payments/, evaluates the diff, returns a deterministic verdict: FAIL · ADR-007 violated: direct Stripe import outside payments module. The verdict links back to the source ADR. The change does not land.
Glossary
- Goal. The desired outcome the agent is pursuing.
- Metric. The measurable signal used to judge progress.
- Verifier. The mechanism that decides whether the goal has been met: tests, benchmarks, or a judging model.
- Governance. The architectural, security, dependency, and policy constraints that must remain true while the agent works.
- Drift. The gap between intended architecture and actual implementation, accumulated as agents make locally plausible choices the system never sanctioned.
- Governed objective loop. An agent workflow where a goal drives autonomous execution and architectural constraints bound what the agent is allowed to change while pursuing it.
Closing
The next programming model is not prompt engineering. It is objective design. Developers define goals. Agents explore solutions. Verifiers measure outcomes. But teams still need a way to preserve architectural intent while the loop runs.
Goal-driven agents make software faster. Architectural governance keeps that speed from becoming drift.