A new paper, "Constraint Decay: The Fragility of LLM Agents in Backend Code Generation", gives a useful name to a failure mode many engineering teams are already starting to feel. The authors study how LLM agents perform when backend generation tasks require not only functional correctness, but also adherence to structural constraints such as architectural patterns, databases, object-relational mappings, and framework conventions. Their finding is direct: agents perform well under loose specifications, but degrade as structural requirements accumulate.
The paper evaluates 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks, using both behavioral tests and static verifiers. Capable configurations lose around 30 points on average in assertion pass rates from baseline to fully specified tasks.
That phenomenon is what the authors call constraint decay.
It is an important phrase because it separates two different problems.
The first problem is obvious: the generated code does not work.
The second problem is more dangerous: the generated code works, but violates the structure of the system.
- It bypasses the intended data layer.
- It ignores the ORM convention.
- It places logic in the wrong boundary.
- It follows the endpoint contract but not the architectural contract.
- It satisfies a test while weakening the codebase.
That distinction matters.
Functional correctness tells you whether the output behaves as expected. Structural correctness tells you whether the output preserves the system it was supposed to extend.
The problem is not that the agent cannot write code. The problem is that it cannot reliably preserve the rules that make the code belong to this system.
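To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the `UserRepository` class, the two handlers, and the `touches_raw_sql` check are not taken from the paper. Both handlers return the same response, so a behavioral test cannot tell them apart. A deliberately crude structural check, which looks for `execute` among the attribute names referenced by the function's bytecode, can.

```python
import sqlite3
from typing import Optional


class UserRepository:
    """Hypothetical approved data layer: the only place SQL is meant to live."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def find_name(self, user_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


def handler_via_repository(repo: UserRepository, user_id: int) -> dict:
    """Respects the boundary: the handler only talks to the repository."""
    return {"id": user_id, "name": repo.find_name(user_id)}


def handler_with_shortcut(conn: sqlite3.Connection, user_id: int) -> dict:
    """Same behavior, but the handler issues SQL itself."""
    row = conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return {"id": user_id, "name": row[0] if row else None}


def touches_raw_sql(func) -> bool:
    """Crude structural check: does the function body reference `.execute`?"""
    return "execute" in func.__code__.co_names


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
    return conn


if __name__ == "__main__":
    conn = make_db()
    clean = handler_via_repository(UserRepository(conn), 1)
    shortcut = handler_with_shortcut(conn, 1)
    print("behavior identical:", clean == shortcut)
    print("clean handler touches SQL:", touches_raw_sql(handler_via_repository))
    print("shortcut handler touches SQL:", touches_raw_sql(handler_with_shortcut))
```

A real verifier would need something sturdier than a name lookup, but the shape is the point: the behavioral assertion and the structural assertion are separate questions with separate answers.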
Constraint decay becomes architectural drift
Constraint decay is the local failure mode.
Architectural drift is the accumulated consequence.
A single agent-generated change that ignores an ORM pattern may look harmless. A single shortcut around a service boundary may even pass review if the behavior is correct. But when agents are producing more code, more frequently, across more surfaces, those small structural violations compound.
Over time, the system starts to diverge from its intended architecture.
The issue is not that the team forgot its architecture. The issue is that the architecture was never made executable at the point where agents were generating code.
| Failure mode | What broke | Where it shows up |
|---|---|---|
| Functional failure | The code does not work | Tests, runtime errors |
| Constraint decay | The code works but ignores structural rules | In individual PRs, if anyone looks for it |
| Architectural drift | Decay accumulated across the codebase | Months later, in rework and incidents |
This is the core governance gap in AI-assisted development.
Teams already have decisions. They have ADRs, conventions, code review norms, database boundaries, framework preferences, and hard-won lessons about what not to do. But most of those constraints remain written for humans. They live in documents, comments, onboarding conversations, and senior engineers’ heads.
Coding agents do not reliably preserve that kind of context.
They need executable boundaries.
Tests are necessary, but not sufficient
One of the most useful parts of the paper is its evaluation design. The authors use both end-to-end behavioral tests and static verifiers. That separation is critical. Behavioral tests evaluate whether the generated application works. Static verifiers evaluate whether the code satisfies structural requirements.
That maps directly to the infrastructure gap emerging around coding agents.
Tests validate behavior. Governance validates intent.
A test can tell you whether an endpoint returns the right response. It may not tell you whether the implementation used the approved repository pattern. It may not detect that a dependency crossed the wrong layer. It may not know that the team has an ADR prohibiting a certain storage path, framework shortcut, or migration pattern.
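As a rough illustration of what a static verifier for a layer boundary might look like, the sketch below uses Python's standard `ast` module to list the packages a source file imports and compares them against a forbidden set. The layer name `api` and the contents of `FORBIDDEN_IMPORTS_BY_LAYER` are assumptions made up for this example. They are not the paper's verifiers or any team's real policy.

```python
import ast

# Hypothetical rule: modules in the "api" layer may not import these directly.
FORBIDDEN_IMPORTS_BY_LAYER = {
    "api": {"sqlite3", "sqlalchemy", "psycopg2"},
}


def imported_modules(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            found.add(node.module.split(".")[0])
    return found


def layer_violations(layer: str, source: str) -> list[str]:
    """List forbidden packages that the given source imports for its layer."""
    forbidden = FORBIDDEN_IMPORTS_BY_LAYER.get(layer, set())
    return sorted(imported_modules(source) & forbidden)


if __name__ == "__main__":
    agent_output = (
        "import sqlite3\n"
        "from flask import jsonify\n"
        "\n"
        "def get_user(user_id):\n"
        "    ...\n"
    )
    print(layer_violations("api", agent_output))
```

The source is only parsed, never executed, so the check runs without the imported packages being installed. It also says nothing about whether the endpoint works, which is exactly why it complements a behavioral test rather than replacing it.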
In traditional development, senior engineers often caught these issues during review. That worked when code volume was human-paced.
AI changes the economics.
If agents increase the volume of generated code, and structural validation remains concentrated at PR review, then the review queue becomes the governance layer by accident. Senior engineers become constraint recovery systems.
That is not scalable.
Backend systems expose the problem faster
The paper’s backend focus is especially useful because backend systems make structural decay harder to hide.
Frontend demos can often look plausible while hiding poor structure. Backend systems have more explicit architectural commitments: data access patterns, schema constraints, framework conventions, service boundaries, API contracts, and runtime behavior.
The authors find significant variation across frameworks. Agents do better in minimal, explicit environments such as Flask and worse in more convention-heavy environments such as FastAPI and Django. They also identify data-layer defects, including incorrect query composition and ORM runtime violations, as leading causes of failure.
That is the part engineering leaders should pay attention to.
The more a system depends on conventions, implicit architecture, and layered data access, the more fragile agent-generated code becomes without governance.
This is not a prompt problem alone.
You can put more instructions in the prompt. You can ask the agent to be careful. You can paste the architecture into context. Those things may help, but they do not create a reliable enforcement layer.
Instructions are not invariants.
The next layer is architectural governance
The AI coding stack is still heavily focused on generation.
- Better models.
- Better IDEs.
- Better autocomplete.
- Better agent loops.
- Better test generation.
- Better PR summaries.
All of that matters.
But as generation improves, the bottleneck shifts.
The question becomes: who preserves the architecture?
That is where architectural governance becomes infrastructure. Not governance in the abstract enterprise-policy sense. Governance as executable technical constraints inside the development workflow.
A governance layer should be able to answer questions like:
- Does this change violate an architectural decision?
- Did the agent introduce a dependency across a forbidden boundary?
- Did it bypass the approved data access pattern?
- Did it modify a surface that should be out of scope?
- Can we trace the violation back to the decision it contradicts?
- Can this check run before the PR review queue?
That is the shift from documentation to verification contracts.
Architecture cannot remain passive context when agents are actively generating code. It has to become enforceable.
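Some of the questions above, in particular out-of-scope surfaces and tracing a violation back to its decision, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `Decision` and `Violation` types, the ADR identifiers, and the glob patterns are all hypothetical, and it does not describe how any particular tool represents decisions.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Decision:
    """A hypothetical machine-readable slice of an ADR."""

    adr_id: str
    summary: str
    forbidden_paths: tuple[str, ...]  # glob patterns a change must not touch


@dataclass(frozen=True)
class Violation:
    """A conflict between one changed path and one decision."""

    adr_id: str
    path: str
    summary: str


def check_change(
    changed_paths: list[str], decisions: list[Decision]
) -> list[Violation]:
    """Return one violation per (decision, changed path) pair that conflicts."""
    violations = []
    for decision in decisions:
        for path in changed_paths:
            if any(fnmatchcase(path, p) for p in decision.forbidden_paths):
                violations.append(
                    Violation(decision.adr_id, path, decision.summary)
                )
    return violations


if __name__ == "__main__":
    # Made-up decisions and a made-up change set, for illustration only.
    decisions = [
        Decision("ADR-0012", "Schema migrations are written by hand", ("migrations/*",)),
        Decision("ADR-0031", "Billing is out of scope for agent changes", ("billing/*",)),
    ]
    changed = ["api/users.py", "migrations/0007_add_column.py"]
    for v in check_change(changed, decisions):
        print(f"{v.path} violates {v.adr_id}: {v.summary}")
```

Because each violation carries the identifier of the decision it contradicts, a CI step built on a function like this could fail the build with a pointer to the relevant ADR before the change ever reaches the review queue.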
Where Mneme fits
This is the category Mneme is being built for: architectural governance for AI-assisted development.
Mneme is not trying to be another semantic memory layer or generic RAG system. The goal is to make architectural decisions enforceable across the places where AI-assisted development happens.
That means turning decisions, ADRs, and project constraints into checks that can run before generation, during agent workflows, and in CI.
A coding assistant can generate.
A test suite can validate behavior.
A PR reviewer can still apply judgment.
But the architectural invariants should not depend entirely on late human review. They should be available as executable guardrails.
That is why constraint decay is such a useful research framing. It gives language to the exact failure mode Mneme is designed to address.
- Constraint decay is what happens when agents lose structural fidelity.
- Architectural drift is what happens when that decay compounds across a codebase.
- Architectural governance is the missing control layer.
Generation needs boundaries
The answer is not to slow coding agents down.
The answer is to give them better boundaries.
As AI-assisted development becomes normal, teams will not only ask whether agents can produce more code. They will ask whether agents can preserve the architecture, respect the decisions already made, and avoid shifting structural risk into review queues.
That is the next maturity step.
The future of AI-assisted development will not be won by generation alone. It will be won by generation plus governance.