The promise of killing RAG
The emotional appeal is real. Retrieval stacks are operationally painful: chunking strategies, reranker tuning, hybrid search, embedding refreshes, eval pipelines that nobody owns. Teams want to delete most of it. The 1M-context narrative promises exactly that simplification — just put everything in the prompt and let the model figure it out.
Engineering teams have started reporting the experiment in public. A common pattern: a team removes its retrieval layer after upgrading to a long-context model, sees it work for single-document lookups, then watches it quietly fail on multi-hop synthesis across postmortems, architecture decision records (ADRs), or runbooks. The model does not flag the failure. It produces a coherent answer built on incomplete material.
The failure is not an obvious hallucination. A fluent answer built on incomplete retrieval gives nobody a reason to double-check it, and that is operationally more dangerous, not less.
Retrieval did not die — filtering moved
The clearest way to read what actually happened is as a shift in where filtering lives.
| Old architecture | New architecture |
|---|---|
| Retrieve narrowly | Retrieve broadly |
| Rerank aggressively | Reduce strict pruning |
| Truncate heavily | Allow model to filter internally |
| Inject tiny context | Inject wide context |
The reranker became optional. Retrieval did not.
The thing that disappeared is the brittle middle layer of the old pipeline. The thing that remained — selecting which documents are even candidates for the prompt — still has to happen somewhere. If the candidate set is wrong, no amount of context window saves you. The model can only filter what it was handed.
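A minimal sketch of what "retrieve broadly" can still look like in code, assuming documents carry tags and a token count. The `Doc` type, the tag-overlap rule, the document IDs, and the token budget are all illustrative assumptions, not a description of any particular retrieval stack. The point it shows is that a candidate set is still being chosen, and that whatever falls outside it never reaches the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str       # hypothetical identifier
    tags: frozenset   # hypothetical metadata used for matching
    tokens: int       # estimated token count of the document


def select_candidates(docs, query_tags, token_budget):
    """Broad, rule-based selection: keep every doc sharing a tag with the query,
    ordered by overlap, until the token budget is exhausted."""
    query_tags = frozenset(query_tags)
    matched = sorted(
        (d for d in docs if d.tags & query_tags),
        key=lambda d: (-len(d.tags & query_tags), d.doc_id),
    )
    included, dropped, used = [], [], 0
    for doc in matched:
        if used + doc.tokens <= token_budget:
            included.append(doc)
            used += doc.tokens
        else:
            # Matched the query but did not fit: the model will never see it.
            dropped.append(doc)
    return included, dropped


corpus = [
    Doc("adr-007-deprecate-queue", frozenset({"billing", "queue"}), 1_200),
    Doc("postmortem-billing-outage", frozenset({"billing"}), 4_000),
    Doc("runbook-search-reindex", frozenset({"search"}), 2_500),
]
included, dropped = select_candidates(corpus, {"billing", "queue"}, token_budget=3_000)
print([d.doc_id for d in included])
print([d.doc_id for d in dropped])
```

Returning the dropped list alongside the included one is a deliberate choice in this sketch: a wider window shrinks that list, but only an explicit record of it tells anyone what the model was not handed.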
Long context creates an observability problem
Once a prompt reaches hundreds of thousands of tokens, the observability properties of the system change in ways that matter more than the throughput numbers.
- Humans cannot inspect effective context. A million-token prompt is not auditable by reading it.
- Teams cannot reliably know what influenced outputs. Attention is opaque; logs of "what was in the prompt" are not the same as "what shaped the answer."
- Confidence becomes decoupled from completeness. The model’s fluency does not degrade when key context is missing; only its correctness does.
- Missing architectural context becomes invisible. The ADR that was not retrieved is the one nobody notices.
The dangerous failure mode is no longer obvious hallucination. It is partial-context reasoning that appears complete.
That is exactly the failure mode that infrastructure has historically existed to prevent. Tests, code review, type systems, CI, deploy gates — they exist because humans cannot reliably catch silent partial failures by reading the output. Long-context systems reproduce that class of problem in a place that does not yet have its tests-and-CI equivalent: the prompt itself.
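One hedged way to give the prompt something closer to a test is to record a manifest of what went into it and assert on that manifest before generation. The sketch below assumes the prompt is assembled from a mapping of document ID to text. The function names, document IDs, and the notion of a "required" list are illustrative assumptions. It only captures what was in the prompt, not what shaped the answer, so it addresses completeness rather than attribution.

```python
import hashlib


def build_manifest(docs):
    """Summarize the prompt's contents; `docs` maps doc_id -> text placed in the prompt."""
    return [
        {
            "doc_id": doc_id,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "chars": len(text),
        }
        for doc_id, text in sorted(docs.items())
    ]


def missing_required(manifest, required_ids):
    """Return the required document IDs that are absent from the manifest."""
    present = {entry["doc_id"] for entry in manifest}
    return sorted(set(required_ids) - present)


# Hypothetical prompt contents and a hypothetical list of must-have documents.
prompt_docs = {
    "runbook-db-migration": "Run migrations in two phases ...",
    "postmortem-billing-outage": "Root cause: retry storm ...",
}
manifest = build_manifest(prompt_docs)
gaps = missing_required(manifest, ["adr-007-deprecate-queue", "runbook-db-migration"])
print(gaps)
```

A manifest of a few dozen lines is something a human can review, which a million-token prompt is not. The content hashes also make it possible to tell later whether a document changed between two runs.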
Why this matters more for agentic systems
The stakes scale with execution authority. A chatbot missing a postmortem is annoying. An autonomous coding agent missing an ADR is architectural drift.
What that looks like in practice:
- Reintroducing a deprecated dependency because the deprecation decision was not retrieved
- Violating a service boundary because the boundary lived in a doc that did not get pulled into context
- Ignoring a migration constraint because the constraint was in a runbook the model did not see
- Generating an infrastructure change from incomplete context because long context made the team trust the prompt to be self-sufficient
As agents gain execution authority, incomplete retrieval becomes a governance problem, not just a search problem.
The model is not the only thing that needs to know which ADRs apply. The system that gates the agent’s action needs to know too — deterministically, repeatably, and on every run.
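A sketch of such a gate, under the assumption that each ADR declares the file paths it governs. The ADR identifiers, the glob patterns, and the `gate` function are hypothetical and exist only to illustrate the idea. The mapping from a proposed change to its applicable ADRs is computed by plain rules rather than by the model, so it gives the same answer on every run.

```python
from fnmatch import fnmatch

# Hypothetical scope table: which paths each ADR governs.
ADR_SCOPES = {
    "ADR-012": ["services/billing/*"],
    "ADR-030": ["infra/*", "deploy/*"],
}


def applicable_adrs(changed_paths, scopes=ADR_SCOPES):
    """Return the ADRs whose scope patterns match any changed path."""
    return sorted(
        adr
        for adr, patterns in scopes.items()
        if any(fnmatch(path, pat) for path in changed_paths for pat in patterns)
    )


def gate(changed_paths, adrs_in_context, scopes=ADR_SCOPES):
    """Allow the action only if every applicable ADR was present in the agent's context."""
    required = applicable_adrs(changed_paths, scopes)
    missing = [adr for adr in required if adr not in adrs_in_context]
    return (not missing, missing)


allowed, missing = gate(["services/billing/api/handler.py"], {"ADR-030"})
print(allowed, missing)
```

Note that `fnmatch` lets `*` match across `/`, so `services/billing/*` covers nested paths too. A stricter matcher may be preferable in a real repository layout.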
The next architecture layer
The replacement framing — LLM-with-1M-context replaces retrieval — is the wrong shape. The right shape is layered.
- Long context — lets the model attend to more material in a single call
- Deterministic retrieval — selects the candidate set with rules a human can read
- Provenance tracking — records which decisions were retrieved, surfaced, and acted on
- Architectural constraints — encoded as machine-evaluable rules, not paragraphs
- Runtime verification — checks that the constraints actually held at execution time
These are not alternatives. They are different jobs that compose. Long context increases how much the model can attend to. Deterministic retrieval fixes what it is handed. Architectural constraints, evaluated before generation, determine what the agent is allowed to do with what it attended to. Runtime verification checks that the intent survived execution. Provenance makes the chain auditable after the fact.
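To make "machine-evaluable rules, not paragraphs" concrete, here is a small sketch that encodes one kind of constraint (a forbidden dependency), checks a proposed change against it, and emits a provenance record. The `ForbidDependency` type, the record fields, the package name `legacy-queue-client`, and the run ID are all assumptions made for illustration. Real constraints would cover far more than dependency names.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ForbidDependency:
    source: str   # decision record the rule encodes, e.g. an ADR id (hypothetical)
    package: str  # dependency that must not appear in the change


def verify(constraints, dependencies):
    """Return the constraints violated by the proposed dependency list."""
    deps = set(dependencies)
    return [c for c in constraints if c.package in deps]


def provenance_record(run_id, retrieved_ids, constraints, violations):
    """Record what was retrieved, what was checked, and what the outcome was."""
    return {
        "run_id": run_id,
        "retrieved": sorted(retrieved_ids),
        "constraints_checked": [asdict(c) for c in constraints],
        "violations": [asdict(v) for v in violations],
        "allowed": not violations,
    }


constraints = [ForbidDependency(source="ADR-007", package="legacy-queue-client")]
proposed = ["requests", "legacy-queue-client"]  # dependencies in the agent's proposed change
violations = verify(constraints, proposed)
record = provenance_record(
    "run-001", {"ADR-007", "runbook-db-migration"}, constraints, violations
)
print(record["allowed"])
print(record["violations"])
```

The check does not depend on whether the model read or remembered the deprecation decision. The rule holds or fails on the proposed change itself, and the record shows which decision it came from.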
The retrieval pipeline got simpler. The infrastructure around the model got more important.
Conclusion
The industry treated long context as a replacement for retrieval infrastructure. In practice, it is becoming a force multiplier for governance infrastructure instead.
The larger the context window becomes, the more important it is to know what the model actually used, what it ignored, and which architectural constraints held throughout execution.
Bigger windows do not make architecture safer. They make the gaps in architecture harder to see.