Did long context windows kill RAG?

No. Long context windows moved filtering into the model and made aggressive reranking and chunking less necessary for simple retrieval. They did not remove the need for retrieval itself, and they did not remove the need for provenance, verification, or governance. The reranker became optional. Retrieval did not.

What is partial-context reasoning?

Partial-context reasoning is the failure mode where a model produces a coherent, confident answer based on incomplete retrieved material — without flagging that key context was missing. Unlike obvious hallucination, it is operationally dangerous because the output sounds right.

Why do 1M context windows create an observability problem?

Once a prompt reaches hundreds of thousands of tokens, no human can reliably inspect what was effectively present, what the model attended to, or what was missing. Confidence in the output becomes decoupled from completeness of the input. Missing architectural context becomes invisible — the failure mode no one can see is the one no one can fix.

Why does this matter for agentic coding systems?

A chatbot missing a postmortem produces an annoying answer. An autonomous coding agent missing an ADR produces architectural drift: deprecated dependencies, violated service boundaries, ignored migration constraints, infrastructure changes built from incomplete context. As agents gain execution authority, incomplete retrieval stops being a search problem and becomes a governance problem.

Retrieval infrastructure is not obsolete. The narrow chunk-retrieve-rerank-truncate pipeline is less critical when the model can hold more of the corpus in context. But deterministic retrieval, provenance, and verification are now more important, because the model is doing more of the filtering invisibly. The category did not disappear; it moved up the stack.

Long Context Does Not Eliminate Governance Infrastructure

The promise of killing RAG

The emotional appeal is real. Retrieval stacks are operationally painful: chunking strategies, reranker tuning, hybrid search, embedding refreshes, eval pipelines that nobody owns. Teams want to delete most of it. The 1M-context narrative promises exactly that simplification — just put everything in the prompt and let the model figure it out.

Engineering teams have started reporting the experiment in public. A common pattern: a team removes its retrieval layer after upgrading to a long-context model, sees it work for single-document lookups, then watches it quietly fail on multi-hop synthesis across postmortems, ADRs, or runbooks. The model does not flag the failure. It produces a coherent answer that happens to be operating on incomplete material.

The failure was not obvious hallucination. The model sounded coherent while operating on incomplete retrieval. That is operationally more dangerous, not less.

Retrieval did not die — filtering moved

The clearest way to read what actually happened is as a shift in where filtering lives.

Old architecture	New architecture
Retrieve narrowly	Retrieve broadly
Rerank aggressively	Reduce strict pruning
Truncate heavily	Allow model to filter internally
Inject tiny context	Inject wide context

The reranker became optional. Retrieval did not.

The thing that disappeared is the brittle middle layer of the old pipeline. The thing that remained — selecting which documents are even candidates for the prompt — still has to happen somewhere. If the candidate set is wrong, no amount of context window saves you. The model can only filter what it was handed.
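
One way to keep candidate selection inspectable is to express it as plain matching rules. The sketch below is illustrative only: the Doc type, the applies_to glob patterns, and the document IDs are hypothetical and not taken from any particular retrieval system. It selects every document whose declared scope matches a file the agent is about to touch.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Doc:
    doc_id: str                   # hypothetical identifier, e.g. "ADR-012"
    applies_to: tuple[str, ...]   # glob patterns for the paths this doc governs


def select_candidates(docs: list[Doc], touched_paths: list[str]) -> list[str]:
    """Return IDs of docs whose scope matches any touched path, sorted for repeatability."""
    selected = {
        doc.doc_id
        for doc in docs
        for pattern in doc.applies_to
        for path in touched_paths
        if fnmatchcase(path, pattern)  # case-sensitive on every platform
    }
    return sorted(selected)


# Hypothetical corpus and change set, for illustration only.
corpus = [
    Doc("ADR-012", ("services/billing/*",)),
    Doc("ADR-031", ("services/*/migrations/*",)),
    Doc("RUNBOOK-7", ("infra/*",)),
]
candidates = select_candidates(corpus, ["services/billing/migrations/0042.sql"])
print(candidates)  # ['ADR-012', 'ADR-031']
```

The point of the design is that the same inputs always produce the same candidate set, and a reviewer can read the rules to see why a document was or was not included. A real system would likely combine rules like these with broader semantic retrieval.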

Long context creates an observability problem

Once a prompt reaches hundreds of thousands of tokens, the observability properties of the system change in ways that matter more than the throughput numbers.

Humans cannot inspect effective context. A million-token prompt is not auditable by reading it.
Teams cannot reliably know what influenced outputs. Attention is opaque; logs of "what was in the prompt" are not the same as "what shaped the answer."
Confidence becomes decoupled from completeness. The model’s fluency does not degrade when key context is missing — only the correctness does.
Missing architectural context becomes invisible. The ADR that was not retrieved is the one nobody notices.

The dangerous failure mode is no longer obvious hallucination. It is partial-context reasoning that appears complete.

That is exactly the failure mode that infrastructure has historically existed to prevent. Tests, code review, type systems, CI, deploy gates — they exist because humans cannot reliably catch silent partial failures by reading the output. Long-context systems reproduce that class of problem in a place that does not yet have its tests-and-CI equivalent: the prompt itself.
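
As a rough illustration of what a test for the prompt could look like, the sketch below records a manifest of what was placed in context and reports required documents that are absent. The function names, the document IDs, and the idea of a per-task required set are assumptions made for this example, not an established practice.

```python
import hashlib


def build_manifest(context_docs: dict[str, str]) -> dict[str, str]:
    """Map each doc ID placed in the prompt to a SHA-256 digest of its text."""
    return {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in sorted(context_docs.items())
    }


def missing_required(manifest: dict[str, str], required_ids: set[str]) -> list[str]:
    """Return the required doc IDs that never made it into the prompt."""
    return sorted(required_ids - set(manifest))


# Hypothetical prompt contents and required set, for illustration only.
context_docs = {
    "POSTMORTEM-2023-04": "Cache stampede after deploy ...",
    "RUNBOOK-7": "Drain the queue before migrating ...",
}
manifest = build_manifest(context_docs)
gaps = missing_required(manifest, {"ADR-012", "RUNBOOK-7"})
print(gaps)  # ['ADR-012']
```

A check like this says nothing about what the model attended to. It only turns absence from the prompt into something a pipeline can fail on instead of something nobody notices.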

Why this matters more for agentic systems

The stakes scale with execution authority. A chatbot missing a postmortem is annoying. An autonomous coding agent missing an ADR is architectural drift.

What that looks like in practice:

Reintroducing a deprecated dependency because the deprecation decision was not retrieved
Violating a service boundary because the boundary lived in a doc that did not get pulled into context
Ignoring a migration constraint because the constraint was in a runbook the model did not see
Generating an infrastructure change from incomplete context because long context made the team trust the prompt to be self-sufficient

As agents gain execution authority, incomplete retrieval becomes a governance problem, not just a search problem.

The model is not the only thing that needs to know which ADRs apply. The system that gates the agent’s action needs to know too — deterministically, repeatably, and on every run.
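
A minimal sketch of such a gate follows, assuming the relevant decisions have already been distilled into data. The ProposedChange shape, the constraint tables, and the ADR number are all hypothetical. The gate runs outside the model and returns the same verdict for the same change every time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    service: str
    added_dependencies: frozenset[str]
    imported_services: frozenset[str]


# Hypothetical constraints, imagined as distilled from ADRs.
DEPRECATED_DEPENDENCIES = {"legacy-http": "ADR-019"}
ALLOWED_IMPORTS = {"billing": {"ledger"}, "ledger": set()}


def gate(change: ProposedChange) -> list[str]:
    """Return violations for a change; an empty list means it may proceed."""
    violations = []
    for dep in sorted(change.added_dependencies):
        if dep in DEPRECATED_DEPENDENCIES:
            violations.append(f"{dep} is deprecated by {DEPRECATED_DEPENDENCIES[dep]}")
    allowed = ALLOWED_IMPORTS.get(change.service, set())
    for target in sorted(change.imported_services - allowed):
        violations.append(f"{change.service} may not import {target}")
    return violations


change = ProposedChange(
    service="billing",
    added_dependencies=frozenset({"legacy-http", "requests"}),
    imported_services=frozenset({"ledger", "auth"}),
)
print(gate(change))
# ['legacy-http is deprecated by ADR-019', 'billing may not import auth']
```

Because the gate does not depend on what the model happened to retrieve, a missed ADR shows up as a blocked action rather than as silent drift.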

The next architecture layer

The replacement framing — LLM-with-1M-context replaces retrieval — is the wrong shape. The right shape is layered.

Long context — lets the model attend to more material in a single call
Deterministic retrieval — selects the candidate set with rules a human can read
Provenance tracking — records which decisions were retrieved, surfaced, and acted on
Architectural constraints — encoded as machine-evaluable rules, not paragraphs
Runtime verification — checks that the constraints actually held at execution time

These are not alternatives. They are different jobs that compose. Long context improves how much the model can attend to. Governance before generation determines what the agent is allowed to do with what it attended to. Verification contracts check that the intended constraints still hold in what was actually executed. Provenance makes the chain auditable after the fact.
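
To make the provenance layer concrete, here is one possible shape for it: an append-only log in which each record's hash covers the previous record, so later edits are detectable. The record fields, the stage names (taken from the retrieved, surfaced, acted-on wording above), and the document IDs are assumptions for illustration, not a reference design.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Hash a record body in a stable, key-sorted JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list[dict], stage: str, doc_ids: list[str]) -> dict:
    """Append a provenance record whose hash covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else ""
    body = {"stage": stage, "doc_ids": sorted(doc_ids), "prev": prev_hash}
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and check that each record links to its predecessor."""
    prev_hash = ""
    for record in log:
        body = {key: record[key] for key in ("stage", "doc_ids", "prev")}
        if record["prev"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical run, for illustration only.
log: list[dict] = []
append_record(log, "retrieved", ["ADR-012", "ADR-031", "RUNBOOK-7"])
append_record(log, "surfaced", ["ADR-012", "ADR-031"])
append_record(log, "acted_on", ["ADR-012"])
print(verify_chain(log))  # True

log[1]["doc_ids"].append("ADR-999")  # simulate tampering after the fact
print(verify_chain(log))  # False
```

A hash chain only shows that the log was not altered after it was written. Whether the log captured the right events in the first place is a separate problem.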

The retrieval pipeline got simpler. The infrastructure around the model got more important.

Conclusion

The industry treated long context as a replacement for retrieval infrastructure. In practice, it is becoming a force multiplier for governance infrastructure instead.

The larger the context window becomes, the more important it is to know what the model actually used, what it ignored, and which architectural constraints remained invariant throughout execution.

Bigger windows do not make architecture safer. They make the gaps in architecture harder to see.
