The short answer. Generative AI software engineering is now a seven-layer stack: foundation models, context and retrieval, agent runtime, tooling and execution, governance and architectural control, validation and evaluation, and human oversight. Layers 1–3 are crowded with well-funded vendors. Layer 4 is consolidating around MCP. Layer 5 — governance — is structurally underbuilt because the problem only becomes felt at the scale where layers 1–4 have already done their work. That is the wedge.
Why a stack frame at all
Workflows describe how one team uses tools today. Stacks describe the layers any serious organization eventually has to operate, regardless of which vendors it picks. A stack frame makes it easier to see which layers are crowded with capital, which are underbuilt, and where the architectural risk concentrates. It is also how engineering leaders actually think when they plan a 24-month bet: layer by layer, with explicit owners and explicit failure modes.
The seven-layer frame below is a reference, not a product taxonomy. It will read as familiar to anyone who has shipped against agentic coding tooling in the last year — the layers are the ones the field has converged on, even if the labels have not.
The seven layers
Layer 1: Raw reasoning and generation
OpenAI, Anthropic, Google, Meta, DeepSeek, Mistral, xAI
Purpose. Raw reasoning and generation capability. The substrate every higher layer depends on. The frontier moves quarterly; the API surface is increasingly commoditized behind OpenAI-compatible endpoints.
Layer 2: Provide relevant context to models
RAG pipelines, vector databases, embeddings, semantic search, memory systems
Purpose. Surface the right tokens at the right time so the model can answer with grounded context. Effective for documentation lookup; structurally insufficient for authoritative constraint enforcement, which requires precedence and exactness rather than nearest-neighbor recall.
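The difference between recall and enforcement can be made concrete. A minimal sketch, with a toy similarity function and illustrative rule records (everything here is an assumption for the example, not any vendor's implementation): retrieval always returns the *closest* entry, while a governance lookup must return the exact applicable rule or nothing.

```python
def nearest_neighbor(query_vec, corpus):
    """RAG-style retrieval: always returns the closest entry,
    even when nothing in the corpus actually applies."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))
    return max(corpus, key=lambda doc: cos(query_vec, doc["vec"]))

def authoritative_lookup(scope, rules):
    """Governance-style lookup: exact scope match plus precedence.
    Returns None instead of a 'close enough' rule."""
    applicable = [r for r in rules if r["scope"] == scope]
    if not applicable:
        return None  # fail closed rather than guess
    return max(applicable, key=lambda r: r["precedence"])

corpus = [{"vec": [1.0, 0.0], "text": "notes on caching"}]
rules = [
    {"scope": "payments", "precedence": 1, "rule": "pin dependency majors"},
    {"scope": "payments", "precedence": 2, "rule": "no direct DB writes"},
]

# Nearest-neighbor happily returns *something* for an orthogonal query...
assert nearest_neighbor([0.0, 1.0], corpus)["text"] == "notes on caching"
# ...while the lookup returns nothing for an unknown scope,
# and the highest-precedence rule for a known one.
assert authoritative_lookup("frontend", rules) is None
assert authoritative_lookup("payments", rules)["rule"] == "no direct DB writes"
```

The failure mode is structural: similarity search has no concept of "no rule applies here," which is exactly the answer a constraint system must be able to give.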
Layer 3: Coordinate tools, steps, and agents
LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude Agent SDK, Claude Code, Cursor Agent
Purpose. Plan multi-step work, route between tools, coordinate sub-agents, persist intermediate state. The layer where "agentic" actually happens. Crowded with frameworks — the runtime question is settling, but the choice still matters because each runtime exposes a different seam to the layer above.
Layer 4: Take actions in real systems
MCP servers, REST & gRPC APIs, databases, shells, browsers, CI/CD, cloud infrastructure
Purpose. Give the agent hands. Reading is cheap; writing is where blast radius lives. Model Context Protocol has emerged as the connective tissue, with the major coding tools converging on a shared interface for tool exposure. The execution surface is wider than most teams realize once the agents start running unattended.
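The read/write asymmetry can be sketched as a toy dispatch layer that treats write-class tools as privileged. This is a hedged illustration, not any runtime's actual API; the tool names, the `unattended` flag, and the allow-list are all assumptions for the example.

```python
READ_TOOLS = {"grep", "read_file", "list_dir"}
WRITE_TOOLS = {"write_file", "run_shell", "deploy"}

def execute(tool, args, *, unattended, approved_writes=frozenset()):
    """Dispatch a tool call; block write-class tools when the agent
    runs unattended unless that tool was explicitly approved."""
    if tool in READ_TOOLS:
        return f"ran {tool}"  # reads are cheap
    if tool in WRITE_TOOLS:
        if unattended and tool not in approved_writes:
            raise PermissionError(f"{tool} blocked in unattended mode")
        return f"ran {tool}"  # writes carry the blast radius
    raise ValueError(f"unknown tool: {tool}")

assert execute("grep", {}, unattended=True) == "ran grep"
assert execute("run_shell", {}, unattended=True,
               approved_writes={"run_shell"}) == "ran run_shell"
```

The point of the sketch is the shape of the surface: as agents run unattended, every entry in the write set is an entry in the blast radius, which is why the execution layer needs an explicit boundary rather than an implicit one.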
Layer 5: The Mneme category
Decision corpora, precedence engines, pre-generation enforcement, override discipline, cross-tool governance
Purpose. Make the agent answer to the project's existing decisions before it generates.
- Enforce architectural decisions across heterogeneous tools.
- Prevent silent drift between PRs, branches, and engineers.
- Inject structured constraints into context before generation.
- Preserve decision continuity as models, agents, and codebases churn.
- Validate outputs against governance rules at the seam, not in review.
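The bullets above can be sketched end to end: structured decision records rendered into a constraint block that a runtime would prepend to the agent's context before generation. The field names, statuses, and rendering here are assumptions for illustration, not a real schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    id: str
    status: str      # "active" or "superseded"
    scope: str       # "org" or a service name
    constraint: str

def context_block(decisions, scope):
    """Render active decisions for a scope (plus org-wide ones) as a
    block injected into the agent's context before it generates."""
    active = [d for d in decisions
              if d.status == "active" and d.scope in (scope, "org")]
    return "MUST FOLLOW:\n" + "\n".join(
        f"[{d.id}] {d.constraint}" for d in active)

decisions = [
    Decision("ADR-014", "active", "org", "pin dependency majors"),
    Decision("ADR-021", "superseded", "payments", "use REST between services"),
    Decision("ADR-030", "active", "payments", "use gRPC between services"),
]
block = context_block(decisions, "payments")
assert "ADR-030" in block and "ADR-021" not in block
```

Because the records are structured, the superseded decision is filtered out mechanically — the continuity and drift-prevention jobs fall out of the data model rather than depending on the model remembering anything.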
Layer 6: Measure correctness and adherence
Benchmarks, policy tests, regression suites, eval harnesses, observability, tracing
Purpose. Quantify reliability, correctness, and governance adherence after the fact. Eval answers "did the system do the right thing?"; observability answers "what did it actually do?". Both presume layers below them are stable enough to measure.
Layer 7: Organizational accountability
Code review, architecture review, security review, approvals, escalation paths
Purpose. The accountability boundary. Humans approve, escalate, and own the consequences. Linear in throughput by design; this is the layer where the AI throughput delta becomes a queueing problem.
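The queueing claim is back-of-envelope arithmetic. A minimal sketch with illustrative numbers — the rates are assumptions, the dynamic is not:

```python
def review_backlog(arrivals_per_day, reviews_per_day, days):
    """PRs waiting for human review after `days`, assuming a flat
    human review capacity and a constant arrival rate."""
    backlog = 0
    for _ in range(days):
        backlog = max(0, backlog + arrivals_per_day - reviews_per_day)
    return backlog

# Pre-AI: arrivals match capacity, so the queue stays empty.
assert review_backlog(10, 10, 30) == 0
# 3x generation throughput, same reviewers: 600 waiting PRs in a month.
assert review_backlog(30, 10, 30) == 600
```

Any sustained arrival rate above review capacity grows the queue without bound; no amount of reviewer diligence changes the slope, which is why the fix has to land upstream of this layer.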
Each layer has a purpose, an emerging interface, and at least one credible vendor or pattern. They are not optional in any serious deployment. Even teams that "only use Cursor" are operating implicitly across all seven — they have just not separated the layers, which is why drift becomes structural rather than measurable.
The throughput vs. governance gap
The cleanest framing for the entire stack is a single asymmetry: AI coding increased generation throughput, but governance and review did not scale at the same rate. Layers 1–4 are throughput layers — they make more code happen, faster. Layer 7 is a human layer that scales linearly with people. Layers 5 and 6 are the only places where the asymmetry can be closed, and only one of them — layer 6 — is well-resourced.
The implication is direct: as long as layers 1–4 keep accelerating, the only sustainable response on the control side is to push enforcement earlier. That is the structural argument for layer 5.
Layer 5, in detail
Layer 5 is the layer that operates on the agent before it generates. It is not eval (that is layer 6) and not review (that is layer 7). It is the layer that holds the project's accumulated decisions — ADRs, dependency policies, service boundaries, naming conventions, security invariants — in a structured, queryable form that the agent has to traverse before producing output.
The five jobs of layer 5, in concrete terms:
- Enforce architectural decisions. A decision the team made six months ago is treated as a hard constraint, not a passage in a CLAUDE.md the model is asked to respect. Prompt engineering is not governance; the difference is whether the constraint is enforceable at the seam.
- Prevent drift. Two engineers prompting the same agent should not get architecturally divergent answers. The decision corpus is the shared anchor that keeps generations consistent across people, branches, and time.
- Inject constraints before generation. Hooks at the file-write seam, system-prompt augmentation at session start, and structured precedence resolution when decisions conflict. The point is that the agent does not need to "remember" the decision — the decision is presented to it.
- Preserve decision continuity. Models change. Agents change. Tools change. The decision corpus persists. Continuity across this churn is the compounding asset.
- Validate outputs against governance rules. Pre-merge gates that read structured artifacts (not freeform prose) and fail closed. Less "AI reviewed this" and more "the deterministic rule that always runs has cleared this."
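The last job can be sketched directly: a fail-closed gate that reads a structured artifact emitted at generation time (recording which decisions were consulted) and blocks the merge when the artifact is missing, unreadable, or incomplete. The artifact format and rule IDs are assumptions for the sketch.

```python
import json

def gate(artifact_json, required_rules):
    """Pre-merge gate: pass only if the generation artifact parses
    and shows every required decision was consulted. Anything
    unreadable blocks the merge (fail closed)."""
    try:
        artifact = json.loads(artifact_json)
        consulted = set(artifact["decisions_consulted"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # garbled or missing artifact -> blocked
    return required_rules <= consulted

assert gate('{"decisions_consulted": ["ADR-014", "SEC-002"]}', {"ADR-014"})
assert not gate('not json', {"ADR-014"})                 # unreadable -> blocked
assert not gate('{"decisions_consulted": []}', {"SEC-002"})
```

The design choice worth noticing is the `except` branch: a prose-based "AI reviewed this" check degrades to a pass when its input is malformed, whereas a deterministic gate degrades to a block.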
Why layers 1–3 are crowded
The capital allocation in this cycle has been straightforward: the further down the stack a layer sits, the more it looks like classic infrastructure, and the more it has attracted infrastructure-scale investment. Foundation models (layer 1) absorbed the bulk. Context and retrieval (layer 2) took the second wave, with a vector-database boom that has since consolidated. Agent runtimes (layer 3) are the current frontier — the SDK and framework wars are happening here in 2026.
Layer 4 is settling on MCP as the default execution interface, which is why the long tail of "API connector" startups has compressed in the past year. Layer 6 has its own established ecosystem (Braintrust, Langfuse, OpenAI Evals, OSS harnesses). Layer 7 has existed for fifty years.
Layer 5 is the gap. It is uncrowded for a structural reason: governance only becomes a felt problem at the scale where layers 1–4 have already done their work. Early adopters can get away with prompt-engineered style guides and CLAUDE.md files. Later adopters — the ones running multiple coding agents across multiple repos — cannot. The pull is now arriving, which is why the category is forming. Heterogeneous-agent governance is the felt version of the problem.
What a serious layer 5 looks like
A serious layer 5 is not a config file and not a prompt. It is an addressable, structured decision corpus with a precedence engine on top, accessed through hooks the agent runtimes and editors all defer to. Concretely, the things it has to do:
- Structured decisions. Decisions encoded as records with status, scope, supersession history — not paragraphs in a markdown file. Queryable, not summarizable.
- Precedence resolution. When org policy and team override and per-PR exception conflict, a deterministic rule decides which one wins. Real teams hit this in week three.
- Pre-generation hooks. Enforcement at the seam where the agent writes — SessionStart, PreToolUse, file-write hooks. Structurally different from "we put it in the system prompt."
- Tool-agnostic surface. The same decision corpus is consulted by Claude Code in the terminal, by the Cursor agent, by the Copilot extension, and by the SDK bot opening PRs in CI. No re-encoding per tool.
- Override discipline. When a rule is weakened, the override is itself a tracked decision. An untracked override is a silent merge.
- Auditability. The system can answer "which decisions applied to this generation, and why?" after the fact — for retrospectives, for security review, for regressions.
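The precedence requirement from the list above can be sketched as a deterministic resolver: the most specific level wins, with ties broken by the newest revision. The level ordering and record shape are assumptions for the example, not a prescribed schema.

```python
LEVELS = {"org": 0, "team": 1, "pr-exception": 2}  # more specific wins

def resolve(candidates):
    """Deterministic resolution for conflicting decisions on one topic:
    the most specific level wins; ties go to the newest revision."""
    return max(candidates, key=lambda d: (LEVELS[d["level"]], d["revision"]))

conflict = [
    {"level": "org",          "revision": 1, "rule": "use Postgres"},
    {"level": "team",         "revision": 3, "rule": "use Postgres + pgbouncer"},
    {"level": "pr-exception", "revision": 1, "rule": "SQLite for this spike"},
]
assert resolve(conflict)["rule"] == "SQLite for this spike"
```

What matters is not the particular ordering but that it is a total, deterministic one: two engineers resolving the same conflict on different days, through different tools, get the same answer — which is the drift-prevention property the layer exists to provide.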
The wedge. Almost everyone is competing in layers 1–3. Very few are building layer 5 seriously. The teams that ship a credible governance layer in 2026–2027 will be the reference for how engineering organizations operate AI coding once the building work is no longer the bottleneck.
The strategic point
Layers 1 through 3 are competitive markets with deep-pocketed incumbents. Building there means competing with frontier labs and venture-funded framework teams on their home ground. Layer 4 is consolidating around an open standard. Layer 6 has known patterns. Layer 7 cannot be vendored away.
Layer 5 is the layer where the product surface is still being defined, the customer pull is now arriving, and almost no one in the dominant ecosystems has shipped a coherent answer. That is the strategically scarce layer of this stack — and the one that determines, more than any other, whether AI-assisted engineering produces durable systems or expensive drift.
This is the frame the rest of the Mneme insights catalogue extends. Why RAG fails for governance covers why layer 2 cannot substitute for layer 5. Why prompt memory fails at scale covers why CLAUDE.md is not layer 5. Why code review cannot scale covers the layer-7 ceiling. Heterogeneous-agent governance covers the cross-tool surface layer 5 has to defend.