In May 2026, GitHub engineers Gaurav Mittal and Reshabh Kumar Sharma published “Validating agentic behavior when ‘correct’ isn’t deterministic.” It is one of the sharper pieces of engineering writing on agent reliability this year, and it names the problem precisely: traditional test scripts assume deterministic correctness, and autonomous agents do not behave deterministically. The same task can be completed by many valid action paths, with UI noise and timing variation along the way. A step-by-step assertion breaks on the first harmless deviation.
GitHub’s answer is what they call, in their own words, an independent “Trust Layer.” Rather than scripting exact steps, they learn a model of correct behavior from a handful of passing execution traces, merge those traces into a graph, and apply dominator analysis, a technique borrowed from compiler theory, to extract the “essential states” every successful run must pass through, while discarding optional noise like loading spinners. A new run is then graded against that learned ground truth. As they put it, this moves “the source of truth from the agent’s internal logic to a learned external structure,” and they report that structural validation beats agent self-grading by a wide margin.
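To make the dominator idea concrete, here is a minimal Python sketch. It assumes each passing trace is just an ordered list of state names; the state names, the synthetic START and END nodes, and the helper functions are all illustrative assumptions, not GitHub's implementation. A state counts as essential when it dominates END, meaning every path from START to END in the merged graph passes through it.

```python
from collections import defaultdict

def merge_traces(traces):
    """Merge passing traces into one directed graph with synthetic START/END nodes."""
    succ = defaultdict(set)
    for trace in traces:
        path = ["START", *trace, "END"]
        for a, b in zip(path, path[1:]):
            succ[a].add(b)
    return succ

def dominators(succ, entry="START"):
    """Classic iterative dominator computation: dom[n] = {n} | intersection of dom[p] over preds p."""
    nodes = set(succ) | {n for targets in succ.values() for n in targets}
    pred = defaultdict(set)
    for a, targets in succ.items():
        for b in targets:
            pred[b].add(a)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set.intersection(*(dom[p] for p in pred[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def essential_states(traces):
    """States that every START -> END path in the merged graph must pass through."""
    dom = dominators(merge_traces(traces))
    return dom["END"] - {"START", "END"}

def missing_essentials(run, essential):
    """Essential states a new run never reached (empty set means none are missing)."""
    return essential - set(run)

# Hypothetical passing traces; "spinner" and "toast" stand in for optional UI noise.
PASSING_TRACES = [
    ["open_repo", "spinner", "edit_file", "commit"],
    ["open_repo", "edit_file", "toast", "commit"],
    ["open_repo", "edit_file", "commit"],
]
ESSENTIAL = essential_states(PASSING_TRACES)
print(sorted(ESSENTIAL))  # ['commit', 'edit_file', 'open_repo']
```

The spinner and toast states drop out because at least one passing path bypasses each of them. A real grader would presumably also check ordering and handle far noisier traces; this sketch only checks which essential states were reached.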
The instinct is exactly right: the system being evaluated should not be the only source of trust. Reliability has to come from something independent of the agent. That is the same instinct that produces every governance layer worth having. The question this article asks is narrower and more useful than “is GitHub right?” (they are): what, precisely, does a Trust Layer like this verify, and what does it structurally leave uncovered?
What the Trust Layer actually validates
Read the method carefully and the answer is specific. The Trust Layer validates the run’s behavior: did this execution reach the essential outcomes a successful execution is supposed to reach? It is tolerant of many paths on purpose, because the whole premise is that there is no single correct sequence. It grades behavioral equivalence — was the end state acceptable — and its headline result is telling: an F1-score of 52.2% at distinguishing a genuine agent execution error from an actual product regression.
That number is not a knock. It is the honest shape of the problem. When the environment is nondeterministic and correctness is path-tolerant, the best you can do is grade behavior against a learned model of what acceptable looks like. You sample, you compare, you score. You do not get a guarantee, because there is no fixed thing to check against — the run could have gone a dozen acceptable ways. Behavioral validation is approximation by necessity, and GitHub’s contribution is making the approximation far better than self-reported success.
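For reference, the F1-score is the harmonic mean of precision and recall. The definition below is the standard one, with TP, FP, and FN standing for true positives, false positives, and false negatives. The precision and recall values behind the reported figure are not given here, so none are filled in.

```latex
F_1 = \frac{2 \, P \, R}{P + R},
\qquad
P = \frac{TP}{TP + FP},
\qquad
R = \frac{TP}{TP + FN}
```

Because it is a harmonic mean, F1 is pulled toward the weaker of the two components, which is part of why it is an honest single number for a grader that has to trade false alarms against missed failures.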
The Trust Layer answers one question: did this run behave acceptably, given that many paths were valid? Because the environment is probabilistic, that answer is a graded score, not a verdict.
The diff is not nondeterministic
Here is the move the news hook makes easy to miss. Everything in GitHub’s framing is about the run — the trajectory, the sequence of states, the behavior in an environment that refuses to hold still. But an agent that finishes a coding task does not only produce a run. It produces an artifact: a diff. And the diff is not nondeterministic at all. It is a fixed, finite change to the codebase.
That changes the kind of question you can ask about it. “Did the agent behave acceptably?” is path-tolerant and probabilistic, so it must be graded. “Does this diff still match the decisions the team already ratified?” is neither. The diff is fixed. The decisions — the architectural decision records, the boundaries, the dependency rules — are fixed. Comparing two fixed things is not a grading problem. It is a deterministic verdict: the same change, against the same constraints, yields the same pass-or-fail result, on every harness, every time.
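A minimal Python sketch of what "comparing two fixed things" can look like. The `Diff` and `Constraints` shapes, the dependency name, and the module names are hypothetical, chosen only for illustration. The point is that the verdict is a pure function of its two inputs: no sampling, no environment, no clock.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Diff:
    """A fixed, finite change, reduced to the facts the constraints care about."""
    added_dependencies: frozenset
    added_imports: frozenset  # pairs of (importing module, imported module)

@dataclass(frozen=True)
class Constraints:
    """Ratified decisions, already expressed as machine-checkable sets."""
    forbidden_dependencies: frozenset
    forbidden_imports: frozenset

def verdict(diff: Diff, constraints: Constraints):
    """Pure function: the same diff and constraints always yield the same result."""
    violations = sorted(
        [f"dependency:{d}"
         for d in diff.added_dependencies & constraints.forbidden_dependencies]
        + [f"import:{a}->{b}"
           for a, b in diff.added_imports & constraints.forbidden_imports]
    )
    return (not violations, violations)

# Hypothetical change and decisions.
change = Diff(
    added_dependencies=frozenset({"leftpad"}),
    added_imports=frozenset({("checkout", "billing_internal")}),
)
rules = Constraints(
    forbidden_dependencies=frozenset({"leftpad"}),
    forbidden_imports=frozenset({("checkout", "billing_internal")}),
)

passed, violations = verdict(change, rules)
print(passed, violations)
# False ['dependency:leftpad', 'import:checkout->billing_internal']
```

Sorting the violations is a small but deliberate choice: it keeps the output byte-identical across runs, so the verdict can be diffed, cached, or compared between harnesses.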
This is the inversion at the center of the whole category. Behavioral validation lives on the run axis and is probabilistic because the environment is. Architectural conformance lives on the artifact axis and is deterministic because the decisions are. They are not two rungs on one ladder. They are two different axes, and a Trust Layer can be excellent on the first while having nothing at all to say about the second. The deeper version of why probabilistic generation still needs a deterministic boundary is laid out in “The AI stack is rebuilding determinism around probabilistic models”; this is that argument made concrete on a single GitHub release.
Same passing run, two different questions: behaved acceptably, and conforms to your decisions
Outcome-acceptable is not decision-conformant
Make it concrete. An agent is given a task. It runs, and GitHub’s dominator model grades the run a clean success — it reached every essential state, no anomalies, classified “Not a Bug” rather than a regression. By GitHub’s measure, the agent behaved correctly. That measurement is right.
And the diff that run produced adds a dependency an architectural decision record prohibits, or reaches across a service boundary the team ratified as off-limits, or re-implements a pattern that already exists somewhere the agent could not see. The behavior was acceptable. The artifact is not conformant. The Trust Layer never had an opinion about it, because it has no model of those decisions — only a model of acceptable outcomes. Behavioral grading is structurally blind to architectural violation, in the same way a passing test suite is: nothing in it represents the constraint that was broken.
A run can be graded “behaved correctly” and still merge a change that contradicts a decision the team made on purpose. Outcome-acceptable and decision-conformant are different properties, judged on different axes. Proving one says nothing about the other.
This is the same boundary, drawn from a different direction, that separates runtime verification from architectural verification: a system can be proven safe in the moment and still drift away from its own design over time. GitHub’s Trust Layer is not a runtime-safety control — it is an evaluation control, grading whether behavior was correct. But it lands on the same far side of the line: it validates the agent, not the system the agent changed.
Different question, different surface, different time
The two also fire at different moments, which is the practical tell that they are different layers rather than competitors.
GitHub’s grading is post-run. It looks at the traces a run left behind and decides, after the fact, whether the behavior was a real failure or an acceptable path — its job is to cut false alarms in CI so teams stop chasing phantom regressions. Architectural conformance fires at the gate: at the pre-commit hook, on the pull request, in the CI check — on the change itself, before it lands, returning the identical verdict no matter which agent or harness produced it. One grades the journey after arrival; the other decides whether the thing you are about to merge is allowed in.
| | GitHub Trust Layer | Architectural conformance |
|---|---|---|
| Judges | the run’s behavior | the diff (the artifact) |
| Against | learned acceptable outcomes | ratified decisions & constraints |
| Nature | graded / sampled | a deterministic verdict |
| Why | the environment is nondeterministic | the diff and the decisions are fixed |
| Fires | post-run, on traces | at the gate, on the change |
| Misses | architectural violation in a clean run | nothing about it — that is its job |
Architectural conformance is what verification contracts are for: the team’s decisions compiled before generation into machine-evaluable constraints that the change is checked against deterministically, with the provenance to say exactly which decision a diff violated and where it came from.
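As a rough illustration of a contract that carries its own provenance, here is a Python sketch. The `Contract` shape, the ADR identifier, the file paths, and the regex-on-added-lines matching strategy are all assumptions made for the example, not a description of any particular product. The `gate` function returns a process-style exit code, which is how a pre-commit hook or CI check would typically consume it.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    decision_id: str  # hypothetical ADR identifier
    source: str       # hypothetical path to where the decision was recorded
    applies_to: str   # regex matched against the changed file's path
    forbids: str      # regex matched against each added line
    message: str

@dataclass(frozen=True)
class Violation:
    decision_id: str
    source: str
    path: str
    line_no: int
    message: str

def check_change(added_lines, contracts):
    """added_lines maps a file path to a list of (line_no, text) pairs added by the diff."""
    found = []
    for path in sorted(added_lines):  # sorted for a stable, repeatable report
        for line_no, text in added_lines[path]:
            for c in contracts:
                if re.search(c.applies_to, path) and re.search(c.forbids, text):
                    found.append(
                        Violation(c.decision_id, c.source, path, line_no, c.message)
                    )
    return found

def gate(added_lines, contracts) -> int:
    """Print each violation with its provenance; return 1 to block the change, 0 to allow it."""
    violations = check_change(added_lines, contracts)
    for v in violations:
        print(f"{v.path}:{v.line_no}: violates {v.decision_id} ({v.source}): {v.message}")
    return 1 if violations else 0

# Hypothetical contract and change.
CONTRACTS = [
    Contract(
        decision_id="ADR-0007",
        source="docs/adr/0007-billing-boundary.md",
        applies_to=r"^services/checkout/",
        forbids=r"^\s*(from|import)\s+services\.billing\.internal",
        message="checkout must not import billing internals",
    )
]
CHANGE = {
    "services/checkout/cart.py": [(12, "from services.billing.internal import ledger")],
}

exit_code = gate(CHANGE, CONTRACTS)
```

Line-level regexes are a deliberately crude stand-in; a real checker would more plausibly work on a parsed import graph or dependency manifest. What the sketch preserves is the shape of the output: a blocked change names the decision it broke and where that decision lives.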
Both, and the half a Trust Layer can’t cover
GitHub’s post is the right kind of news. It is a major platform putting independent verification around agents and proving, with a real method and an honest metric, that agent validation is becoming infrastructure rather than an afterthought. That direction is correct and the engineering is good. An enterprise running coding agents wants exactly this: a way to know whether a run behaved acceptably without trusting the agent’s own say-so.
It is simply one axis of two. The other axis — whether the code that acceptable run produced still holds the line on the architecture it touched — is not something a behavioral Trust Layer was built to judge, and it does not arrive by making the grader better. A better behavioral model raises the F1-score on the run. It never starts representing the decision the diff violated. That requires a layer that evaluates the artifact against your decisions, deterministically, at the gate. Validate that the agent behaved. Then validate that what it changed still matches what you decided. They are both necessary, and proving the first has never once proven the second.
GitHub grades whether the run behaved acceptably. Something still has to decide whether the diff conforms to your decisions. The first is a score on the agent. The second is a verdict on the architecture — and only one of them keeps the system yours.