METR's AI Productivity Studies: Why AI Coding Feels Fast but Measures Slow

Two studies, two different answers

In July 2025, METR published a randomized controlled trial that the AI tooling industry mostly ignored. Experienced open-source developers, working on repositories they knew well, took 19% longer to complete tasks when using early-2025 AI tools than when working without them. Before the trial, the same developers expected the tools to make them 24% faster. The gap between expectation and measurement was roughly 43 percentage points.

In May 2026, METR followed up with a survey of 349 technical workers. The headline numbers went the other direction. Respondents self-reported a median 3x speed gain and a 1.4–2x increase in the value of their work. The same paper included a caveat that is more interesting than the result: earlier research had shown that people overestimate AI's effect on their time spent on tasks by about 40 percentage points on average.

19%: longer to complete tasks in METR's 2025 RCT (measured, not self-reported)

3×: median self-reported speed gain in METR's 2026 survey of 349 technical workers

~40 pp: gap between self-reported speedup and measured speedup, per METR's own caveat

Two studies. One says AI made experienced devs slower. The other says technical workers report a median 3x speed gain. The reconciliation is not in the data. It is in what each study was actually measuring.
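The two headline numbers are also stated in different units, which makes the contrast easy to misread. The sketch below puts them on one scale. It assumes that "19% longer" and "24% faster" are both percent changes in completion time (the reading under which the gap is 43 percentage points); the function name is invented for illustration.

```python
def speed_multiplier_from_time_change(time_change: float) -> float:
    """Convert a fractional change in completion time into a speed multiplier.

    time_change = 0.19 means tasks took 19% longer; -0.24 means 24% less time.
    """
    if time_change <= -1:
        raise ValueError("completion time cannot shrink by 100% or more")
    return 1 / (1 + time_change)


# Measured in the RCT: tasks took 19% longer with AI.
measured = speed_multiplier_from_time_change(0.19)

# Expected beforehand: ASSUMED here to mean 24% less time per task.
expected = speed_multiplier_from_time_change(-0.24)

# Gap between expectation and measurement, in percentage points of time change.
gap_pp = round((0.24 - (-0.19)) * 100)

print(f"measured speed multiplier: {measured:.2f}x")
print(f"expected speed multiplier: {expected:.2f}x")
print(f"expectation gap: {gap_pp} pp")
```

On this scale the RCT result is a speed multiplier below 1x, which is the figure to hold against the survey's self-reported 3x.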

For anyone making decisions about AI adoption, the gap between these findings matters more than either of them in isolation. It tells you which productivity claims to trust, and which infrastructure investments are genuinely paying off.

Two methodologies, two different answers

The RCT and the survey are not just different studies. They are different measurement paradigms.

The 2025 RCT recruited 16 experienced developers working on open-source projects they regularly maintained. Each developer was assigned a set of real tasks from their own repository's backlog, and each task was randomly assigned to one of two arms: AI-allowed or AI-prohibited. The order of arms was randomized. Completion times were measured per task, with developers reporting the time they spent on each one, and the analysis compared task times across conditions while controlling for difficulty.

This is the methodological gold standard for individual productivity measurement: randomization removes selection bias, within-subject design removes between-developer variance, and direct measurement removes self-report bias. The findings have the constraints that gold-standard methods tend to have — small N, narrow population, specific time period — but the internal validity is high.
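To make the within-subject comparison concrete, here is a minimal sketch of one way to estimate a time ratio from randomized task records. It is not METR's analysis: it averages log task times within each developer and arm, takes the difference per developer, and averages those differences, with no adjustment for task difficulty. The record layout and the sample numbers are invented for illustration.

```python
import math
from collections import defaultdict


def estimated_time_ratio(records):
    """Estimate the AI / no-AI completion-time ratio within developers.

    records: iterable of (developer, arm, minutes), where arm is "ai" or "no_ai".
    Returns the geometric mean, across developers, of each developer's
    ratio of geometric-mean task time with AI to without AI.
    """
    log_times = defaultdict(lambda: {"ai": [], "no_ai": []})
    for developer, arm, minutes in records:
        log_times[developer][arm].append(math.log(minutes))

    per_developer = []
    for arms in log_times.values():
        # A within-subject contrast needs tasks in both arms.
        if arms["ai"] and arms["no_ai"]:
            mean_ai = sum(arms["ai"]) / len(arms["ai"])
            mean_no_ai = sum(arms["no_ai"]) / len(arms["no_ai"])
            per_developer.append(mean_ai - mean_no_ai)

    if not per_developer:
        raise ValueError("no developer has tasks in both arms")
    return math.exp(sum(per_developer) / len(per_developer))


# Hypothetical records, NOT data from the study.
records = [
    ("dev_a", "ai", 120), ("dev_a", "no_ai", 100),
    ("dev_b", "ai", 60), ("dev_b", "no_ai", 50),
]
ratio = estimated_time_ratio(records)
print(f"time ratio (AI / no AI): {ratio:.2f}")
```

A ratio above 1 means tasks took longer with AI. Comparing each developer only against themselves is what keeps a fast developer who happened to draw more AI tasks from distorting the estimate.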

The 2026 survey took a different shape. METR distributed a questionnaire to 349 technical workers across software engineering, research, academia, and management. Respondents were asked to retrospectively estimate the value and speed gains they had experienced from AI tools across three time points: March 2025, March 2026, and a forecast for March 2027. Responses were aggregated into medians.

This is a recruitment-friendly, scale-friendly design — but every step introduces error in a known direction. Retrospective recall over-weights recent, vivid experiences. Self-selection brings in respondents with strong opinions. Self-report on speed and value is exactly the variable that prior research has shown to be systematically biased.

METR's own paper says as much. They explicitly write that prior work — including their own RCT — should make readers discount survey-reported productivity estimates. They are not pretending the survey settles the question. They are publishing the survey to map perception, not measurement.

What the perception–measurement gap actually means

The temptation is to pick a side. Either AI really does help and the RCT is an outlier, or AI doesn't help and survey respondents are deluded. Neither framing is right.

A more useful read: the two studies measure different things. The RCT measures delivery throughput — how long it takes a developer to actually complete a task from start to merged commit. The survey measures perceived contribution — how much faster or more valuable the work feels compared to a counterfactual without AI tools.

Both are real. But they don't have to move together, and they don't.

Generation throughput went up. Delivery throughput did not. The time saved on typing was reabsorbed into validation, review, and architectural correction. The 3x feel-good number is anchored to the typing-feels-fast part of the work, because that feedback is immediate and visceral.

Delivery throughput is the full picture: typing + reviewing + validating + debugging + correcting + integrating. AI changes the shape of that distribution. Less time on typing. More time on validating that the generated code does what was asked, that it integrates with existing patterns, that it doesn't introduce subtle defects, that it survives review. The RCT measured the full distribution and found that, for experienced developers on familiar code, the new time spent on validation more than absorbed the time saved on typing: total task time came out 19% longer.
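The arithmetic behind that shift can be sketched with a phase-by-phase breakdown. Every number below is made up to illustrate the mechanism and none of it comes from METR's data; the phase names follow the list above.

```python
PHASES = ("typing", "reviewing", "validating", "debugging", "correcting", "integrating")


def delivery_minutes(phase_minutes):
    """Total delivery time: the sum over every phase, not just typing."""
    missing = set(PHASES) - set(phase_minutes)
    if missing:
        raise ValueError(f"missing phases: {sorted(missing)}")
    return sum(phase_minutes[phase] for phase in PHASES)


def phase_deltas(before, after):
    """Minutes gained (+) or saved (-) in each phase."""
    return {phase: after[phase] - before[phase] for phase in PHASES}


# Hypothetical minutes per phase for one task. Illustrative only.
without_ai = {"typing": 30, "reviewing": 15, "validating": 20,
              "debugging": 15, "correcting": 10, "integrating": 10}
with_ai = {"typing": 10, "reviewing": 25, "validating": 35,
           "debugging": 20, "correcting": 15, "integrating": 10}

change = delivery_minutes(with_ai) / delivery_minutes(without_ai) - 1

for phase, delta in phase_deltas(without_ai, with_ai).items():
    print(f"{phase:<12} {delta:+d} min")
print(f"total change in delivery time: {change:+.0%}")
```

In this invented case typing drops by 20 minutes while the other phases grow by 35 in total, so the task feels faster at the keyboard and still ships later.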

For unfamiliar code, novice developers, or tasks where typing was genuinely the bottleneck, the balance probably tips the other way. The RCT's population, experienced developers on their own repos, is a hard case for AI assistance. The intuition that AI helps more when you know less about the codebase is consistent with both studies once you stop expecting them to give the same answer.

What data teams should actually measure

If you are making decisions about AI tool adoption, the perception–measurement gap is a problem you have to design around. A few practical implications follow.

Perception metric vs delivery metric

| Signal | What it measures | What to do with it |
|---|---|---|
| Self-reported speedup | Engagement, tool fit | Useful, but don't size budget on it |
| PR cycle time | Delivery throughput | Track month-over-month |
| Time-in-review | Validation cost shift | Watch where saved typing time lands |
| Rework rate | Correction cost | Rising rework = drift accumulating |

Treat self-report productivity claims as a perception metric, not a delivery metric. If your team says they feel 3x more productive, that is genuinely useful information about engagement and tool fit — but it is not evidence that the team is shipping 3x more. Don't size budget decisions on it.

Instrument delivery directly. Cycle time from first commit to merged PR, time-in-review, change failure rate, and rework rate are all measurable. They are also boring, which is why teams reach for survey-style productivity metrics instead. The boring metrics are the ones that actually move with infrastructure changes.
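As a rough illustration of how little machinery this needs, the sketch below computes three of those metrics from pull request records. The `PullRequest` fields are a hypothetical schema, not any particular platform's API, and the `reworked` flag assumes your team has already decided what counts as rework (for example, a follow-up fix or revert within some window). Change failure rate is left out because it needs deployment data.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical schema; map these fields from whatever your tooling exports.
    first_commit_at: datetime
    review_requested_at: datetime
    merged_at: datetime
    reworked: bool  # assumed to be labeled upstream by your own rework rule


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def delivery_metrics(prs):
    """Median cycle time, median time-in-review, and rework rate for merged PRs."""
    if not prs:
        raise ValueError("need at least one pull request")
    return {
        "median_cycle_hours": median(
            hours_between(pr.first_commit_at, pr.merged_at) for pr in prs),
        "median_review_hours": median(
            hours_between(pr.review_requested_at, pr.merged_at) for pr in prs),
        "rework_rate": sum(pr.reworked for pr in prs) / len(prs),
    }


# Made-up records for illustration.
prs = [
    PullRequest(datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 15), datetime(2026, 3, 3, 9), False),
    PullRequest(datetime(2026, 3, 4, 9), datetime(2026, 3, 4, 13), datetime(2026, 3, 4, 19), True),
    PullRequest(datetime(2026, 3, 5, 10), datetime(2026, 3, 6, 10), datetime(2026, 3, 7, 10), False),
]
metrics = delivery_metrics(prs)
print(metrics)
```

Medians are used rather than means so that one long-running PR does not dominate the month's number.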

Separate "easy wins" from "hard cases" in your measurement. AI tools clearly help on greenfield code, scaffolding, well-defined refactors, and standalone utilities. They underperform on changes that need deep familiarity with existing patterns, complex multi-system interactions, or non-obvious constraints. Aggregating across both will produce a misleading mean.
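A small sketch shows why the aggregate misleads. The categories and the per-task changes in delivery time are invented; the point is only that opposite effects can cancel in the overall mean.

```python
from collections import defaultdict
from statistics import mean


def mean_change_by_category(tasks):
    """Average fractional change in delivery time, per task category.

    tasks: iterable of (category, change), where change = -0.30 means the
    task was delivered 30% faster with AI and +0.30 means 30% slower.
    """
    grouped = defaultdict(list)
    for category, change in tasks:
        grouped[category].append(change)
    return {category: mean(changes) for category, changes in grouped.items()}


# Hypothetical per-task results. Illustrative only.
tasks = [
    ("greenfield", -0.40), ("greenfield", -0.30),
    ("legacy", 0.30), ("legacy", 0.40),
]

overall = mean(change for _, change in tasks)
by_category = mean_change_by_category(tasks)

print(f"overall mean change: {overall:+.0%}")
for category, change in by_category.items():
    print(f"{category:<11} {change:+.0%}")
```

With these numbers the overall mean shows no effect at all, while each category on its own shows a large one in opposite directions.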

Watch where the saved time goes. When typing time falls and total delivery time doesn't, the time has to be going somewhere. Usually it is going to validation: reading the generated code, checking it against the existing codebase, fixing subtle defects, and resolving review comments. Tracking review time and rework time gives you visibility into the actual cost shift.

For data science teams specifically, the same dynamics show up in notebook-driven workflows. AI-generated cells produce more code faster — and consume more time in debugging, reconciling against existing pipelines, and validating that the analysis is doing what the notebook says it is doing. If your team has adopted AI tools without measuring delivery, you are likely in the same gap.

The infrastructure response

The 19% slowdown is not destiny. It is a measurable cost that gets paid in validation, review, and correction — and the cost can be reduced by infrastructure that makes those steps cheaper.

Three categories of intervention move the needle in practice:

1. Generation-time constraints: Giving the AI tool the conventions, architectural rules, and project context it needs before it writes the code reduces the validation burden afterward. This is the core idea behind shift-left governance for AI-generated code: catch violations at generation rather than review.

2. Deterministic validation: Lint rules, type checks, integration tests, and architectural fitness functions catch a class of issues that human review is bad at and AI tools are inconsistent at. Investing in this layer pays back disproportionately when AI generates more code.

3. Honest measurement: Instrumenting cycle time, review time, and rework rate makes it possible to see whether the tools are paying off. Without measurement, perception fills the gap, and perception, as METR has shown, is unreliable.
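For the second category, an architectural fitness function can be as small as a check that one layer never imports another. The sketch below uses Python's standard `ast` module. The layer name `app.infrastructure` and the sample module are hypothetical, and relative imports are not resolved to absolute names, so treat this as a starting point rather than a complete rule.

```python
import ast


def forbidden_imports(source: str, forbidden_prefix: str) -> list[str]:
    """Return imported module names in `source` that fall under forbidden_prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; relative imports
            # are not resolved here.
            names = [node.module] if node.module else []
        else:
            continue
        found.extend(
            name for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return sorted(found)


# Hypothetical domain-layer module that reaches into the infrastructure layer.
domain_module = """
import json
from app.infrastructure.db import Session
import app.domain.orders
"""

violations = forbidden_imports(domain_module, "app.infrastructure")
print(violations)
```

Run as a test over every file in the protected layer, a check like this gives the same verdict on every run, whether the code came from a person or a tool.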

The teams that close the perception–measurement gap will be the ones that treat AI productivity as an engineering problem rather than a vibe. The teams that don't will spend the next two years discovering, slowly, that their 3x feeling was a 1x measurement.

Architecture is not a documentation problem. With AI agents in the loop, it is a runtime governance problem — and the cost of not solving it shows up in exactly the gap METR has been measuring.
