The Bottleneck Has Moved. For Academic Founders in 2026, the Model Is Not It.
Faraz Rizvi is a UK operator-practitioner writing about the work between a research breakthrough and a fundable company. He runs SpinUp Forge.
Something shifted between early 2025 and early 2026, and most founding teams I speak to have not yet updated their working model of what the tools can do. The question worth asking is not whether AI is useful; almost everyone in this audience has already answered that. The question is whether the playbook you are running was calibrated against a tool generation that no longer exists.
The shape of the eight-month drift in the underlying instrument is easier to read on a timeline than as a sequence of citations. The three METR data points below sit in temporal relation to each other; the spacing is the point.
- July 2025: RCT, practitioners 19% slower using AI tools on familiar codebases.
- February 2026: Update, developers now sped up; 30–50% declining to attempt tasks without AI.
- May 2026: Survey, retrospective value 1.3x (March 2025) vs 2x (March 2026).
The rate-of-change case, in one paragraph
METR measured a 19% slowdown in July 2025 and a 2x retrospective value by May 2026; that span is the evidence.
METR's July 2025 randomised controlled trial on experienced open-source developers produced a result that received wide circulation: practitioners using AI tools on familiar codebases were, on average, 19% slower than those who were not. Then, in February 2026, METR stated publicly that they believed developers were now sped up compared to those 2025 estimates, and noted that 30 to 50 percent of developers were choosing not to submit certain tasks because they no longer wanted to attempt them without AI. By May 2026, METR's survey of the same population, using the same instrument and the same wording, showed retrospective value ratings of 1.3x in March 2025 and 2x in March 2026. METR are explicit that survey self-reports diverge from measured reality; they documented that gap themselves. What the retrospective does claim, and this is harder to attribute to optimism, is that the same instrument pointed at the same population across two different periods returns a materially different reading. That is a direction signal, not a productivity claim. The tool surface has moved fast.
This is not an engineering story
The METR population was software developers; the academic founder is not that population, and the work is not that work.
Here is the pivot the METR data requires. All three METR studies measure software developers doing engineering work. If you are reading this as an academic founder, you are not the population they studied, and the work this series is advocating is not the work they measured.
The rate-of-change signal matters here as context, not as instruction. What the 2025 slowdown documented was the coordination cost of importing AI into a domain you already controlled well. What the 30 to 50 percent scope-expansion finding from METR's February 2026 update describes is something different: experienced practitioners taking on tasks they would not previously have attempted without the tool. The ceiling of what they were willing to try had moved. That second finding is the bridge to the operational context.
A one-to-three person spinout cannot shed headcount it does not have. The substitution frame, AI replacing a task someone did well, does not apply to a founding team where half the necessary work is simply not getting done. What applies is scope expansion: the surface of work the same team can credibly take on has grown, and the work that matters is operational, not engineering. Designing the monthly investor update as a codified procedure with named inputs and a named review gate. Maintaining a financial model that stays current rather than being rebuilt from scratch at each diligence conversation. Running a rolling customer-discovery synthesis rather than a one-time ICURe write-up that decays in a shared folder. None of that is a coding task. All of it is now within reach of a founder who is willing to write the procedure.
The model is not the bottleneck
A subscription that fits pre-Series A cash flow reaches the full agentic surface; what is scarce is the architecture built around it.
By May 2026, the operator-grade agentic surface is no longer scarce on any axis a one-to-three person founding team is likely to hit first. Multi-hour autonomous sessions, file-system access, integration with the tools a spinout already uses: all of this is reachable on a subscription that fits a pre-Series A cash flow. The model capability question, for the work that matters to a small founding team, is largely answered.
Nate B. Jones, writing on 9 May 2026, put the progression precisely: the bottleneck has moved from "is the model smart enough?" to "do you understand the harness?" and then one layer further in from there. The harness, in 2026, is accessible. What is not answered is whether you have built anything around the model that uses its capability in a repeatable way.
What is scarce is the operating architecture around the model. Typed workflows with named inputs and named outputs. A knowledge layer the agent can actually reach: IP register, board pack history, customer notes, financial model assumptions, structured so that retrieval is deterministic rather than approximate. An evaluation habit, even a minimal one. And a trust contract: a stated position on what the agent may write or send without a human reading it. The model is a commodity input. The gap is the architecture that would let a small founding team use it at cadence.
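To make "typed workflow" and "trust contract" concrete, here is a minimal sketch in Python. Every name in it (Workflow, Authority, blockers, the input labels) is hypothetical and chosen for illustration; it is one possible way to write the idea down, not a prescribed schema or any particular tool's format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set, Tuple


class Authority(Enum):
    """Trust contract: what the agent may do without a human reading it."""
    DRAFT_ONLY = "draft_only"
    SEND_AFTER_REVIEW = "send_after_review"
    SEND_UNREVIEWED = "send_unreviewed"


@dataclass(frozen=True)
class Workflow:
    name: str
    inputs: Tuple[str, ...]         # named inputs the agent must be able to reach
    outputs: Tuple[str, ...]        # named artefacts a run must produce
    authority: Authority            # stated position on review
    reviewer: Optional[str] = None  # named review gate, if any

    def blockers(self, reachable: Set[str]) -> List[str]:
        """Return reasons this workflow cannot run yet; an empty list means ready."""
        found = [f"input not reachable: {i}" for i in self.inputs if i not in reachable]
        if self.authority is not Authority.SEND_UNREVIEWED and not self.reviewer:
            found.append("no named reviewer for a reviewed workflow")
        return found


# Hypothetical example: the monthly investor update as a codified procedure.
investor_update = Workflow(
    name="monthly_investor_update",
    inputs=("financial_model", "customer_notes", "kpi_baseline"),
    outputs=("investor_update_draft",),
    authority=Authority.SEND_AFTER_REVIEW,
    reviewer="founder",
)

# The knowledge layer currently exposes only two of the three named inputs.
print(investor_update.blockers({"financial_model", "customer_notes"}))
```

The point of the sketch is that the gap becomes visible before the agent runs: a missing input or an unnamed reviewer shows up as a blocker rather than as a vague sense that the output cannot be trusted.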
The rate-of-change signal that the METR data carries has an operating-economics consequence one layer downstream: token spend is replacing headcount as the unit through which a small founding team gets more work done. A founding team that cannot afford the third hire can afford the model that does what the third hire was meant to do, and the constraint that bites is the cadence of using it well.
Tom Blomfield, in a 2026 YC talk on building self-improving companies, names the same shift more bluntly, "burn tokens, not headcount", and reports five times the revenue-per-employee at YC demo days compared with eighteen months earlier, while flagging the metric as "obviously dumb and gameable at the extreme, but directionally correct". Garry Tan, in the YC Lightcone "Tokenmaxxing" episode, extends the same framing beyond software: "every thing that we would call knowledge work could be token maxed", with the operator supplying intent and the machine supplying execution. The framing emerged from a US software seed ICP on a demo day, Series A, Series B cadence, and a UK academic spinout should read it with care: the binding constraints in the spinout's first eighteen months are cash runway, the TTO timeline, and the grant cycle, not headcount fungibility. Within those constraints, however, the substitution is real: the work of the operator the spinout cannot afford to hire is precisely the work the model can carry, when the procedures are written down.
AI-readiness, read this way, is a capability claim, not a product claim. What discriminates is not which tools a founding team lists on a slide. It is whether they have built something that runs repeatably next month, and the month after. The distinction between a team that has done this and one that has not shows up in the quality of the artefacts before it shows up in anything the founders say.
The three-layer test
Access, meaning, and authority: three questions most spinout founders have not answered in writing.
Jones, writing three days earlier in "AI Work Primitives: Access vs Meaning" (6 May 2026), proposed a three-layer architecture: access, meaning, and authority. Access is whether the tool can reach the input. Meaning is whether it understands what the input is for. Authority is what you will let the tool write or decide without your review.
Apply those layers to the monthly board pack a seed-stage spinout sends its early investors. Access: can the tool reach the finance model, the customer notes, the runway spreadsheet, or does the founder copy-paste each one into a prompt window every time? Meaning: does the tool know the ARR movement figure matters in light of the prior quarter's forecast, not just as a standalone number? Authority: what will the founder not allow it to produce without reading line by line, and have they stated that clearly, or is it an implicit anxiety that makes the review process slower than doing it by hand?
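One way to force those three answers into writing is to keep them as a small record per workflow and check which layers are still blank. The sketch below is illustrative only: LayerAnswers, its field names, and the example text are hypothetical, and the layer names are simply Jones's three labels used as field names.

```python
from dataclasses import dataclass
from typing import List

LAYERS = ("access", "meaning", "authority")


@dataclass
class LayerAnswers:
    """Written answers to the three-layer test for one workflow."""
    workflow: str
    access: str = ""     # how the tool reaches each input
    meaning: str = ""    # what each input is for, and relative to what
    authority: str = ""  # what the tool may produce without line-by-line review

    def unanswered(self) -> List[str]:
        """Layers that have no written answer yet."""
        return [layer for layer in LAYERS if not getattr(self, layer).strip()]


# Hypothetical state for a board pack: access is described, the rest is not.
board_pack = LayerAnswers(
    workflow="monthly_board_pack",
    access="Agent reads the finance model, customer notes and runway spreadsheet directly.",
)

print(board_pack.unanswered())  # ['meaning', 'authority']
```

A blank field here is the written-down version of the implicit anxiety described above: the review is slow because the position was never stated.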
Most spinout founders I have spoken to have not answered any of those questions in writing. That is the gap. Not the model. Not even the harness, in 2026. The workflow design discipline that would let a founding team capture the scope expansion the METR data describes is what is missing.
What is now writeable
In 2026 a non-engineer assembles the operating architecture conversationally; the bottleneck is the precision of the description.
What changed between 2025 and 2026 is reachability. Building this kind of operating architecture used to require engineering taste. In 2026, a non-engineer can assemble the surface conversationally, because the agent helps build it. The bottleneck is no longer "can I get the harness running." It is "can I describe a procedure precisely enough that an agent can execute it twice the same way."
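"Twice the same way" can be checked rather than assumed. The sketch below runs a procedure more than once on the same inputs and compares fingerprints of the structured output. It is a minimal illustration under stated assumptions: runway_summary is a hypothetical, deterministic stand-in for whatever an agent would actually execute, and the figures are invented for the example. For real agent output, comparing only the structured fields (numbers, dates, named sections) is likely more useful than comparing free text word for word.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def fingerprint(output: Any) -> str:
    """Hash a JSON-serialisable output in a key-order-independent way."""
    canonical = json.dumps(output, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def runs_the_same_way(
    procedure: Callable[[Dict[str, Any]], Any],
    inputs: Dict[str, Any],
    runs: int = 2,
) -> bool:
    """True if every run on identical inputs yields an identical output."""
    prints = {fingerprint(procedure(dict(inputs))) for _ in range(runs)}
    return len(prints) == 1


def runway_summary(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for an agent-executed step (illustrative only)."""
    months = inputs["cash_gbp"] // inputs["monthly_burn_gbp"]
    return {"runway_months": months, "headline": f"{months} months of runway"}


# Invented example figures, not data from any real company.
print(runs_the_same_way(runway_summary, {"cash_gbp": 240000, "monthly_burn_gbp": 20000}))
```

If the check fails, the usual culprit is the description rather than the model: an input left unnamed, or a step that leaves room for interpretation.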
That is a writing problem. Academic founders, of all populations, are equipped for it.
The investor conversation is shifting to reflect this. The question has moved from "do you use AI?" to something closer to "what does your team produce reproducibly, and what does the production chain look like?" Innovate UK's March 2026 restructuring to a Velocity account-management model, positioning IUK as a "trusted due diligence engine" for deep-tech companies, names this shift at programme level: the assessment has moved from technology alone to operational maturity alongside it (UKRI announcement, March 2026). A founding team that can answer the second question concretely has changed what the investor is evaluating. The shift is from a proxy measure (runway times headcount) to something more like evidenced operational capacity: what the team demonstrably ships, on cadence, with and without the tools visible.
The procedure, once written, is also what survives the next model upgrade. A standing architecture that knows the company's IP narrative, the lead investor's stylistic preferences, the quarterly KPI baseline: that is not a chat habit. It accretes. Each workflow codified once makes the next one cheaper to build. The tool generation will change; the procedure outlasts it.
What comes next
Piece 3 shows what the substrate looks like on a specific Tuesday morning, in the artefacts it produces and the time it returns.
Piece 3 returns to what this looks like in operational detail: the six workflows a one-to-three person spinout actually needs across its first 18 months, the before-and-after on a single Tuesday morning, and what an investor looking at a seed-stage spinout in 2027 is likely to want to see that goes beyond the team slide. The artefacts and the investor conversation are the ground-level argument. This piece is the framing for why that argument is worth taking seriously in 2026 rather than 2027.
The procedure is what survives
The model generation will change; the procedure that uses it, once written, outlasts each upgrade.
The model is not the constraint. What is now scarce is the operational discipline to use it at cadence: named procedures, a knowledge layer, a stated position on review. The founding teams that are ahead on this in two years will not be ahead because they used a particular model. They will be ahead because they wrote the procedures early enough to have built something with real depth: a substrate that knows the company, runs on schedule, and gets cheaper each time one more workflow is codified.
The direction signal from METR's own data is clear. The surface that was measured as limiting in early 2025 is not the surface that exists in May 2026. The procedure-first argument was not wrong then. It runs on present-tense evidence now.
Sources
- METR: Early 2025 AI Experienced OS Dev Study (July 2025)
- METR: We Are Changing Our Developer Productivity Experiment Design (February 2026)
- METR: Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity (May 2026)
- Nate B. Jones, Codex Plugins: Why the AI Bottleneck Moved to Workflow (9 May 2026)
- Nate B. Jones, AI Work Primitives: Access vs Meaning (6 May 2026)
- UKRI, New plan to help the next generation of tech businesses thrive (Velocity, March 2026)
- Y Combinator, Tom Blomfield, How to Build a Self-Improving Company with AI
- YC Lightcone, Garry Tan, Tokenmaxxing: How Top Builders Use AI To Do The Work Of 400 Engineers
Sourcing notes: The rate-of-change argument relies on METR's own framing of the drift between their 2025 RCT and their 2026 survey and update post. METR's self-skepticism about self-report divergence from measured reality is carried intact; no productivity claim is made from survey figures alone. The February 2026 update and the May 2026 survey are cited separately throughout because they do different jobs: the February post is the acknowledgement and the scope-expansion finding; the May post is the retrospective comparison. The three-layer architecture (access / meaning / authority) is Jones's framing, applied here to a spinout board-pack context not discussed in the original post. The Blomfield "burn tokens, not headcount" quotation carries Blomfield's own caveat on the metric being "obviously dumb and gameable at the extreme, but directionally correct"; the line is not presented as institutional YC position. The Tan "knowledge work could be token maxed" line is a verbatim extract from the YC Lightcone "Tokenmaxxing" episode and is read here as scope-expansion guidance, not as a literal productivity multiplier for UK academic spinouts. The IUK Velocity claim is from the UKRI March 2026 announcement.