The Bottleneck Has Moved. For Academic Founders in 2026, the Model Is Not It.

Faraz Rizvi × Foundry · 14 May 2026 · Piece 2 of 4 · 11 min read

Faraz Rizvi is a UK operator-practitioner writing about the work between a research breakthrough and a fundable company. He runs SpinUp Forge. Foundry is SpinUp Forge's custom agentic harness.

If you are calibrating your founding team's tooling decisions against the AI picture from 2025, you are working from a plan that has already expired. That is not because you missed the announcements. It is because the readings drifted fast enough from one measurement to the next to make the working assumption wrong. The ten months between July 2025 and May 2026 produced a direction signal, not a settled answer. The question worth asking is whether that signal has moved the actual bottleneck. For academic founders in 2026, I think it has moved past the model entirely.

The timeline is the argument. Read the spacing as a rate.

METR direction signal, tool surface drift, 2025 to 2026

July 2025: RCT, practitioners 19% slower using AI tools on familiar codebases.

February 2026: Update, developers now sped up; 30–50% declining to attempt tasks without AI.

May 2026: Survey, retrospective value 1.3x (March 2025) vs 2x (March 2026).

The ten-month spacing between the July 2025 RCT and the May 2026 survey is the rate-of-change signal. METR's own self-skepticism, that survey self-reports diverge from measured reality, is carried intact through the prose below.

The rate-of-change case, in one paragraph

METR measured a 19% slowdown in July 2025 and a 2x retrospective value by May 2026; that span is the evidence.

Has the tool surface actually changed, or does it just feel that way? METR's July 2025 randomised controlled trial found practitioners using AI tools on familiar codebases were, on average, 19% slower than those who were not. The story most people took from that result was "AI doesn't help yet." The story METR were telling was that coordination cost dominates when you import a new tool into a domain you already control. Then, by May 2026, their survey of the same population, using the same instrument and the same wording, showed retrospective value ratings of 1.3x in March 2025 and 2x in March 2026. METR are explicit that self-reports diverge from measured reality — they documented that gap themselves. What the retrospective does claim, and this is harder to attribute to optimism, is that the same instrument pointed at the same population across two periods returns a materially different reading. That is a direction signal, not a productivity claim. (Secondary data from the February 2026 update — including the 30 to 50 percent scope-expansion finding — is in the Evidence note below.)

METR self-reported retrospective value: the same instrument, a year apart. Source: METR, May 2026 survey (N=349). Data & provenance: figE-metr-retrospective.provenance.md.

This is not an engineering story

The METR data is about software developers on engineering work; the lesson for an academic founder is scope expansion, not substitution.

Should this data change anything about how your founding team operates? The answer is yes — but only if you read the right lesson from it. All three METR studies measure software developers doing engineering work. If you are reading this as an academic founder, you are not the population they studied, and the work this series is about is not the work they measured. The direction signal transfers; the specific numbers do not.

The finding that applies to you is not the slowdown and not the speed-up. It is what METR's February 2026 update named as scope expansion: experienced practitioners taking on tasks they would not previously have attempted without the tool. The ceiling of what they were willing to try had moved. A one-to-three person spinout cannot shed headcount it does not have. The substitution frame does not apply. What applies is reach. Designing the monthly investor update as a codified procedure with named inputs and a named review gate. Maintaining a financial model that stays current rather than being rebuilt from scratch at each diligence conversation. Running a rolling customer-discovery synthesis rather than a one-time ICURe write-up that decays in a shared folder (ICURe is the UKRI programme that funds academics to test commercial hypotheses). None of that is a coding task. All of it is now within reach of a founder who is willing to write the procedure.

The model is not the bottleneck

The operator-grade agentic surface is no longer scarce; what is scarce is the architecture built around it.

Is the model itself the constraint on your founding team right now? By May 2026, the answer is almost certainly not. Multi-hour autonomous sessions, file-system access, integration with the tools a spinout already uses: all of this is reachable on a subscription that fits a pre-Series A cash flow. The model capability question, for the work that matters to a small founding team, is largely answered.

The obvious counter is: the models will keep improving — wait, calibrate later. I understand the instinct. It is wrong in one specific way: the procedure you write around a 2026 model does not become obsolete when a 2027 model arrives. It becomes cheaper to run. Waiting is not neutral. It means building nothing that compounds.

Nate B. Jones, writing on 9 May 2026, put the progression precisely: the bottleneck has moved from "is the model smart enough?" to "do you understand the harness?" and then one layer further in from there. What is scarce is the operating architecture. Typed workflows with named inputs and named outputs. A knowledge layer the agent can actually reach — IP register, board pack history, customer notes, financial model assumptions — structured so that retrieval is deterministic rather than approximate. An evaluation habit. A trust contract: a stated position on what the agent may write or send without a human reading it first. The model is a commodity input. The gap is the architecture that would let a small founding team use it at cadence.
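To make "typed workflow" and "trust contract" concrete, here is a minimal sketch in Python. It is an illustration under my own assumptions, not a description of how Foundry or any particular harness represents these things: the `Workflow` and `Trust` names, the field layout, and the example inputs are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    """Stated position on what the agent may do with an output (hypothetical levels)."""
    DRAFT_ONLY = "draft_only"            # a human reads every line before use
    SEND_AFTER_REVIEW = "after_review"   # a human approves, then the agent sends
    AUTONOMOUS = "autonomous"            # the agent may write or send unreviewed


@dataclass
class Workflow:
    name: str
    inputs: tuple[str, ...]     # named inputs the agent must be able to reach
    outputs: dict[str, Trust]   # named outputs, each with a stated trust level

    def missing_inputs(self, knowledge_layer: set[str]) -> list[str]:
        """Return the named inputs the knowledge layer cannot currently supply."""
        return [name for name in self.inputs if name not in knowledge_layer]

    def needs_human(self, output: str) -> bool:
        """True unless the trust contract explicitly allows unreviewed output."""
        return self.outputs[output] is not Trust.AUTONOMOUS


# Hypothetical example: the monthly investor update.
investor_update = Workflow(
    name="monthly_investor_update",
    inputs=("financial_model", "customer_notes", "ip_register"),
    outputs={"draft_update": Trust.DRAFT_ONLY, "kpi_table": Trust.AUTONOMOUS},
)

reachable = {"financial_model", "customer_notes"}
missing = investor_update.missing_inputs(reachable)
print(missing)                                      # ['ip_register']
print(investor_update.needs_human("draft_update"))  # True
```

The point of the sketch is that the trust position is written down per output, so the review gate is a lookup rather than a mood.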

Tom Blomfield, in a 2026 YC talk on building self-improving companies, named the same shift: "burn tokens, not headcount." He reports five times the revenue-per-employee at YC demo days compared with eighteen months earlier, flagging the metric as "obviously dumb and gameable at the extreme, but directionally correct." The token-spend frame replaces headcount as the unit through which a small founding team gets more work done — a founding team that cannot afford the third hire can afford the model that does what the third hire was meant to do. The constraint that bites is the cadence of using it well. (Garry Tan's extension of this framing to all knowledge work is in the Evidence note.)

The three-layer test

Access, meaning, and authority: three questions most spinout founders have not answered in writing.

What does the gap actually look like in practice? Jones, writing three days earlier in "AI Work Primitives: Access vs Meaning" (6 May 2026), named three layers: access, meaning, and authority. Access is whether the tool can reach the input. Meaning is whether it understands what the input is for. Authority is what you will let the tool write or decide without your review.

Apply those layers to the monthly board pack a seed-stage spinout sends its early investors. Access: can the tool reach the finance model, the customer notes, the runway spreadsheet — or does the founder copy-paste each one into a prompt window every time? Meaning: does the tool know the ARR movement figure matters in light of the prior quarter's forecast, not just as a standalone number? Authority: what will the founder not allow it to produce without reading line by line, and have they stated that clearly — or is it an implicit anxiety that makes the review process slower than doing it by hand?
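One way to answer those three questions in writing is to record them per input and list what is still blank. The sketch below assumes a simple structure of my own invention (`LayerAnswers`, `unanswered`, and the example answers are hypothetical); Jones's framing names the layers, not this representation.

```python
from dataclasses import dataclass


@dataclass
class LayerAnswers:
    access: str      # how the tool reaches the input
    meaning: str     # what the input is for
    authority: str   # what the tool may produce without line-by-line review


def unanswered(answers: LayerAnswers) -> list[str]:
    """Return the layers that have no written answer yet, in declaration order."""
    return [layer for layer, text in vars(answers).items() if not text.strip()]


# Hypothetical board-pack inputs for a seed-stage spinout.
board_pack = {
    "finance_model": LayerAnswers(
        access="read-only link to the shared finance folder",
        meaning="ARR movement is read against the prior quarter's forecast",
        authority="",
    ),
    "customer_notes": LayerAnswers(access="", meaning="", authority=""),
    "runway_spreadsheet": LayerAnswers(
        access="exported to CSV on the first of each month",
        meaning="months of cash left at current burn",
        authority="figures may be copied into the pack; commentary is drafted for review",
    ),
}

gaps = {name: unanswered(answers) for name, answers in board_pack.items()}
for name, layers in gaps.items():
    print(name, layers)
```

An empty list against every input is the written answer this section argues most founders do not yet have.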

Most spinout founders I have spoken to have not answered any of those questions in writing. That is the gap. Not the model. Not even the harness, in 2026. The workflow design discipline that would let a founding team capture the scope expansion the METR data describes — that is what is missing.

What is now writeable

In 2026, a non-engineer assembles the operating architecture conversationally; the bottleneck is the precision of the description.

What changed between 2025 and 2026 is reachability. Building this kind of operating architecture used to require engineering taste. In 2026, a non-engineer assembles the surface conversationally — the agent helps build it. The bottleneck is no longer "can I get the harness running." It is "can I describe a procedure precisely enough that an agent can execute it twice the same way."
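"Twice the same way" can be checked mechanically for the structured part of an output. The sketch below is an assumption-laden illustration: it fingerprints a structured artefact from two runs on the same inputs and compares them. The function names and the figures are hypothetical stand-ins, and an agent's free prose would need a looser comparison than a hash.

```python
import hashlib
import itertools
import json
from typing import Callable


def fingerprint(artefact: dict) -> str:
    """Hash a structured artefact in a canonical, key-order-independent form."""
    canonical = json.dumps(artefact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_way_twice(procedure: Callable[[dict], dict], inputs: dict) -> bool:
    """Run the procedure twice on the same inputs and compare the fingerprints."""
    return fingerprint(procedure(inputs)) == fingerprint(procedure(inputs))


def runway_summary(inputs: dict) -> dict:
    """Stand-in for a codified procedure: named inputs in, named outputs out."""
    months = inputs["cash_gbp"] / inputs["monthly_burn_gbp"]
    return {"runway_months": round(months, 1), "as_of": inputs["as_of"]}


_draft_counter = itertools.count()


def drifting_summary(inputs: dict) -> dict:
    """Stand-in for an under-specified procedure whose output varies per run."""
    return {**runway_summary(inputs), "draft_number": next(_draft_counter)}


# Hypothetical figures, for illustration only.
inputs = {"cash_gbp": 240_000, "monthly_burn_gbp": 20_000, "as_of": "2026-05-01"}

stable = same_way_twice(runway_summary, inputs)
drifting = same_way_twice(drifting_summary, inputs)
print(stable, drifting)  # True False
```

A procedure that fails this kind of check is usually missing a named input or an explicit rule, which is a description problem rather than a model problem.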

That is a writing problem. Academic founders, of all populations, are equipped for it.

The investor conversation is shifting to match. Innovate UK's March 2026 restructuring to a Velocity account-management model — positioning IUK as a "trusted due diligence engine" for deep-tech companies — names this shift at programme level (UKRI announcement, March 2026): the assessment has moved from technology alone to operational maturity alongside it. The question has moved from "do you use AI?" to something closer to "what does your team produce reproducibly, and what does the production chain look like?" A founding team that can answer the second question concretely is being evaluated on evidenced operational capacity — what they demonstrably ship, on cadence — not on a proxy measure of runway times headcount.

The procedure, once written, also survives the next model upgrade. A standing architecture that knows the company's IP narrative, the lead investor's stylistic preferences, the quarterly KPI baseline is not a chat habit. It accretes. Each workflow codified once makes the next one cheaper to build.

What comes next

Piece 3 shows what the substrate looks like on a specific Tuesday morning, in the artefacts it produces and the time it returns.

Piece 3 returns to what this looks like in operational detail: the six workflows a one-to-three person spinout actually needs across its first 18 months, the before-and-after on a single Tuesday morning, and what an investor looking at a seed-stage spinout in 2027 is likely to want to see that goes beyond the team slide. The artefacts and the investor conversation are the ground-level argument. This piece is the framing for why that argument is worth taking seriously in 2026 rather than 2027.

The procedure is what survives

The model generation will change; the procedure that uses it, once written, outlasts each upgrade.

The founding teams that will be ahead on this in two years will not be ahead because they used a particular model. They will be ahead because they wrote the procedures early enough that the substrate knows the company — runs on schedule, accretes depth, gets cheaper each time one more workflow is codified. The model underneath it is a commodity. The procedure is the asset.

If you are still asking whether the tools are good enough: they are. If you are asking whether you have built anything around them that runs next month without you rebuilding it from scratch — that is the question worth answering now.

Sources

Evidence note

The February 2026 scope-expansion finding: METR's February 2026 update stated publicly that developers were now sped up relative to 2025 estimates, and that 30 to 50 percent of developers were declining to submit certain tasks because they no longer wanted to attempt them without AI. This is the scope-expansion finding — the ceiling of what practitioners would attempt had moved upward. It is demoted here because the May 2026 retrospective comparison (the anchor number: 1.3x → 2x) is the more direct direction signal, and both cannot be the anchor in the same section without the body becoming data-dense (METR, February 2026 update).
METR N=349: The May 2026 survey covered 349 respondents. The self-report caveat — that survey responses diverge from controlled-experiment measurements of actual productivity — is METR's own documented position, carried intact throughout.
Garry Tan / Tokenmaxxing: Tan's framing in the YC Lightcone episode extends the token-spend argument beyond software — "every thing that we would call knowledge work could be token maxed" — with the operator supplying intent and the machine supplying execution. Both the Blomfield and Tan framings emerged from a US software seed cadence; a UK academic spinout reads them with care. The binding constraints in the spinout's first 18 months are cash runway, the TTO (technology transfer office — the unit inside a university that turns research into licences and companies) timeline, and the grant cycle, not headcount fungibility. Within those constraints, the substitution is still real (YC Lightcone, 2026).
AI-readiness as a capability claim: The distinction between a team that has built repeatable workflows and one that has not shows up in the quality of artefacts before it shows up in anything the founders say. This is a judgement claim, not a cited figure.
The IUK Velocity six-sector claim: Velocity restructuring covers six priority sectors. The "trusted due diligence engine" framing and account-managed pipeline description are from the UKRI March 2026 announcement and should be cross-checked against implementation updates before re-publication (UKRI, March 2026).
Method and caveats: METR's self-skepticism about self-report diverging from measured reality is carried intact, and no productivity claim is made from survey figures alone; the February and May 2026 posts are cited separately because they do different jobs (acknowledgement and scope-expansion versus retrospective comparison); the access/meaning/authority framing is Jones's, applied here to a board-pack context not in the original; the Blomfield and Tan quotations carry their own caveats and are read as scope-expansion guidance, not literal productivity multipliers for UK academic spinouts.
