Verification as a design input

ai, engineering, verification

Garry Tan wrote recently about the "agent complexity ratchet": AI coding agents have collapsed the cost of comprehensive testing, and 90% coverage — once reserved for avionics and medical devices — is now the default a serious project should aim for. The full piece is worth reading.

His framing starts when you're already coding. Tests appear alongside code; the ratchet runs forward from there. Reading it, I kept landing on a related question: if writing tests is now cheap, what changes about how we design software in the first place?

The economics are worth stating plainly. Engineering organizations adopting AI coding agents are reporting 3x–5x increases in PR output per engineer. Code generation is no longer the bottleneck. Verification is. Reviewing, testing, and gaining confidence that the generated code does what it should, and continues to do so as the codebase evolves, is now where most of the real work lives. Catalini, Hui, and Wu make the same point at economy-wide scale in "Some Simple Economics of AGI" (arXiv:2602.20946): as the cost of machine execution races toward zero, the binding constraint on realized value becomes human verification bandwidth. The constraint shifted. Our habits haven't caught up.

The shift I'd propose is small but consequential. Treat verification as a design input, not a design output.

In practice this means two things.

First, before the LLM writes any code, it asks: how will we know this works, and how will we know it keeps working? Requirements that can't be verified get flagged. Not as testing problems for later — as design problems for now. If you can't tell whether a thing is doing what you wanted, you don't have a finished design; you have a wish.
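To make that first step concrete, here is a minimal sketch of the check in Python. It is an illustration, not what the template ships: the `Requirement` class, its field names, and `flag_unverifiable` are all hypothetical. A requirement is flagged when either of the two questions has no answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Requirement:
    """A requirement plus answers to the two design-time questions (hypothetical shape)."""
    text: str
    works_check: Optional[str] = None          # how will we know this works?
    keeps_working_check: Optional[str] = None  # how will we know it keeps working?


def flag_unverifiable(requirements: list[Requirement]) -> list[Requirement]:
    """Return the requirements that lack an answer to either question."""
    return [
        r for r in requirements
        if not (r.works_check and r.keeps_working_check)
    ]


requirements = [
    Requirement(
        text="Transfers never overdraw an account",
        works_check="property test over random transfer sequences",
        keeps_working_check="same test runs in CI on every change",
    ),
    Requirement(text="The dashboard should feel fast"),
]

flagged = flag_unverifiable(requirements)
for r in flagged:
    # Surfaced as a design problem now, not a testing problem for later.
    print(f"DESIGN PROBLEM: no verification plan for: {r.text}")
```

The second requirement is the "wish": nothing in it says how anyone would tell fast from slow, so it goes back to design before any code is written.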

Second, every change is framed around the contracts it creates or modifies — the things the system must always do, never do, or guarantee under specific conditions. The LLM names them, scores them for criticality and risk, and asks for review on the high-stakes ones. The rest, it declares and ships. Contracts live as files in the repo, versioned in git, reviewable, easy to correct when the LLM gets it wrong.
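One way a contract record could look, sketched in Python. Everything here is an assumption made for illustration: the field names, the 1 to 5 scales, and the review threshold are mine and not the template's actual file format.

```python
from dataclasses import dataclass, asdict
from enum import Enum
import json


class Kind(Enum):
    ALWAYS = "always"        # the system must always do this
    NEVER = "never"          # the system must never do this
    GUARANTEE = "guarantee"  # holds under specific conditions


@dataclass
class Contract:
    id: str
    kind: Kind
    statement: str
    criticality: int  # assumed scale: 1 (cosmetic) .. 5 (severe harm if violated)
    risk: int         # assumed scale: 1 (unlikely to break) .. 5 (likely to break)

    def needs_review(self, threshold: int = 12) -> bool:
        """High-stakes contracts go to a human. The threshold is an arbitrary assumption."""
        return self.criticality * self.risk >= threshold

    def to_json(self) -> str:
        """Serialize so the contract can live as a plain, diffable file in the repo."""
        return json.dumps({**asdict(self), "kind": self.kind.value}, indent=2)


contracts = [
    Contract("C-001", Kind.NEVER, "A transfer never overdraws an account", 5, 3),
    Contract("C-002", Kind.ALWAYS, "Every page renders a footer", 1, 2),
]

for_review = [c.id for c in contracts if c.needs_review()]
declared = [c.id for c in contracts if not c.needs_review()]

print("ask for review:", for_review)
print("declare and ship:", declared)
print(contracts[0].to_json())
```

Multiplying criticality by risk is only one possible scoring rule. The point is that the score is written down next to the contract, so a reviewer can correct it in a normal diff.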

The rigor scales to the stakes. A weekend prototype doesn't need the same treatment as code that moves money. A spike still records its contracts — but doesn't enforce tests for low-stakes ones. When the spike turns into a product, the thinking is already done.

I've put a first iteration in a public template repo: github.com/matteomelani/verification-template. It includes the instructions LLMs read at session start, file formats for contracts and decisions, and pointer files so Claude Code, Cursor, Codex, and Copilot all follow the same approach on a given project.
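To illustrate the pointer-file idea, here is a small Python sketch that writes one stub per tool, each deferring to a single shared instructions file. The shared file name `VERIFICATION.md` is hypothetical, and the per-tool paths are the conventional locations as I understand them; the template's actual layout may differ.

```python
from pathlib import Path
import tempfile

SHARED = "VERIFICATION.md"  # hypothetical name for the shared instructions file

# Conventional per-tool instruction locations (assumed; check each tool's docs).
POINTERS = {
    "Claude Code": "CLAUDE.md",
    "Codex": "AGENTS.md",
    "Cursor": ".cursorrules",
    "Copilot": ".github/copilot-instructions.md",
}


def write_pointers(repo: Path) -> list[Path]:
    """Write one small pointer file per tool, each deferring to SHARED."""
    written = []
    for rel in POINTERS.values():
        path = repo / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Read and follow {SHARED} before writing any code.\n")
        written.append(path)
    return written


# Demonstrate in a throwaway directory rather than a real repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / SHARED).write_text("# Verification instructions\n")
    created = sorted(p.relative_to(repo).as_posix() for p in write_pointers(repo))

print(created)
```

Keeping the stubs this thin means the approach is defined in one place, and switching tools does not fork the instructions.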

This is v0. I'm testing it on a new project now and will report back on what works and what breaks. If you try it before I'm done, let me know what you find.