We Automate What We Can Verify

Abstract

Code was the first big win for AI agents because code is cheap to check. The same rule decides which business processes compress next, and which stay stuck.

In 1494, a Franciscan friar named Luca Pacioli wrote down the rules for double-entry bookkeeping. Every transaction is recorded twice, in two columns that have to balance. The whole point of the system is that errors become cheap to catch. If the books don’t tie out, something is wrong, and you know immediately. Modern finance is built on top of this. Without it, you cannot scale a business past what one person can hold in their head, because nothing in the system can tell you when you’re wrong.

Cheap to check is the unlock. It always has been.

Coding was the first big win for AI agents for the same reason. We hear plenty of nuanced explanations: engineers know the problem, the tooling is mature, the training data is plentiful. All of that is true. It misses the big why.

Code is cheap to check. The compiler accepts it or rejects it. The tests pass or fail. The type checker has a strong opinion and the linter has a louder one. Every layer of a modern codebase is a machine that returns a fast, unambiguous signal about whether what just got written is correct. That signal is what a model learns from in training and what an agent reacts to at runtime. Strip it out and the loop has nothing to converge on.

The same shape repeats across every domain AI has visibly conquered. Chess, Go, math olympiad problems, protein structures scored against a reference. Each one has a cheap, fast, mechanical answer to “is this correct.”

Where the oracle exists, models get sharp. Where it doesn’t, they get fluent. Different thing.

Easier to Check Than to Do

The technical version of this idea has a name. Jason Wei calls it the asymmetry of verification.¹ Some tasks take far less effort to check than to solve. Sudoku is the canonical example: brutal to solve, trivial to verify. A LeetCode problem with a comprehensive test suite has the same shape. So does “find a smaller arrangement that fits these constraints” when you have a script that measures the arrangement.
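To make the asymmetry concrete, here is a minimal sketch of the cheap side of Sudoku: a checker for a completed grid. The function name and the sample grid are illustrative and not taken from Wei’s essay. Solving a puzzle generally means searching; checking a finished one is a single pass over 27 groups of nine cells.

```python
def is_valid_sudoku_solution(grid):
    """Return True if a completed 9x9 grid satisfies every Sudoku constraint."""
    # Reject anything that is not a full 9x9 grid before looking at values.
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False

    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Every row, column, and 3x3 box must contain exactly the digits 1..9.
    return all(group == digits for group in rows + cols + boxes)


# A valid grid built from a simple shifting pattern, used only as sample input.
SOLVED = [
    [(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
    for r in range(9)
]

print(is_valid_sudoku_solution(SOLVED))
```

The checker never needs to know how the grid was produced, which is why it can sit at the end of any solver, human or model.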

This asymmetry is the gear that makes the whole system turn. Training a model with reinforcement learning needs a clean signal at every step. If checking the answer is cheap, you can take millions of steps cheaply. If checking requires a human reviewer with a PhD and an afternoon, you take very few steps and they’re noisy.

The same asymmetry governs whether an agent can self-correct in production. Agents converge by being wrong and noticing. Noticing requires a verifier. The compiler, the failed test, the rejected database write, the API that returned 400. Without something in the environment that can definitively say “no,” the loop keeps swinging at air and calls every miss a hit.
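The loop described here can be sketched in a few lines. This is an illustration under stated assumptions, not a real agent framework: `run_until_verified`, `scripted_proposer`, and `python_syntax_verifier` are hypothetical names, the proposer replays canned drafts instead of calling a model, and the verifier only checks Python syntax with the built-in `compile`.

```python
from typing import Callable, Optional, Tuple


def run_until_verified(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Propose, check, and retry until the verifier accepts or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    # No candidate was accepted: report failure rather than a guess.
    return None


def python_syntax_verifier(source: str) -> Tuple[bool, str]:
    """A deliberately narrow verifier: does the text compile as Python?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return False, f"line {err.lineno}: {err.msg}"
    return True, ""


def scripted_proposer(drafts):
    """Stand-in for a model: replays canned drafts in order."""
    remaining = iter(drafts)

    def propose(feedback):
        # A real agent would condition its next attempt on `feedback`;
        # this stand-in ignores it.
        return next(remaining)

    return propose


DRAFTS = [
    "def add(a, b) return a + b",          # missing colon, rejected
    "def add(a, b):\n    return a + b",    # accepted
]
result = run_until_verified(scripted_proposer(DRAFTS), python_syntax_verifier)
```

Replace `python_syntax_verifier` with a function that always returns `(True, "")` and the loop accepts the first draft, broken or not. That is the failure mode the paragraph above describes.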

Where Verification Is Cheap, AI Eats

Look at where AI has actually made contact with daily work. The pattern is everywhere once you see it.

Frontend code: you look at the page. If it’s wrong, you see it.² Customer service triage: did the ticket get resolved or did the customer come back angry? Document extraction: does the schema validate, do the totals reconcile? Translation between known formats: does it round-trip cleanly? Forecasting: did the number come true?
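The document extraction case is the easiest to show in code. The sketch below assumes a hypothetical invoice shape with `invoice_id`, `line_items`, and `total` fields. None of these names come from a real schema, and a production check would cover much more. It asks the two questions from the list above: does the structure validate, and do the totals reconcile?

```python
from decimal import Decimal

# Hypothetical schema, chosen only for illustration.
REQUIRED_FIELDS = ("invoice_id", "line_items", "total")


def check_extracted_invoice(doc: dict) -> list:
    """Return a list of problems; an empty list means the extraction passed."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc
    ]
    if problems:
        return problems

    try:
        # Decimal avoids float rounding noise when comparing money amounts.
        line_sum = sum(Decimal(str(item["amount"])) for item in doc["line_items"])
        total = Decimal(str(doc["total"]))
    except (KeyError, TypeError, ArithmeticError) as err:
        return [f"malformed amount: {err!r}"]

    if line_sum != total:
        problems.append(
            f"totals do not reconcile: items sum to {line_sum}, total says {total}"
        )
    return problems


extracted = {
    "invoice_id": "A-1",
    "line_items": [{"amount": "19.99"}, {"amount": "5.01"}],
    "total": "25.00",
}
print(check_extracted_invoice(extracted))
```

The check knows nothing about how the fields were extracted. It only knows what a consistent invoice looks like, which is the same trick double-entry bookkeeping plays.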

In every case, the work isn’t “the model got smart.” The work is “someone built or discovered a cheap, trustworthy way to check the answer.” The model attaches to the check and the loop closes. The places AI feels magical are places where verification was already solved or fell out for free.

Where Verification Is Expensive, You Get Vibes

The flip side is uglier and more common.

Strategy decks. Marketing copy. Performance reviews. Threat models. Most consulting deliverables. Anything whose correctness lives in a reader’s judgment, hours or days after it was produced. The model writes these confidently and quickly. You cannot tell, in the moment, whether what you’re reading is right or merely plausible, because the verifier is “an experienced human reading carefully,” and that verifier doesn’t scale and isn’t always available.

This is also where long-horizon agents quietly fall apart. Each step looks fine in isolation. The composite is subtly off and nobody catches it until much later, when the cost of catching it is much higher. The loop didn’t fail because the model is dumb. It failed because nothing in the environment was capable of saying “no” at step three.

A useful heuristic. If you and a competent colleague would disagree on whether the output is correct, and the disagreement would take a meeting to resolve, you do not have a verifier. You have taste. Taste is real and it matters and it will not run a million times overnight.

Design the Verifier First

If verification is the bottleneck, the interesting question stops being “can AI do this task.” It becomes “can I design a cheap, trustworthy way to check this task.” That is the work. Get that right and the automation question answers itself.

Designing a verifier is not necessarily a technical exercise. It is a definitional one, and it usually predates any AI conversation by a long way. Underwriting has a verifier: did the loan perform. Sales has a verifier: did the deal close. Customer support has several: was the ticket resolved, did the customer come back, did the refund get issued. Manufacturing has a verifier in the literal sense, sitting at the end of the line.

The processes in your business that already have one of these are the processes AI is about to compress. The processes that don’t, the ones whose quality is whatever a senior person says it is on a given Tuesday, are the ones that are going to feel stuck. The instinct will be that the model isn’t smart enough yet. The actual problem is that the work has never been defined sharply enough for anything, human or otherwise, to be measured against it.

This is a tractable problem. Most of the time, the verifier exists implicitly and just hasn’t been made explicit. What does a good version of this output look like? What would make a reviewer reject it? How would we know, six weeks later, whether this decision was the right one? Answering those questions is unglamorous work, and it is the work that determines whether AI does anything for you.

Verification is a bottleneck, not a ceiling. Bottlenecks can be widened. Every domain where someone successfully writes down what “correct” means is a domain that becomes available to automate.

¹ Jason Wei, “Asymmetry of verification and verifier’s law”.
² Alperen Keles makes this point well in “Verifiability is the limit”, arguing that UI dominates AI coding successes because verification is literally looking.
