Michael Rothrock

Writing

Thinking out loud about AI agent reliability, verification design, and what the data actually says.

You Already Have a Process

Teams keep asking how to catch weird behavior from production AI agents. The better question: do you have a process? Map it once with a big agent, then point targeted agents at the steps you already run.

Plan to Throw One Away

Fred Brooks said it in 1975, and it's more true now than ever. AI agents drive the cost of code toward zero, which finally makes the first version disposable. The code is disposable. The understanding is not.

The Model IS the Pipeline

Is model choice the most important thing in getting good results from agents? For me, no. The harness around the model is doing the work, and each stage of that harness produces a verification surface.

Share the Thing

I open-sourced a small ML utility 11 days ago. 3,000 people are using it. ~100 pulls are coming from teams extending it for NVIDIA Blackwell silicon. If you're sitting on something useful, share it.

Same Gates, Three Models

Does the verification topology generalize? I ran the same 11 gates unchanged across three medical imaging models. Rejection rates scaled cleanly with model weakness: 4.8%, 11%, 93%.

I Built an ML Architecture Lab in Go

Define a model in JSON, train on your Mac, ship to a cloud GPU. No code changes. Open source: mixlab.

Delegate Outcomes, Not Tasks

Good managers delegate tasks. Great managers delegate outcomes. The same is true for managing AI agents. Define the gates and let them do the managing while you do the leading.

The Revision Problem

Tasks that fail early and get revised have half the downstream failure rate. The most expensive thing a pipeline can do is let bad work through early.

Questions I Ask Every Agent

I build by asking questions, not by issuing commands. Four questions from my Claude Code logs that make the biggest difference.

Cognitive Debt

As agents write more code, we're trading tech debt for cognitive debt. Two strategies to stay connected to code you didn't write.

The Blind Spot Map

Low overlap doesn't mean you're covered. Map your error types to your gates — the empty cells are where your next investment should go.

Stage Coverage Beats Gate Density

Adding more reviewers doesn't help if they're all looking at the same thing. Checks at different stages catch fundamentally different errors.

Errors Compound Forward

Coding agents don't make random mistakes. 91% of failures are predictable — systematic errors and omissions that compound through every stage of the pipeline.

The Terraform Destroy

A coding agent issued a terraform destroy in dev. The fix wasn't better reviewers — it was a deterministic gate that routes only what matters to humans.

The Missing Gate

A hallucinated company name in a marketing report. The root cause wasn't the model — it was a missing gate early in the pipeline.

Three Robot Bakers

Why AI failures propagate, and why the fix is checkpoints, not better models.