Thinking out loud about AI agent reliability, verification design, and what the data actually says.
Teams keep asking how to catch weird behavior from production AI agents. The better question: do you have a process? Map it once with a big agent, then point targeted agents at the steps you already run.
Fred Brooks said "plan to throw one away" in 1975, and it's truer now than ever. AI agents drive the cost of code toward zero, which finally makes the first version disposable. The code is disposable. The understanding is not.
Is model choice the most important thing in getting good results from agents? For me, no. The harness around the model is doing the work. Each stage produces a verification surface.
I open-sourced a small ML utility 11 days ago. 3,000 people are using it. ~100 pulls are coming from teams extending it for NVIDIA Blackwell silicon. If you're sitting on something useful, share it.
Does the verification topology generalize? I ran the same 11 gates unchanged across three medical imaging models. Rejection rates scaled cleanly with model weakness: 4.8%, 11%, 93%.
Define a model in JSON, train on your Mac, ship to a cloud GPU. No code changes. Open source: mixlab.
Good managers delegate tasks. Great managers delegate outcomes. The same is true for managing AI agents. Define the gates that do the managing, while you do the leading.
Tasks that fail early and get revised have half the downstream failure rate. The most expensive thing a pipeline can do is let bad work through early.
I build by asking questions, not by issuing commands. Four questions from my Claude Code logs that make the biggest difference.
As agents write more code, we're trading tech debt for cognitive debt. Two strategies to stay connected to code you didn't write.
Low overlap doesn't mean you're covered. Map your error types to your gates — the empty cells are where your next investment should go.
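A minimal sketch of that mapping, assuming you log which gate caught which error. The error types, gate names, and catch records below are hypothetical placeholders, not data from my pipelines.

```python
# Hypothetical error types and gates; swap in your own.
ERROR_TYPES = ["omission", "hallucinated_fact", "destructive_command"]
GATES = ["schema_check", "unit_tests", "human_review"]

# Hypothetical log of (gate, error_type) pairs: which gate caught what.
CATCHES = [
    ("schema_check", "omission"),
    ("unit_tests", "omission"),
    ("human_review", "hallucinated_fact"),
]

def coverage_matrix(catches, error_types, gates):
    """Return {error_type: {gate: count}}, with a zero in every empty cell."""
    matrix = {e: {g: 0 for g in gates} for e in error_types}
    for gate, error in catches:
        matrix[error][gate] += 1
    return matrix

def empty_cells(matrix):
    """List (error_type, gate) pairs where the gate never caught that error."""
    return [(e, g) for e, row in matrix.items() for g, n in row.items() if n == 0]

def uncovered(matrix):
    """List error types that no gate has caught at all."""
    return [e for e, row in matrix.items() if not any(row.values())]

matrix = coverage_matrix(CATCHES, ERROR_TYPES, GATES)
gaps = uncovered(matrix)
print(gaps)
```

An error type with an entirely empty row is the loudest signal. Individual empty cells are only candidates, since not every gate is meant to catch every error.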
Adding more reviewers doesn't help if they're all looking at the same thing. Checks at different stages catch fundamentally different errors.
Coding agents don't make random mistakes. 91% of failures are predictable — systematic errors and omissions that compound through every stage of the pipeline.
A coding agent issued a terraform destroy in dev. The fix wasn't better reviewers — it was a deterministic gate that routes only what matters to humans.
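One way such a gate could look, as a sketch only: a fixed pattern list decides whether a command runs automatically or goes to a person. The patterns and the `route` function are assumptions for illustration, not the gate from the incident.

```python
import re

# Hypothetical deny-list of destructive command patterns; extend for your stack.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bterraform\s+destroy\b"),
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]

def route(command: str) -> str:
    """Return 'human' if the command matches a destructive pattern, else 'auto'."""
    if any(p.search(command) for p in DESTRUCTIVE_PATTERNS):
        return "human"
    return "auto"
```

Because the check is a plain pattern match, the same command always gets the same routing, and humans only see the commands that hit the list.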
A hallucinated company name in a marketing report. The root cause wasn't the model — it was a missing gate early in the pipeline.
Why AI failures propagate — and why the fix is checkpoints, not better models