Thinking out loud about AI agent reliability, verification design, and what the data actually says.
Define a model in JSON, train on your Mac, ship to a cloud GPU. No code changes. Open source: mixlab.
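As a rough illustration of what "define a model in JSON" could look like (a hypothetical spec, not mixlab's actual schema), the sketch below writes a small model definition to a file that a config-driven trainer could read on a laptop or a cloud GPU:

```python
import json

# Hypothetical model definition; the field names are illustrative,
# not mixlab's actual schema.
model_spec = {
    "name": "tiny-classifier",
    "layers": [
        {"type": "linear", "in": 784, "out": 256},
        {"type": "relu"},
        {"type": "linear", "in": 256, "out": 10},
    ],
    "optimizer": {"type": "adam", "lr": 0.001},
}

with open("model.json", "w") as f:
    json.dump(model_spec, f, indent=2)
# The same file then drives training wherever it runs: no code changes, just config.
```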
Tasks that fail early and get revised have half the downstream failure rate of tasks that don't. The most expensive thing a pipeline can do is let bad work slip through its early stages.
I build by asking questions, not by issuing commands. Four questions from my Claude Code logs that make the biggest difference.
As agents write more code, we're trading tech debt for cognitive debt. Two strategies to stay connected to code you didn't write.
Low overlap between your checks doesn't mean you're covered. Map your error types to your gates — the empty cells are where your next investment should go.
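To make that mapping concrete, here is a minimal sketch of the matrix I mean, with hypothetical error types and gate names; the error type no gate covers is the empty cell worth funding next:

```python
# Sketch of an error-type-to-gate coverage map. The names are hypothetical
# placeholders, not a prescribed taxonomy.
error_types = ["omission", "hallucinated_reference", "destructive_command", "schema_drift"]
gates = {
    "lint_and_typecheck": {"schema_drift"},
    "policy_check": {"destructive_command"},
    "human_review": {"hallucinated_reference"},
}

def coverage_gaps(error_types, gates):
    """Return the error types no gate claims to catch: the empty cells."""
    covered = set().union(*gates.values()) if gates else set()
    return [e for e in error_types if e not in covered]

print(coverage_gaps(error_types, gates))  # ['omission']
```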
Adding more reviewers doesn't help if they're all looking at the same thing. Checks at different stages catch fundamentally different errors.
Coding agents don't make random mistakes. 87% of failures are predictable — omissions and systematic errors that compound through every stage of the pipeline.
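As a toy illustration of what compounding does to a pipeline (the per-stage rate here is made up, not drawn from the 87% figure):

```python
# If each of N stages lets an uncaught error through with probability p,
# the odds of a clean final output shrink geometrically. Illustrative numbers only.
p_uncaught_per_stage = 0.05
stages = 4

p_clean = (1 - p_uncaught_per_stage) ** stages
print(f"P(at least one error survives) = {1 - p_clean:.1%}")  # ~18.5%
```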
A coding agent issued a terraform destroy in dev. The fix wasn't better reviewers — it was a deterministic gate that routes only what matters to humans.
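A minimal sketch of the kind of gate I mean, assuming the pipeline sees the agent's proposed command before it runs; the patterns and the two-outcome routing are illustrative, not any specific tool's API:

```python
import re

# Deterministic gate: destructive commands are routed to a human,
# everything else passes without review. Patterns are illustrative.
DESTRUCTIVE_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\bdrop\s+table\b",
    r"\brm\s+-rf\s+/",
]

def gate(command: str) -> str:
    """Return 'needs_human' for destructive commands, 'allow' otherwise."""
    for pattern in DESTRUCTIVE_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return "needs_human"  # deterministic: no model judgment involved
    return "allow"

assert gate("terraform plan") == "allow"
assert gate("terraform destroy -auto-approve") == "needs_human"
```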
A hallucinated company name in a marketing report. The root cause wasn't the model — it was a missing gate early in the pipeline.
Why AI failures propagate — and why the fix is checkpoints, not better models