What if adding more reviewers doesn't actually help?
AI-generated code is flooding our pipelines. It has bugs, just like any code, but the sheer volume overwhelms reviewers. Humans don't scale: they get fatigued, and they miss things. And, really, no one wants to review slop.
Our instinct is to add more reviewers. The problem is diminishing returns: more reviewers end up looking at the same things.
So where can you add gates to really make an impact? We learn early on that the cost of a bug compounds the further it travels through the pipeline uncaught. An architecture bug caught during design is cheap and easy to fix. If it isn't found until deployment, remediation is enormously expensive.
I've been delivering code with a structured, agentic pipeline for a while now. It has distinct phases, with gates between each phase. Looking at the actual defect data from my own pipeline, I found that a check added at a different stage had far more impact than more checks piled onto the same stage.
This is because all the checks at one stage are looking at the same artifact. They overlap on the problem space and share the same blind spots. Checks at different stages, though, are independent. The artifacts at earlier stages are a viewport from a completely different angle, and that's where the gates see unique issues.
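The overlap-versus-independence argument can be sketched with toy numbers. Here bugs are just IDs, and each check has a hypothetical coverage set (the numbers are illustrative, not my actual defect data): two checks at the same stage see the same artifact, so their sets overlap heavily, while a check at a different stage covers a different region entirely.

```python
# Toy model: 100 possible defects, each identified by an ID.
# Coverage sets below are hypothetical, chosen to illustrate overlap.
bugs = set(range(100))

linter_a = set(range(0, 60))        # first same-stage check
linter_b = set(range(5, 62))        # second check at the SAME stage: mostly the same bugs
design_review = set(range(55, 95))  # check at a DIFFERENT stage: different bugs

base = len(linter_a) / len(bugs)
same_stage = len(linter_a | linter_b) / len(bugs)
cross_stage = len(linter_a | design_review) / len(bugs)

print(f"one linter:        {base:.0%}")
print(f"+ second linter:   {same_stage:.0%}  (marginal gain {same_stage - base:.0%})")
print(f"+ design review:   {cross_stage:.0%}  (marginal gain {cross_stage - base:.0%})")
# The second linter adds 2 points of coverage; the design review adds 35.
```

Same mechanism, wildly different payoff: the second same-stage check mostly re-catches what the first already caught.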
To see if this held outside software, I validated it against a medical imaging pipeline with multiple stages and gates. 96% of what stage 4 rejected was invisible to stage 2. Different stages catch fundamentally different things.
Checks added to a single stage have diminishing returns. Checks added across stages actually multiply the effect.
"Adding checks always helps" is obvious. Knowing WHICH check to add next isn't. The answer: the one with the lowest redundancy. The one catching things nothing else catches. If you already have a linter, adding a second linter gives you almost nothing. Adding a design review catches a completely different class of error.
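"Lowest redundancy" has a direct operational form: score each candidate check by how many defects it would catch that nothing existing catches, and add the highest scorer. A minimal sketch, with made-up check names and coverage sets standing in for real defect data:

```python
# Hypothetical coverage sets, keyed by defect ID.
existing = {"linter": {1, 2, 3, 4, 5}}
candidates = {
    "second_linter": {2, 3, 4, 5, 6},   # high redundancy with the linter
    "type_checker":  {4, 5, 6, 7},      # partial redundancy
    "design_review": {7, 8, 9, 10},     # no redundancy at all
}

already_caught = set().union(*existing.values())

def marginal_gain(coverage: set) -> int:
    """Defects this check catches that no existing check does."""
    return len(coverage - already_caught)

for name, cov in candidates.items():
    print(f"{name:14s} marginal gain = {marginal_gain(cov)}")

best = max(candidates, key=lambda name: marginal_gain(candidates[name]))
print("add next:", best)  # design_review wins despite being the smallest set
```

Run greedily, this is the whole playbook: after each addition, recompute and pick again. The second linter scores a 1; the design review scores a 4.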
See the full analysis across 5,109 quality checks →