Overlap Ratio
Don't just benchmark the model. Benchmark the pipeline.
Model benchmarks measure models in isolation, but we deploy models in pipelines. The overlap ratio measures whether your verification gates are doing independent work or just running the same check twice.
0.118 software pipeline omega · 3 medical imaging models · 4 gate types
This presentation introduces the overlap ratio: a single number that tells you whether your AI verification pipeline is doing independent work or wasting effort on redundant checks. It draws on empirical data from a software delivery pipeline and cross-validates against medical imaging experiments.
The argument proceeds in four stages: the problem, the intuition, the metric, and the evidence.
This presentation builds on three prior publications.
The overlap ratio operationalizes one of the four properties from Trust Topology, making it concrete and measurable for practitioners.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
The Problem
We benchmark models in isolation but deploy them in pipelines. The quality of the final artifact depends on the arrangement, not just the engine.
Model accuracy on benchmarks. Output quality in isolation. Single-turn correctness. The engine.
Multi-stage pipelines. Verification gates. Human escalation. Orchestrated workflows. The car.
"We measure engine quality but not the quality of the car."
Model performance matters. The existing benchmarks measure model output quality in a variety of ways. However, when we deploy models, we orchestrate them into a workflow. The quality of intermediate artifacts is relevant, but the thing we really care about is the quality of the final artifact. A better measure for practitioners would focus on how the agent's work composes and contributes to that end result.
In practice, work flows through a series of stages that produce intermediate results. Every stage is verified by gates that either pass, require a retry, or escalate to a human for a decision. Each gate sees the work as it progresses and applies specific, state-appropriate checks before errors propagate and get magnified by the rest of the chain.
Decomposing work into incremental stages with checkpoints is generally a good practice—we've been doing this long before agents. But how do we know that adding a new check actually adds value? How do we decide what needs more coverage versus what's good enough?
Intuitively, it's pretty clear: we don't need a new check that only verifies the same thing as an existing one. We can talk about defense in depth, where we verify something from multiple angles. But if two tests are identical, they are wasteful redundancy. What's missing is a way to measure this: a number that tells you whether your verification infrastructure is doing independent work or repeating itself.
The Intuition
The value of a second check depends entirely on whether it catches something the first one doesn't.
High overlap. Both guards check the same IDs. The second guard adds nothing. You're paying for the same check twice.
Low overlap. One checks IDs, one checks packages. They reject different things. Each guard earns their place.
"You need a different kind of gate, not more passes through the same one."
Imagine two security guards at a building. If they both check IDs, that second guard isn't adding anything. But if one checks IDs and one checks packages, they have no overlap. They catch different things.
This is intuitively clear, but we can also put a number on this and reason about it empirically. Two checks that reject identical things have 100% overlap, or a score of 1. Two checks that reject entirely different things have 0.
In practice, complex gates might reject something for a blend of tests that trigger past a threshold. Two gates end up with some overlap: they reject some of the same things, perhaps for different reasons. The overlap can be any value from 0 to 1.
We'll talk about this quite a bit later, so instead of constantly saying "overlap ratio," we'll just call it omega, or ω.
The inference-scaling literature hits a version of this problem. Brown et al.¹ found that common methods for picking correct solutions from many samples, like majority voting and reward models, plateau beyond several hundred samples. The framework predicts this. When your verification signals have high overlap, more samples cannot help. You need a different kind of gate, not more passes through the same one.
¹ Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," 2024. arxiv.org/abs/2407.21787
The Metric
ω = shared catches / total unique catches
ω = 0
Perfectly complementary
Every gate catches unique errors
ω = 1
Perfectly redundant
Every gate catches the same errors
In a multi-stage pipeline, a task flows through stages: plan, design, code. Each stage has its own review gate checking a different artifact that is produced for a task at that stage. A task can pass plan review, then fail code review. Or it can fail plan review, get revised, pass, and then fail design review for a completely different reason. The same task produces different artifacts that are checked at multiple points in its lifecycle.
The overlap ratio asks: when two different gates both rejected artifacts from the same task, were they catching the same problem or different ones? Count the tasks caught by more than one gate, divide by the total caught by all gates. If every gate catches the same tasks, omega is 1. If each gate catches unique tasks, omega is 0.
You can measure this across your entire pipeline for a global number, or zoom into a specific pair of gates. This tells you where you have redundancy and where a new gate might add the most value.
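To make the counting concrete, here's a minimal Python sketch. It assumes rejection records are simple (task, gate) pairs; the record shape and gate names are illustrative, not the format of the study's tooling.

```python
from itertools import combinations

# Each record pairs a task with a gate that rejected one of its artifacts.
rejections = [
    ("task-1", "plan_review"),
    ("task-1", "code_review"),
    ("task-2", "design_review"),
    ("task-3", "plan_review"),
]

def gate_catches(records):
    """Map each gate to the set of tasks it rejected."""
    catches = {}
    for task, gate in records:
        catches.setdefault(gate, set()).add(task)
    return catches

def global_omega(records):
    """Tasks caught by more than one gate, over all caught tasks."""
    catches = gate_catches(records)
    all_tasks = set().union(*catches.values())
    shared = {t for t in all_tasks
              if sum(t in tasks for tasks in catches.values()) > 1}
    return len(shared) / len(all_tasks)

def pairwise_omega(records):
    """Omega for each pair of gates: shared catches / union of catches."""
    catches = gate_catches(records)
    return {
        (a, b): len(catches[a] & catches[b]) / len(catches[a] | catches[b])
        for a, b in combinations(sorted(catches), 2)
    }

print(global_omega(rejections))    # 1 shared task out of 3 caught -> 0.33
print(pairwise_omega(rejections))  # per-pair redundancy
```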
Omega works across two fundamentally different pipeline shapes:
Sequential with revision (e.g., software delivery): a task flows through stages, producing artifacts that can be revised after each rejection. When two gates both reject artifacts from the same task, they saw different versions of the work at different stages. A shared catch means either the revision didn't fully address the issue, or the second gate found a different problem that only became visible after the first was fixed.
Sequential without revision (e.g., medical imaging): the pipeline has distinct stages (gland segmentation, then lesion detection) but no revision cycle. Within each stage, multiple quality checks run in parallel against the same artifact. A segmentation model produces a mask, and then volume checks, centroid trajectory checks, and boundary checks all evaluate that same mask simultaneously. A shared catch within a stage means genuine redundancy: two gates flagged the exact same artifact for overlapping reasons. A shared catch across stages is more like the software case: different artifacts, different perspectives, but without the revision confound.
The within-stage parallel case is a cleaner measurement of redundancy, because there's no confound from revision or different artifacts. The sequential-with-revision case is more common in practice and still informative, but shared catches carry a subtly different meaning.
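For the no-revision shape, a small sketch of that within-stage versus cross-stage split might look like this, assuming a hypothetical gate-to-stage map (the gate names are illustrative):

```python
# Hypothetical gate -> stage map for the no-revision (imaging) shape.
STAGE = {
    "volume_check": "segmentation",
    "centroid_check": "segmentation",
    "boundary_check": "segmentation",
    "confidence_check": "detection",
}

def split_shared_catches(records):
    """Split multi-gate catches into within-stage (clean redundancy)
    and cross-stage (different artifacts, different perspectives).
    A case whose gates span stages counts as cross-stage here."""
    gates_per_case = {}
    for case, gate in records:
        gates_per_case.setdefault(case, set()).add(gate)
    within = cross = 0
    for gates in gates_per_case.values():
        if len(gates) < 2:
            continue
        if len({STAGE[g] for g in gates}) == 1:
            within += 1
        else:
            cross += 1
    return within, cross
```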
As you'd expect, omega changes with the choice of gates. But here's something interesting: it also changes with the choice of model.
A pipeline's omega is defined by what its gates catch. A system designer can judge gate overlap as they chain things together. But what the gates actually catch in practice is also driven by model output.
Given constant gates, you can use omega as a model benchmark. A strong model performs well across a variety of challenges and fails narrowly; gates pick up specific edge cases. A weak model fails in ways that are broadly wrong, visible from many perspectives and therefore caught by many gates.
A weak model makes the same fundamental mistakes everywhere. A strong model fails in isolated ways that only certain gates can see.
Is omega = 0 the ideal? Not necessarily. Some overlap provides a safety net. If a stochastic verifier misses something it usually catches, a second gate covering the same ground catches it instead. The question is how much overlap is useful redundancy versus waste. In my own practice, anything above 0.3 suggests that you're paying for coverage you might already have.
The Evidence
Constant verification. Variable model.
Omega changes with model strength.
"A strong model fails in narrow ways. A weak model makes the same mistakes everywhere."
The medical imaging pipeline has two stages (gland segmentation, then lesion detection) but no revision cycle. Within each stage, multiple quality checks run in parallel against the same artifact. A segmentation model produces a mask, and then volume, centroid, boundary, and smoothness checks all evaluate that single artifact simultaneously.
When two gates within the same stage both flag a case, it's genuine redundancy: both flagged a problem in the same output. With a strong model, cross-stage overlap is rare: only 2 out of 143 rejections were shared across stages for Bosma22b (N=1,500), because the stages check fundamentally different artifacts.
The experiment ran the same verification gates across three models of varying strength on radiological images. The results:
| Model | Omega | Story |
|---|---|---|
| Bosma22b | 0.125 | Each gate catches almost entirely unique errors |
| MONAI | 0.312 | Some overlap, but each gate still earns its place |
| TotalSegmentator | 0.767 | Gates are mostly seeing the same failures |
Model failure is like a shotgun spread. A strong model concentrates its failures in a tight pattern, and the gates pick up specific edge cases. A weak model sprays errors across a wide cone, hitting most of the gates.
Bosma22b is the strongest model here, well suited to the task. The pipeline shows low omega because each gate is doing unique work. TotalSegmentator shows high omega because the gates are catching the same issues over and over.
The implication: given constant gates, omega doubles as a model benchmark. It tells you not just how often a model fails, but how it fails, either in narrow, specific ways (low omega) or in broad, repeated ways (high omega).
Gate Architecture
Each gate has a deterministic side and a stochastic side. Together they form the tri-state gate: pass, fail, or human.
Hard guarantees. Does it compile? Is the detected mass inside the organ boundary? Do the references exist? Pass or fail.
LLM opinion. Judgement calls with no definitive answer. Is the code quality acceptable? Is the segmentation plausible? Are the references relevant? Pass, fail, or escalate.
Pass → proceed · Fail → retry · Escalate → human decides
The deterministic verifiers check things we know are true: does it pass lint rules? Is the detected mass inside the organ boundary? These are hard guarantees about the output. They either hold or they don't.
But some deterministic checks use empirical thresholds. Is the organ volume between 10 and 150cc? Is the centroid trajectory smooth enough? The gate itself is a pure function (same input, same output), but cases near the threshold are ambiguous. The centroid moved more than typical, but not enough to be implausible. What do you do with that?
This is where deterministic and stochastic verifiers connect. An inconclusive deterministic result is a natural handoff to the stochastic verifier. The deterministic gate narrowed the field—it's not definitely bad—and the LLM reviewer makes the judgment call on whether it's good enough.
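A minimal sketch of that handoff, using the volume check as the example. The 10–150cc hard bounds come from above; the inner "typical" band and the wiring to the LLM reviewer are illustrative assumptions:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"   # hand to a human

# Hard bounds from the text; the inner "typical" band is illustrative.
HARD_MIN, HARD_MAX = 10.0, 150.0   # outside this: definitely implausible
SOFT_MIN, SOFT_MAX = 20.0, 120.0   # inside this: clearly typical

def volume_gate(volume_cc, llm_review):
    """Deterministic check with an inconclusive band that hands off
    to a stochastic (LLM) reviewer, which may itself escalate."""
    if not HARD_MIN <= volume_cc <= HARD_MAX:
        return Verdict.FAIL       # hard guarantee violated
    if SOFT_MIN <= volume_cc <= SOFT_MAX:
        return Verdict.PASS       # pure function: same input, same output
    # Unusual but not impossible: the deterministic gate narrowed the
    # field, and the judgment call goes to the LLM reviewer.
    return llm_review(volume_cc)  # returns PASS, FAIL, or ESCALATE
```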
LLMs are imperfect verifiers. They may produce false positives, rejecting things that are actually correct. But they see things deterministic checks can't: whether the code is well-structured, whether the approach makes sense, whether the output matches the intent.
Some pipelines can tolerate a certain amount of uncertainty. A flag for repeated code is not a showstopper, but too much repetition indicates a real problem. An organ detected slightly outside anatomical norms is exceptional but not impossible.
Like the deterministic verifier, the LLM also has an escalation path if it is uncertain: the human.
People normally assume gates are pass/fail, but the real benefit comes from pass/fail/human. Deterministic tests provide hard guarantees and lead to hard fail. Stochastic gates are by definition based on probability, so uncertain decisions can be passed up to a human for final judgment.
The benefit of this third state is that you can handle uncertain cases in two ways: escalate them to a human, or loosen the gate and let downstream gates catch the edge cases.
However, the real value comes from the filtering effect: the pipeline handles clear cases automatically. The human only sees things where they add unique expertise.
Diagnostics
High omega. Gates catching the same things. Add a different kind of check.
Escaped errors. Not enough coverage. Check the failure modes you're seeing in the final artifact.
Frequent escalations. A gate at its edge. Clustered escalations point at a high-value gate.
There's an old saying in engineering: you can have something high quality, built quickly, and within budget. Choose two. This applies to gate design as well. If we have to choose which gate to add to fit within our constraints, what gets the best return? Omega tells us: wherever your existing gates have the least coverage.
But omega only tells us about redundancy between existing gates. It says nothing about errors that escape the entire pipeline. How do we detect those?
High omega means your gates are catching the same failures. Either the gates are too similar (design problem) or the model is weak enough that it fails broadly (model problem). Both are actionable: redesign the gate or swap the model.
Models generally fail in three ways: omissions, incorrectness, or incoherence. Look at the types of failures appearing in the final product and reason about what gates might catch that kind of error. If you discover the model is consistently missing something, the obvious fix is to add a check for that thing.
A specific example: the Medical Imaging study mentioned earlier has one gate that checks centroid smoothness. It is the only test for that type of violation. Remove it, and a specific kind of error spikes.
A large number of escalations from one gate indicates that it is on the edge of its ability. If those escalations are consistently about the same thing, they are a signpost pointing at a high-value gate. This is the tri-state gate doing its job: flagging exactly where you need to invest.
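Finding those clusters doesn't require anything fancy. A sketch, assuming escalation records carry a gate name and a classified topic (both labels illustrative):

```python
from collections import Counter

# Escalation records: (gate, topic). Topic labels are illustrative;
# in practice they might come from the same LLM classification pass.
escalations = [
    ("design_review", "unclear-requirements"),
    ("design_review", "unclear-requirements"),
    ("design_review", "api-compatibility"),
    ("code_review", "naming"),
]

per_gate = Counter(gate for gate, _ in escalations)
clustered = Counter(escalations)  # same gate *and* same topic

print(per_gate.most_common())     # which gate is at its edge
print(clustered.most_common(3))   # repeated topics point at a missing gate
```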
Getting Started
However you prompt an agent, you have some form of stages and verification. Even ad hoc corrections in chat logs count.
Explicit gates: Code compiled? Lint passed? Review approved? These are gates. Record what each one catches.
Implicit gates: You told the LLM to fix something it made? That's a failed gate. Use an LLM to classify your chat logs. Every correction is data.
End-user gates: Reader says "stop sending LLM slop"? The final verification step just failed. That's an escaped error.
"Add new coverage, check your omega, repeat."
You can calculate omega on your pipeline today. The key is tracking rejections per task across gates. For each task that flows through your pipeline, record which gates rejected it and why.
The task-level omega is straightforward: the number of tasks rejected by more than one gate type, divided by the total number of rejected tasks. In my pipeline, 46 out of 389 rejected tasks had artifacts rejected by multiple gates. That's the headline number.
But you can go deeper. For those 46 tasks, were the gates catching the same issue or different issues? This is catch-level omega: compare the specific problems each gate flagged using text similarity. In my data, only 3 out of 702 classified catches were truly redundant. The rest were complementary: same task, different problems found by each gate.
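One way to approximate catch-level omega, assuming you've recorded the problem text each gate flagged. The similarity method (difflib here) and the threshold are illustrative choices, not the exact procedure behind the 3-out-of-702 number:

```python
from difflib import SequenceMatcher

def redundant_catch_pairs(catches, threshold=0.8):
    """Count catch pairs on the same task whose flagged problems read
    as near-duplicates. `catches` is a list of
    (task_id, gate, problem_text) tuples."""
    redundant = 0
    for i, (task_a, gate_a, text_a) in enumerate(catches):
        for task_b, gate_b, text_b in catches[i + 1:]:
            if task_a != task_b or gate_a == gate_b:
                continue  # only compare different gates on the same task
            if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
                redundant += 1
    return redundant
```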
You don't need a complex setup. However you prompt an agent, you have some form of stages and verification. Code compiled? Verification passed. You try the code and it corrupts your data? Failed. You tell it to draft an email, you send it, the reader asks you to stop sending LLM slop: verification failed. Count the failures per gate per task, calculate omega.
While structured gates are ideal, even ad hoc corrections count. Use an LLM to examine chat logs and classify each prompt, looking for prompts that ask the LLM to fix something it made. A correction like this is a gate that failed.
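A sketch of that classification pass. `llm_yes_no` is a hypothetical placeholder wrapping whatever LLM client you use, and the prompt wording is illustrative:

```python
# Hypothetical helper: `llm_yes_no` wraps your LLM client of choice
# and returns True/False. The prompt wording is illustrative.
PROMPT = (
    "Does this user message ask the assistant to fix something the "
    "assistant itself produced earlier? Answer yes or no.\n\n{msg}"
)

def implicit_gate_failures(user_messages, llm_yes_no):
    """Each correction prompt counts as one failed implicit gate."""
    return [m for m in user_messages if llm_yes_no(PROMPT.format(msg=m))]
```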
I've published an open source tool that analyzes Claude Code session logs. Using LLM classification (with your API key), it discovers review gates automatically, classifies their outcomes, and categorizes the error types. Then it calculates your omega.
More gates are better, though at some point you get diminishing returns. The important refinement is to add new coverage, check your omega to validate, and repeat. If omega goes up as you add gates, your new gates are redundant. If it goes down, they're catching new things.
How many tasks do you need before omega stabilizes? In practice, the pairwise numbers are consistent after about 100 rejected tasks. With fewer than that, individual rejections swing the ratio significantly. If you're just starting out, don't over-index on the exact number; look at the trend instead.
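A simple way to watch the trend is to recompute omega as rejections accumulate. A sketch, reusing the (task, gate) record shape from earlier:

```python
def omega_trend(records):
    """Global omega recomputed after each (task, gate) rejection record,
    in arrival order. Below ~100 rejected tasks, expect the curve to
    swing; watch for it to flatten."""
    gates_per_task = {}
    trend = []
    for task, gate in records:
        gates_per_task.setdefault(task, set()).add(gate)
        shared = sum(1 for g in gates_per_task.values() if len(g) > 1)
        trend.append(shared / len(gates_per_task))
    return trend
```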
The Data
389 rejected tasks. 4 gate types. 3 out of 702 classified rejections were truly redundant.
Software pipeline 0.118 ≈ Bosma22b 0.125 · Cross-domain, same pattern.
The analysis ran across a software delivery pipeline with four gate types: plan review, design review, agentic cross-artifact code review, and single file code review. These are spread across distinct stages, so a plan review gate and a code review gate are looking at completely different artifacts.
The global omega across 389 rejected tasks: 0.118. Very low. The gates are doing almost entirely independent work. (In the spirit of full disclosure: most of this work is done with Opus 4.5 or 4.6, so model strength also contributes to low omega.)
Not all gates work equally hard. Looking at the per-gate rejection rates across tasks reveals which gates are doing the heavy lifting:
| Gate | Tasks Seen | Tasks Rejected | Rejection Rate |
|---|---|---|---|
| Plan Review | 214 | 153 | 71.5% |
| Design Review | 580 | 241 | 41.6% |
| Code Review | 148 | 42 | 28.4% |
Plan review rejects nearly three quarters of the initial artifacts for the tasks it sees. By the time work reaches code review, only 28% gets rejected. This is the cascade effect: upstream gates filter out problems early, so downstream gates see cleaner work.
This is useful in several ways. A gate with a high rejection rate is doing the most filtering. That's either a sign that it's catching problems at the cheapest stage (good pipeline design), or that the stage feeding it is producing low-quality output (model or prompt problem). A gate with a very low rejection rate might be redundant, or it might be the only gate catching a rare but critical class of error. The rejection rate alone doesn't tell you which, but combined with omega it does: low rejection rate + low pairwise omega means the gate is catching something unique that nothing else sees.
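As a heuristic, you could fold both signals into a single read per gate. The thresholds below are illustrative, not derived from this data:

```python
def read_gate(rejection_rate, max_pairwise_omega):
    """Heuristic read of one gate from the two signals together."""
    if max_pairwise_omega >= 0.3:
        return "likely redundant: overlaps heavily with another gate"
    if rejection_rate < 0.10 and max_pairwise_omega < 0.05:
        return "unique catcher: low volume, but nothing else sees it"
    return "earning its place"
```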
| Gate Pair | Shared | Union | Omega |
|---|---|---|---|
| code review ↔ plan review | 36 | 159 | 0.226 |
| design review ↔ plan review | 10 | 384 | 0.026 |
| code review ↔ design review | 2 | 281 | 0.007 |
Code review and plan review have the most overlap at 0.226. Thirty-six tasks had artifacts that were rejected at both stages. This makes sense: sometimes a plan-level issue resurfaces at the code level because the fix didn't fully address it. But even here, the overlap is modest. Design review is nearly independent of everything else: 0.026 and 0.007.
The most telling result: when looking at the error classes of shared rejections—cases where two gates rejected artifacts from the same task for the same type of error—there were exactly 3 instances out of 702 classified rejections. Almost zero true redundancy. The rest were complementary: same task, different problems caught by each gate.
The software pipeline omega (0.118) and the strong prostate model (0.125) are nearly identical despite being completely different domains. The weak multi-organ model (0.767) is six times higher. Same verification architecture, same pattern: the topology determines coverage quality, and model quality determines where on the spectrum you land.
The Punchline
Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
Model benchmarks measure the model in isolation.
Omega measures the model in the context of your verification infrastructure.
"We measure how good the engine is, but not how good the car is."
There's an asymmetry worth making explicit: model benchmarks measure the model in isolation. Omega measures the model in the context of your verification infrastructure. Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
This points at something the industry hasn't quite internalized yet. We have extensive benchmarks for models. We have none for pipelines. We measure how good the engine is, but not how good the car is.
Omega is a step toward benchmarking the pipeline itself. It doesn't tell you everything. It can't see errors that escape entirely, and it can't tell you whether your gates are catching the right things. But it tells you whether your verification infrastructure is doing independent work or just running the same check multiple times. And it gives you a concrete number to track as you iterate.
You can track omega over time as you improve your models. If you swap to a stronger model and omega drops, the new model is failing in more specific ways. Your verification infrastructure is doing more differentiated work. That's a signal that both the model and the pipeline are improving together.