Overlap Ratio
Don't just benchmark the model. Benchmark the pipeline.
Model benchmarks measure models in isolation, but we deploy models in pipelines. The overlap ratio measures whether your verification gates are doing independent work or just running the same check twice.
0.118 software pipeline omega · 3 medical imaging models · 4 gate types
This presentation introduces the overlap ratio: a single number that tells you whether your AI verification pipeline is doing independent work or wasting effort on redundant checks. It draws on empirical data from a software delivery pipeline and cross-validates against medical imaging experiments.
The argument proceeds in four stages:
This presentation builds on three prior publications:
The overlap ratio operationalizes one of the four properties from Trust Topology, making it concrete and measurable for practitioners.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
The Problem
We benchmark models in isolation but deploy them in pipelines. The quality of the final artifact depends on the arrangement, not just the engine.
Model accuracy on benchmarks. Output quality in isolation. Single-turn correctness. The engine.
Multi-stage pipelines. Verification gates. Human escalation. Orchestrated workflows. The car.
"We measure engine quality but not the quality of the car."
Model performance matters. The existing benchmarks measure model output quality in a variety of ways. However, when we deploy models, we orchestrate them into a workflow. The quality of intermediate artifacts is relevant, but the thing we really care about is the quality of the final artifact. A better measure for practitioners would focus on how the agent's work composes and contributes to that end result.
In practice, work flows through a series of stages that produce intermediate results. Every stage is verified by gates that either pass, require a retry, or escalate to a human for a decision. Each gate sees the work as it progresses and applies specific, state-appropriate checks before errors propagate and get magnified by the rest of the chain.
Decomposing work into incremental stages with checkpoints is generally a good practice—we've been doing this long before agents. But how do we know that adding a new check actually adds value? How do we decide what needs more coverage versus what's good enough?
Intuitively, it's pretty clear: we don't need a new check that only verifies the same thing as an existing one. We can talk about defense in depth, where we verify something from multiple angles. But if two tests are identical, they are wasteful redundancy. What's missing is a way to measure this: a number that tells you whether your verification infrastructure is doing independent work or repeating itself.
The Intuition
The value of a second check depends entirely on whether it catches something the first one doesn't.
High overlap. Both guards check the same IDs. The second guard adds nothing. You're paying for the same check twice.
Low overlap. One checks IDs, one checks packages. They reject different things. Each guard earns its place.
"You need a different kind of gate, not more passes through the same one."
Imagine two security guards at a building. If they both check IDs, that second guard isn't adding anything. But if one checks IDs and one checks packages, they have no overlap. They catch different things.
This is intuitively clear, but we can also put a number on this and reason about it empirically. Two checks that reject identical things have 100% overlap, or a score of 1. Two checks that reject entirely different things have 0.
In practice, complex gates might reject something for a blend of tests that trigger past a threshold. Two gates end up with some overlap: they reject some of the same things, perhaps for different reasons. The overlap can be any value from 0 to 1.
We'll talk about this quite a bit later, so instead of constantly saying "overlap ratio," we'll just call it omega, or ω.
The inference-scaling literature hits a version of this problem. Brown et al. [1] found that common methods for picking correct solutions from many samples, like majority voting and reward models, plateau beyond several hundred samples. The framework predicts this. When your verification signals have high overlap, more samples cannot help. You need a different kind of gate, not more passes through the same one.
[1] Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," 2024. arxiv.org/abs/2407.21787
The Metric
ω = shared catches / total unique catches
ω = 0
Perfectly complementary
Every gate catches unique errors
ω = 1
Perfectly redundant
Every gate catches the same errors
Omega is the count of things caught by multiple gates divided by the total things caught by all gates. If every gate catches the same things, omega is 1. If each gate catches something unique, omega is 0.
You can measure this at two scopes: across all the gates in your entire pipeline, or for any subset, down to just two gates if you want to zoom in on one particular part of the pipeline.
This lets you objectively describe how much overlap your pipeline has overall, or in one section. That, in turn, informs gate placement: a section where overlap is already high doesn't need another similar check.
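The definition is straightforward to compute once you record, per gate, the set of items each gate rejected. A minimal Python sketch (the gate names and task IDs are illustrative, not from the studies):

```python
from itertools import combinations

def omega(catches: dict[str, set[str]]) -> float:
    """Overlap ratio: items caught by more than one gate,
    divided by all unique items caught by any gate."""
    all_caught = set().union(*catches.values()) if catches else set()
    if not all_caught:
        return 0.0  # no rejections recorded yet; treat undefined omega as 0
    # An item is "shared" if any pair of gates both caught it.
    shared = {
        item
        for a, b in combinations(catches.values(), 2)
        for item in a & b
    }
    return len(shared) / len(all_caught)

# Two gates rejecting the same tasks: perfectly redundant.
print(omega({"lint": {"t1", "t2"}, "review": {"t1", "t2"}}))  # 1.0

# Two gates rejecting disjoint tasks: perfectly complementary.
print(omega({"lint": {"t1", "t2"}, "review": {"t3", "t4"}}))  # 0.0
```

The same function works on the whole pipeline or on any subset: passing just two gates gives a pairwise overlap number.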
As you'd expect, omega changes with the choice of gates. But here's something interesting: it also changes with the choice of model.
A pipeline's omega is defined by what its gates catch. A system designer can judge gate overlap as they chain things together. But what the gates actually catch in practice is also driven by model output.
Given constant gates, you can use omega as a model benchmark. A strong model performs well across a variety of challenges and fails narrowly. Gates pick up specific edge cases. A weak model fails in ways that are broadly wrong and are observable from a variety of perspectives, seen by many gates.
A weak model makes the same fundamental mistakes everywhere. A strong model fails in isolated ways that only certain gates can see.
Is omega = 0 the ideal? Not necessarily. Some overlap provides a safety net. If a stochastic verifier misses something it usually catches, a second gate covering the same ground catches it instead. The question is how much overlap is useful redundancy versus waste. In my own practice, anything above 0.3 suggests that you're paying for coverage you might already have.
The Evidence
Constant verification. Variable model.
Omega changes with model strength.
"A strong model fails in narrow ways. A weak model makes the same mistakes everywhere."
The experiment ran the same verification gates across three models of varying strength on radiological images. The results:
| Model | Omega | Story |
|---|---|---|
| Bosma22b | 0.125 | Each gate catches almost entirely unique errors |
| MONAI | 0.312 | Some overlap, but each gate still earns its place |
| TotalSegmentator | 0.767 | Gates are mostly seeing the same failures |
Model performance is like a shotgun. A strong model concentrates its output in a small area, and the gates pick up specific edge cases. A weak model sprays output and issues across a larger cone, hitting most of the gates.
Bosma22b is the strongest model here, well suited to the task. The pipeline shows low omega because each gate is doing unique work. TotalSegmentator shows high omega because the gates are catching the same issues over and over.
The implication: given constant gates, omega doubles as a model benchmark. It tells you not just how often a model fails, but how it fails, either in narrow, specific ways (low omega) or in broad, repeated ways (high omega).
Gate Architecture
Each gate has a deterministic side and a stochastic side. Together they form the tri-state gate: pass, fail, or human.
Hard guarantees. Does it compile? Is the detected mass inside the organ boundary? Do the references exist? Pass or fail.
LLM opinion. Judgement calls with no definitive answer. Is the code quality acceptable? Is the segmentation plausible? Are the references relevant? Pass, fail, or escalate.
Pass → proceed · Fail → retry · Escalate → human decides
The deterministic verifiers check things we know are true: does it pass lint rules? Is the detected mass inside the organ boundary? These are hard guarantees about the output. They either hold or they don't.
But some deterministic checks use empirical thresholds. Is the organ volume between 10 and 150 cc? Is the centroid trajectory smooth enough? The gate itself is a pure function (same input, same output), but cases near the threshold are ambiguous. The centroid moved more than typical, but not enough to be implausible. What do you do with that?
This is where deterministic and stochastic verifiers connect. An inconclusive deterministic result is a natural handoff to the stochastic verifier. The deterministic gate narrowed the field—it's not definitely bad—and the LLM reviewer makes the judgment call on whether it's good enough.
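As a sketch of that handoff, here is a tri-state version of the organ-volume check. The thresholds and function names are hypothetical; the point is the structure: hard bounds fail outright, a comfortable middle band passes, and the ambiguous band escalates to the stochastic reviewer.

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"  # hand off the ambiguous case

# Hypothetical thresholds for the organ-volume check described above.
HARD_MIN, HARD_MAX = 10.0, 150.0  # outside this range: definitely implausible
SOFT_MIN, SOFT_MAX = 20.0, 120.0  # inside this range: definitely plausible

def volume_gate(volume_cc: float) -> Verdict:
    """Deterministic check with an ambiguous band near the thresholds."""
    if volume_cc < HARD_MIN or volume_cc > HARD_MAX:
        return Verdict.FAIL
    if SOFT_MIN <= volume_cc <= SOFT_MAX:
        return Verdict.PASS
    return Verdict.ESCALATE  # near the edge: defer to the stochastic reviewer

def tri_state_gate(volume_cc: float, llm_review) -> Verdict:
    """Compose the deterministic check with an LLM judgment call."""
    verdict = volume_gate(volume_cc)
    if verdict is not Verdict.ESCALATE:
        return verdict
    # llm_review returns PASS, FAIL, or ESCALATE (to a human) on ambiguous cases
    return llm_review(volume_cc)
```

Note that the deterministic side never guesses: everything it cannot decide with certainty flows to the reviewer, which in turn can escalate to the human.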
LLMs are imperfect verifiers. They may produce false positives, rejecting things that are actually correct. But they see things deterministic checks can't: whether the code is well-structured, whether the approach makes sense, whether the output matches the intent.
Some pipelines can tolerate a certain amount of uncertainty. A flag for repeated code is not a showstopper, but too much repetition indicates a real problem. An organ detected slightly outside anatomical norms is exceptional but not impossible.
Like the deterministic verifier, the LLM also has an escalation path if it is uncertain: the human.
People normally assume gates are pass/fail, but the real benefit comes from pass/fail/human. Deterministic tests provide hard guarantees and lead to hard fail. Stochastic gates are by definition based on probability, so uncertain decisions can be passed up to a human for final judgment.
The benefit of this third state is that you can handle ambiguous cases in two ways: accept them and escalate, or loosen the gate and let downstream gates catch the edge cases.
However, the real value comes from the filtering effect: the pipeline handles clear cases automatically. The human only sees things where they add unique expertise.
Diagnostics
High omega. Gates catching the same things. Add a different kind of check.
Escaped errors. Not enough coverage. Check the failure modes you're seeing in the final artifact.
Frequent escalations. A gate at its edge. Clustered escalations point at a high-value gate.
There's an old saying in engineering: you can have something high quality, built quickly, and within budget. Choose two. This applies to gate design as well. If we have to choose which gate to add to fit within our constraints, what gets the best return? Omega tells us: wherever your existing gates have the least coverage.
But omega only tells us about redundancy between existing gates. It says nothing about errors that escape the entire pipeline. How do we detect those?
High omega means your gates are catching the same failures. Either the gates are too similar (design problem) or the model is weak enough that it fails broadly (model problem). Both are actionable: redesign the gate or swap the model.
Models generally fail in three ways: omissions, incorrectness, or incoherence. You can look at the type of failures appearing in the final product and reason about which gates might catch that kind of error. If you discover the model is consistently missing something, the obvious move is to add a check for that thing.
A specific example: the Medical Imaging study mentioned earlier has one gate that checks centroid smoothness. It is the only test for that type of violation. Remove it, and a specific kind of error spikes.
A large number of escalations from one gate indicates that it is on the edge of its ability. If those escalations are consistently about the same thing, they are a signpost pointing at a high-value gate. This is the tri-state gate doing its job: flagging exactly where you need to invest.
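One way to operationalize that signal is to count escalations per gate and topic and flag the clusters. A minimal sketch; the event format, names, and threshold here are assumptions for illustration, not part of the studies above:

```python
from collections import Counter

def escalation_hotspots(events, min_count=5):
    """Flag (gate, topic) pairs whose escalations cluster.

    `events` is a list of (gate_name, topic) pairs extracted from
    escalation logs. A cluster of repeated escalations on one topic
    is a signpost pointing at a high-value gate to add.
    """
    counts = Counter(events)
    return {
        (gate, topic): n
        for (gate, topic), n in counts.items()
        if n >= min_count
    }

events = (
    [("centroid smoothness", "trajectory jump")] * 6
    + [("volume check", "near threshold")] * 2
)
print(escalation_hotspots(events))
# Only the repeated centroid escalations cross the threshold.
```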
Getting Started
However you prompt an agent, you have some form of stages and verification. Even ad hoc corrections in chat logs count.
Explicit gates: Code compiled? Lint passed? Review approved? These are gates. Record what each one catches.
Implicit gates: You told the LLM to fix something it made? That's a failed gate. Use an LLM to classify your chat logs. Every correction is data.
End-user gates: Reader says "stop sending LLM slop"? The final verification step just failed. That's an escaped error.
"Add new coverage, check your omega, repeat."
You can calculate omega on your pipeline today. However you prompt an agent, you have some form of stages and gates. Your pipeline begins with a job to do: you provide a specification and tell it to build, or you provide an image and ask it to classify, or you explain your task and have it write a document.
Some pipelines have explicit intermediate artifacts—and I'd encourage you to do that—but even the simplest has some kind of final verification step. You tell it to draft an email, you send it, the reader asks you to stop sending LLM slop: verification failed. (And, really, you should also check this yourself before you click send.)
Code compiled? Verification passed. You try the code and it corrupts your data? Failed. It tells you an image of a prostate is a kidney? Failed.
Count the failures, calculate omega.
While structured gates are ideal, even ad hoc corrections count. Use an LLM to examine chat logs and classify each prompt, looking for prompts that ask the LLM to fix something it made. A correction like this is a gate that failed.
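A rough sketch of that classification pass. Here a trivial keyword heuristic stands in for the LLM classifier, purely to show the shape of the data you'd collect; in practice you would send each turn to a model, and all the names and markers below are illustrative:

```python
# Treat each user turn that corrects the model as a failed implicit gate.
CORRECTION_MARKERS = ("fix", "that's wrong", "you broke", "redo")

def classify_turn(user_message: str) -> str:
    """Return an error category if this turn is a correction, else 'none'."""
    text = user_message.lower()
    if "compile" in text or "lint" in text:
        return "build_failure"
    if any(marker in text for marker in CORRECTION_MARKERS):
        return "correction"
    return "none"

def count_implicit_failures(turns: list[str]) -> dict[str, int]:
    """Tally failed implicit gates by category across a chat log."""
    counts: dict[str, int] = {}
    for turn in turns:
        category = classify_turn(turn)
        if category != "none":
            counts[category] = counts.get(category, 0) + 1
    return counts
```

Once corrections are categorized this way, they feed the same omega calculation as explicit gates.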
I've provided an open source tool that analyzes Claude Code session logs. Using LLM classification (with your API key), it discovers review gates automatically, classifies their outcomes, and categorizes the error types. Then it calculates your omega.
More gates are better, though at some point you get diminishing returns. The important refinement is to add new coverage, check your omega to validate, and repeat. If omega goes up as you add gates, your new gates are redundant. If it goes down, they're catching new things.
How many tasks do you need before omega stabilizes? In practice, the pairwise numbers are consistent after about 100 rejected tasks. With fewer than that, individual rejections swing the ratio significantly. If you're just starting out, don't over-index on the exact number; look at the trend instead.
The Data
389 rejected tasks. 4 gate types. 3 out of 702 classified rejections were truly redundant.
Software pipeline 0.118 ≈ Bosma22b 0.125 · Cross-domain, same pattern.
The analysis ran across a software delivery pipeline with four gate types: plan review, design review, agentic cross-artifact code review, and single file code review. These are spread across distinct stages, so a plan review gate and a code review gate are looking at completely different artifacts.
The global omega across 389 rejected tasks: 0.118. Very low. The gates are doing almost entirely independent work. (In the spirit of full disclosure: most of this work is done with Opus 4.5 or 4.6, so model strength also contributes to low omega.)
| Gate Pair | Shared | Union | Omega |
|---|---|---|---|
| code review ↔ plan review | 36 | 159 | 0.226 |
| design review ↔ plan review | 10 | 384 | 0.026 |
| code review ↔ design review | 2 | 281 | 0.007 |
Code review and plan review have the most overlap at 0.226. This makes sense: sometimes a plan-level issue resurfaces at the code level because the fix didn't fully address it. But even here, the overlap is modest. Design review is nearly independent from everything else: 0.026 and 0.007.
The most telling result: when looking at the error classes of shared rejections—cases where two gates rejected the same task for the same type of error—there were exactly 3 instances out of 702 classified rejections. Almost zero true redundancy. The rest were complementary: same task, different problems caught by each gate.
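The pairwise numbers in the table reduce to shared catches over the union of catches, so they are one line of arithmetic per pair:

```python
def pairwise_omega(shared: int, union: int) -> float:
    """Pairwise overlap: errors caught by both gates / errors caught by either."""
    return shared / union if union else 0.0

# The gate pairs from the table above: (shared catches, union of catches).
pairs = {
    ("code review", "plan review"): (36, 159),
    ("design review", "plan review"): (10, 384),
    ("code review", "design review"): (2, 281),
}
for (gate_a, gate_b), (shared, union) in pairs.items():
    print(f"{gate_a} <-> {gate_b}: {pairwise_omega(shared, union):.3f}")
# code review <-> plan review: 0.226
# design review <-> plan review: 0.026
# code review <-> design review: 0.007
```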
The software pipeline omega (0.118) and the strong prostate model (0.125) are nearly identical despite being completely different domains. The weak multi-organ model (0.767) is six times higher. Same verification architecture, same pattern: the topology determines coverage quality, and model quality determines where on the spectrum you land.
The Punchline
Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
Model benchmarks measure the model in isolation.
Omega measures the model in the context of your verification infrastructure.
"We measure how good the engine is, but not how good the car is."
There's an asymmetry worth making explicit: model benchmarks measure the model in isolation. Omega measures the model in the context of your verification infrastructure. Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
This points at something the industry hasn't quite internalized yet. We have extensive benchmarks for models. We have none for pipelines. We measure how good the engine is, but not how good the car is.
Omega is a step toward benchmarking the pipeline itself. It doesn't tell you everything. It can't see errors that escape entirely, and it can't tell you whether your gates are catching the right things. But it tells you whether your verification infrastructure is doing independent work or just running the same check multiple times. And it gives you a concrete number to track as you iterate.
You can track omega over time as you improve your models. If you swap to a stronger model and omega drops, the new model is failing in more specific ways. Your verification infrastructure is doing more differentiated work. That's a signal that both the model and the pipeline are improving together.