Overlap Ratio
Don't just benchmark the model. Benchmark the pipeline.
Model benchmarks measure models in isolation, but we deploy models in pipelines. The overlap ratio measures whether your verification gates are doing independent work or just running the same check twice.
0.118 software pipeline omega · 3 medical imaging models · 4 gate types
This presentation introduces the overlap ratio: a single number that tells you whether your AI verification pipeline is doing independent work or wasting effort on redundant checks. It draws on empirical data from a software delivery pipeline and cross-validates against medical imaging experiments.
The argument proceeds in four stages: the problem, the intuition, the metric, and the evidence.
This presentation builds on three prior publications.
The overlap ratio operationalizes one of the four properties from Trust Topology, making it concrete and measurable for practitioners.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
The Problem
We benchmark models in isolation but deploy them in pipelines. The quality of the final artifact depends on the arrangement, not just the engine.
Model accuracy on benchmarks. Output quality in isolation. Single-turn correctness. The engine.
Multi-stage pipelines. Verification gates. Human escalation. Orchestrated workflows. The car.
"We measure engine quality but not the quality of the car."
Model performance matters. The existing benchmarks measure model output quality in a variety of ways. However, when we deploy models, we orchestrate them into a workflow. The quality of intermediate artifacts is relevant, but the thing we really care about is the quality of the final artifact. A better measure for practitioners would focus on how the agent's work composes and contributes to that end result.
In practice, work flows through a series of stages that produce intermediate results. Every stage is verified by gates that either pass, require a retry, or escalate to a human for a decision. Each gate sees the work as it progresses and applies specific, state-appropriate checks before errors propagate and get magnified by the rest of the chain.
Decomposing work into incremental stages with checkpoints is generally a good practice—we've been doing this long before agents. But how do we know that adding a new check actually adds value? How do we decide what needs more coverage versus what's good enough?
Intuitively, it's pretty clear: we don't need a new check that only verifies the same thing as an existing one. We can talk about defense in depth, where we verify something from multiple angles. But if two tests are identical, they are wasteful redundancy. What's missing is a way to measure this: a number that tells you whether your verification infrastructure is doing independent work or repeating itself.
The Intuition
The value of a second check depends entirely on whether it catches something the first one doesn't.
High overlap. Both guards check the same IDs. The second guard adds nothing. You're paying for the same check twice.
Low overlap. One checks IDs, one checks packages. They reject different things. Each guard earns their place.
"You need a different kind of gate, not more passes through the same one."
Imagine two security guards at a building. If they both check IDs, that second guard isn't adding anything. But if one checks IDs and one checks packages, they have no overlap. They catch different things.
This is intuitively clear, but we can also put a number on this and reason about it empirically. Two checks that reject identical things have 100% overlap, or a score of 1. Two checks that reject entirely different things have 0.
In practice, complex gates might reject something for a blend of tests that trigger past a threshold. Two gates end up with some overlap: they reject some of the same things, perhaps for different reasons. The overlap can be any value from 0 to 1.
We'll talk about this quite a bit later, so instead of constantly saying "overlap ratio," we'll just call it omega, or ω.
The inference-scaling literature hits a version of this problem. Brown et al.¹ found that common methods for picking correct solutions from many samples, like majority voting and reward models, plateau beyond several hundred samples. The framework predicts this. When your verification signals have high overlap, more samples cannot help. You need a different kind of gate, not more passes through the same one.
¹ Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," 2024. arxiv.org/abs/2407.21787
The Metric
ω = shared catches / total unique catches
ω = 0
Perfectly complementary
Every gate catches unique errors
ω = 1
Perfectly redundant
Every gate catches the same errors
In a multi-stage pipeline, a task flows through stages: plan, design, code. Each stage has its own review gate checking a different artifact that is produced for a task at that stage. A task can pass plan review, then fail code review. Or it can fail plan review, get revised, pass, and then fail design review for a completely different reason. The same task produces different artifacts that are checked at multiple points in its lifecycle.
The overlap ratio asks: when two different gates both rejected artifacts from the same task, were they catching the same problem or different ones? Count the tasks caught by more than one gate, divide by the total caught by all gates. If every gate catches the same tasks, omega is 1. If each gate catches unique tasks, omega is 0.
You can measure this across your entire pipeline for a global number, or zoom into a specific pair of gates. This tells you where you have redundancy and where a new gate might add the most value.
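To make the counting concrete, here's a minimal Python sketch. It assumes rejection records are simple (task, gate) pairs; the record shape and gate names are illustrative, not the format of the study's tooling.

```python
from itertools import combinations

# Each record pairs a task with a gate that rejected one of its artifacts.
rejections = [
    ("task-1", "plan_review"),
    ("task-1", "code_review"),
    ("task-2", "design_review"),
    ("task-3", "plan_review"),
]

def gate_catches(records):
    """Map each gate to the set of tasks it rejected."""
    catches = {}
    for task, gate in records:
        catches.setdefault(gate, set()).add(task)
    return catches

def global_omega(records):
    """Tasks caught by more than one gate, over all caught tasks."""
    catches = gate_catches(records)
    all_tasks = set().union(*catches.values())
    shared = {t for t in all_tasks
              if sum(t in tasks for tasks in catches.values()) > 1}
    return len(shared) / len(all_tasks)

def pairwise_omega(records):
    """Omega for each pair of gates: shared catches / union of catches."""
    catches = gate_catches(records)
    return {
        (a, b): len(catches[a] & catches[b]) / len(catches[a] | catches[b])
        for a, b in combinations(sorted(catches), 2)
    }

print(global_omega(rejections))    # 1 shared task out of 3 caught -> 0.33
print(pairwise_omega(rejections))  # per-pair redundancy
```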
Omega works across two fundamentally different pipeline shapes:
Sequential with revision (e.g., software delivery): a task flows through stages, producing artifacts that can be revised after each rejection. When two gates both reject artifacts from the same task, they saw different versions of the work at different stages. A shared catch means either the revision didn't fully address the issue, or the second gate found a different problem that only became visible after the first was fixed.
Sequential without revision (e.g., medical imaging): the pipeline has distinct stages (gland segmentation, then lesion detection) but no revision cycle. Within each stage, multiple quality checks run in parallel against the same artifact. A segmentation model produces a mask, and then volume checks, centroid trajectory checks, and boundary checks all evaluate that same mask simultaneously. A shared catch within a stage means genuine redundancy: two gates flagged the exact same artifact for overlapping reasons. A shared catch across stages is more like the software case: different artifacts, different perspectives, but without the revision confound.
The within-stage parallel case is a cleaner measurement of redundancy, because there's no confound from revision or different artifacts. The sequential-with-revision case is more common in practice and still informative, but shared catches carry a subtly different meaning.
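For the no-revision shape, a small sketch of that within-stage versus cross-stage split might look like this, assuming a hypothetical gate-to-stage map (the gate names are illustrative):

```python
# Hypothetical gate -> stage map for the no-revision (imaging) shape.
STAGE = {
    "volume_check": "segmentation",
    "centroid_check": "segmentation",
    "boundary_check": "segmentation",
    "confidence_check": "detection",
}

def split_shared_catches(records):
    """Split multi-gate catches into within-stage (clean redundancy)
    and cross-stage (different artifacts, different perspectives).
    A case whose gates span stages counts as cross-stage here."""
    gates_per_case = {}
    for case, gate in records:
        gates_per_case.setdefault(case, set()).add(gate)
    within = cross = 0
    for gates in gates_per_case.values():
        if len(gates) < 2:
            continue
        if len({STAGE[g] for g in gates}) == 1:
            within += 1
        else:
            cross += 1
    return within, cross
```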
As you'd expect, omega changes with the choice of gates. But here's something interesting: it also changes with the choice of model.
A pipeline's omega is defined by what its gates catch. A system designer can judge gate overlap as they chain things together. But what the gates actually catch in practice is also driven by model output.
Given constant gates, you can use omega as a model benchmark. A strong model performs well across a variety of challenges and fails narrowly; gates pick up specific edge cases. A weak model fails in ways that are broadly wrong, visible from many perspectives and therefore caught by many gates.
A weak model makes the same fundamental mistakes everywhere. A strong model fails in isolated ways that only certain gates can see.
Is omega = 0 the ideal? Not necessarily. Some overlap provides a safety net. If a stochastic verifier misses something it usually catches, a second gate covering the same ground catches it instead. The question is how much overlap is useful redundancy versus waste. In my own practice, anything above 0.3 suggests that you're paying for coverage you might already have.
The Evidence
Constant verification. Variable model.
Omega changes with model strength.
"A strong model fails in narrow ways. A weak model makes the same mistakes everywhere."
The medical imaging pipeline has two stages (gland segmentation, then lesion detection) but no revision cycle. Within each stage, multiple quality checks run in parallel against the same artifact. A segmentation model produces a mask, and then volume, centroid, boundary, and smoothness checks all evaluate that single artifact simultaneously.
When two gates within the same stage both flag a case, it's genuine redundancy: both flagged a problem in the same output. With a strong model, cross-stage overlap is rare: only 2 out of 143 rejections were shared across stages for Bosma22b (N=1,500), because the stages check fundamentally different artifacts.
The experiment ran the same verification gates across three models of varying strength on radiological images. The results:
| Model | Omega | Story |
|---|---|---|
| Bosma22b | 0.125 | Each gate catches almost entirely unique errors |
| MONAI | 0.312 | Some overlap, but each gate still earns its place |
| TotalSegmentator | 0.767 | Gates are mostly seeing the same failures |
Model failure is like a shotgun spread. A strong model concentrates its failures in a tight pattern, and the gates pick up specific edge cases. A weak model sprays errors across a wide cone, hitting most of the gates.
Bosma22b is the strongest model here, well suited to the task. The pipeline shows low omega because each gate is doing unique work. TotalSegmentator shows high omega because the gates are catching the same issues over and over.
The implication: given constant gates, omega doubles as a model benchmark. It tells you not just how often a model fails, but how it fails, either in narrow, specific ways (low omega) or in broad, repeated ways (high omega).
Gate Architecture
Each gate has a deterministic side and a stochastic side. Together they form the tri-state gate: pass, fail, or human.
Hard guarantees. Does it compile? Is the detected mass inside the organ boundary? Do the references exist? Pass or fail.
LLM opinion. Judgement calls with no definitive answer. Is the code quality acceptable? Is the segmentation plausible? Are the references relevant? Pass, fail, or escalate.
Pass → proceed · Fail → retry · Escalate → human decides
The deterministic verifiers check things we know are true: does it pass lint rules? Is the detected mass inside the organ boundary? These are hard guarantees about the output. They either hold or they don't.
But some deterministic checks use empirical thresholds. Is the organ volume between 10 and 150cc? Is the centroid trajectory smooth enough? The gate itself is a pure function (same input, same output), but cases near the threshold are ambiguous. The centroid moved more than typical, but not enough to be implausible. What do you do with that?
This is where deterministic and stochastic verifiers connect. An inconclusive deterministic result is a natural handoff to the stochastic verifier. The deterministic gate narrowed the field—it's not definitely bad—and the LLM reviewer makes the judgment call on whether it's good enough.
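A minimal sketch of that handoff, using the volume check as the example. The 10–150cc hard bounds come from above; the inner "typical" band and the wiring to the LLM reviewer are illustrative assumptions:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"   # hand to a human

# Hard bounds from the text; the inner "typical" band is illustrative.
HARD_MIN, HARD_MAX = 10.0, 150.0   # outside this: definitely implausible
SOFT_MIN, SOFT_MAX = 20.0, 120.0   # inside this: clearly typical

def volume_gate(volume_cc, llm_review):
    """Deterministic check with an inconclusive band that hands off
    to a stochastic (LLM) reviewer, which may itself escalate."""
    if not HARD_MIN <= volume_cc <= HARD_MAX:
        return Verdict.FAIL       # hard guarantee violated
    if SOFT_MIN <= volume_cc <= SOFT_MAX:
        return Verdict.PASS       # pure function: same input, same output
    # Unusual but not impossible: the deterministic gate narrowed the
    # field, and the judgment call goes to the LLM reviewer.
    return llm_review(volume_cc)  # returns PASS, FAIL, or ESCALATE
```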
LLMs are imperfect verifiers. They may produce false positives, rejecting things that are actually correct. But they see things deterministic checks can't: whether the code is well-structured, whether the approach makes sense, whether the output matches the intent.
Some pipelines can tolerate a certain amount of uncertainty. A flag for repeated code is not a showstopper, but too much repetition indicates a real problem. An organ detected slightly outside anatomical norms is exceptional but not impossible.
Like the deterministic verifier, the LLM also has an escalation path if it is uncertain: the human.
People normally assume gates are pass/fail, but the real benefit comes from pass/fail/human. Deterministic tests provide hard guarantees and lead to hard fail. Stochastic gates are by definition based on probability, so uncertain decisions can be passed up to a human for final judgment.
The benefit of this third state is that you can handle uncertain cases in two ways: escalate them to a human, or loosen the gate and let downstream gates catch the edge cases.
However, the real value comes from the filtering effect: the pipeline handles clear cases automatically. The human only sees things where they add unique expertise.
Diagnostics
High omega. Gates catching the same things. Add a different kind of check.
Escaped errors. Not enough coverage. Check the failure modes you're seeing in the final artifact.
Frequent escalations. A gate at its edge. Clustered escalations point at a high-value gate.
There's an old saying in engineering: you can have something high quality, built quickly, and within budget. Choose two. This applies to gate design as well. If we have to choose which gate to add to fit within our constraints, what gets the best return? Omega tells us: wherever your existing gates have the least coverage.
But omega only tells us about redundancy between existing gates. It says nothing about errors that escape the entire pipeline. How do we detect those?
High omega means your gates are catching the same failures. Either the gates are too similar (design problem) or the model is weak enough that it fails broadly (model problem). Both are actionable: redesign the gate or swap the model.
Models generally fail in three ways: omissions, incorrectness, or incoherence. Look at the types of failures appearing in the final product and reason about what gates might catch that kind of error. If you discover the model is consistently missing something, the obvious fix is to add a check for that thing.
A specific example: the Medical Imaging study mentioned earlier has one gate that checks centroid smoothness. It is the only test for that type of violation. Remove it, and a specific kind of error spikes.
A large number of escalations from one gate indicates that it is on the edge of its ability. If those escalations are consistently about the same thing, they are a signpost pointing at a high-value gate. This is the tri-state gate doing its job: flagging exactly where you need to invest.
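Finding those clusters doesn't require anything fancy. A sketch, assuming escalation records carry a gate name and a classified topic (both labels illustrative):

```python
from collections import Counter

# Escalation records: (gate, topic). Topic labels are illustrative;
# in practice they might come from the same LLM classification pass.
escalations = [
    ("design_review", "unclear-requirements"),
    ("design_review", "unclear-requirements"),
    ("design_review", "api-compatibility"),
    ("code_review", "naming"),
]

per_gate = Counter(gate for gate, _ in escalations)
clustered = Counter(escalations)  # same gate *and* same topic

print(per_gate.most_common())     # which gate is at its edge
print(clustered.most_common(3))   # repeated topics point at a missing gate
```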
Getting Started
However you prompt an agent, you have some form of stages and verification. Even ad hoc corrections in chat logs count.
Explicit gates: Code compiled? Lint passed? Review approved? These are gates. Record what each one catches.
Implicit gates: You told the LLM to fix something it made? That's a failed gate. Use an LLM to classify your chat logs. Every correction is data.
End-user gates: Reader says "stop sending LLM slop"? The final verification step just failed. That's an escaped error.
"Add new coverage, check your omega, repeat."
You can calculate omega on your pipeline today. The key is tracking rejections per task across gates. For each task that flows through your pipeline, record which gates rejected it and why.
The task-level omega is straightforward: the number of tasks rejected by more than one gate type, divided by the total number of rejected tasks. In my pipeline, 46 out of 389 rejected tasks had artifacts rejected by multiple gates. That's the headline number.
But you can go deeper. For those 46 tasks, were the gates catching the same issue or different issues? This is catch-level omega: compare the specific problems each gate flagged using text similarity. In my data, only 3 out of 702 classified catches were truly redundant. The rest were complementary: same task, different problems found by each gate.
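One way to approximate catch-level omega, assuming you've recorded the problem text each gate flagged. The similarity method (difflib here) and the threshold are illustrative choices, not the exact procedure behind the 3-out-of-702 number:

```python
from difflib import SequenceMatcher

def redundant_catch_pairs(catches, threshold=0.8):
    """Count catch pairs on the same task whose flagged problems read
    as near-duplicates. `catches` is a list of
    (task_id, gate, problem_text) tuples."""
    redundant = 0
    for i, (task_a, gate_a, text_a) in enumerate(catches):
        for task_b, gate_b, text_b in catches[i + 1:]:
            if task_a != task_b or gate_a == gate_b:
                continue  # only compare different gates on the same task
            if SequenceMatcher(None, text_a, text_b).ratio() >= threshold:
                redundant += 1
    return redundant
```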
You don't need a complex setup. However you prompt an agent, you have some form of stages and verification. Code compiled? Verification passed. You try the code and it corrupts your data? Failed. You tell it to draft an email, you send it, the reader asks you to stop sending LLM slop: verification failed. Count the failures per gate per task, calculate omega.
While structured gates are ideal, even ad hoc corrections count. Use an LLM to examine chat logs and classify each prompt, looking for prompts that ask the LLM to fix something it made. A correction like this is a gate that failed.
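A sketch of that classification pass. `llm_yes_no` is a hypothetical placeholder wrapping whatever LLM client you use, and the prompt wording is illustrative:

```python
# Hypothetical helper: `llm_yes_no` wraps your LLM client of choice
# and returns True/False. The prompt wording is illustrative.
PROMPT = (
    "Does this user message ask the assistant to fix something the "
    "assistant itself produced earlier? Answer yes or no.\n\n{msg}"
)

def implicit_gate_failures(user_messages, llm_yes_no):
    """Each correction prompt counts as one failed implicit gate."""
    return [m for m in user_messages if llm_yes_no(PROMPT.format(msg=m))]
```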
I've published an open source tool that analyzes Claude Code session logs. Using LLM classification (with your API key), it discovers review gates automatically, classifies their outcomes, and categorizes the error types. Then it calculates your omega.
More gates are better, though at some point you get diminishing returns. The important refinement is to add new coverage, check your omega to validate, and repeat. If omega goes up as you add gates, your new gates are redundant. If it goes down, they're catching new things.
How many tasks do you need before omega stabilizes? In practice, the pairwise numbers are consistent after about 100 rejected tasks. With fewer than that, individual rejections swing the ratio significantly. If you're just starting out, don't over-index on the exact number; look at the trend instead.
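A simple way to watch the trend is to recompute omega as rejections accumulate. A sketch, reusing the (task, gate) record shape from earlier:

```python
def omega_trend(records):
    """Global omega recomputed after each (task, gate) rejection record,
    in arrival order. Below ~100 rejected tasks, expect the curve to
    swing; watch for it to flatten."""
    gates_per_task = {}
    trend = []
    for task, gate in records:
        gates_per_task.setdefault(task, set()).add(gate)
        shared = sum(1 for g in gates_per_task.values() if len(g) > 1)
        trend.append(shared / len(gates_per_task))
    return trend
```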
The Data
389 rejected tasks. 4 gate types. 3 out of 702 classified rejections were truly redundant.
Software pipeline 0.118 ≈ Bosma22b 0.125 · Cross-domain, same pattern.
The analysis ran across a software delivery pipeline with four gate types: plan review, design review, agentic cross-artifact code review, and single file code review. These are spread across distinct stages, so a plan review gate and a code review gate are looking at completely different artifacts.
The global omega across 389 rejected tasks: 0.118. Very low. The gates are doing almost entirely independent work. (In the spirit of full disclosure: most of this work is done with Opus 4.5 or 4.6, so model strength also contributes to low omega.)
Not all gates work equally hard. Looking at the per-gate rejection rates across tasks reveals which gates are doing the heavy lifting:
| Gate | Tasks Seen | Tasks Rejected | Rejection Rate |
|---|---|---|---|
| Plan Review | 214 | 153 | 71.5% |
| Design Review | 580 | 241 | 41.6% |
| Code Review | 148 | 42 | 28.4% |
Plan review rejects nearly three quarters of the initial artifacts for the tasks it sees. By the time work reaches code review, only 28% gets rejected. This is the cascade effect: upstream gates filter out problems early, so downstream gates see cleaner work.
This is useful in several ways. A gate with a high rejection rate is doing the most filtering. That's either a sign that it's catching problems at the cheapest stage (good pipeline design), or that the stage feeding it is producing low-quality output (model or prompt problem). A gate with a very low rejection rate might be redundant, or it might be the only gate catching a rare but critical class of error. The rejection rate alone doesn't tell you which, but combined with omega it does: low rejection rate + low pairwise omega means the gate is catching something unique that nothing else sees.
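As a heuristic, you could fold both signals into a single read per gate. The thresholds below are illustrative, not derived from this data:

```python
def read_gate(rejection_rate, max_pairwise_omega):
    """Heuristic read of one gate from the two signals together."""
    if max_pairwise_omega >= 0.3:
        return "likely redundant: overlaps heavily with another gate"
    if rejection_rate < 0.10 and max_pairwise_omega < 0.05:
        return "unique catcher: low volume, but nothing else sees it"
    return "earning its place"
```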
| Gate Pair | Shared | Union | Omega |
|---|---|---|---|
| code review ↔ plan review | 36 | 159 | 0.226 |
| design review ↔ plan review | 10 | 384 | 0.026 |
| code review ↔ design review | 2 | 281 | 0.007 |
Code review and plan review have the most overlap at 0.226. Thirty-six tasks had artifacts that were rejected at both stages. This makes sense: sometimes a plan-level issue resurfaces at the code level because the fix didn't fully address it. But even here, the overlap is modest. Design review is nearly independent of everything else: 0.026 and 0.007.
The most telling result: when looking at the error classes of shared rejections—cases where two gates rejected artifacts from the same task for the same type of error—there were exactly 3 instances out of 702 classified rejections. Almost zero true redundancy. The rest were complementary: same task, different problems caught by each gate.
The software pipeline omega (0.118) and the strong prostate model (0.125) are nearly identical despite being completely different domains. The weak multi-organ model (0.767) is six times higher. Same verification architecture, same pattern: the topology determines coverage quality, and model quality determines where on the spectrum you land.
The Punchline
Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
Model benchmarks measure the model in isolation.
Omega measures the model in the context of your verification infrastructure.
"We measure how good the engine is, but not how good the car is."
There's an asymmetry worth making explicit: model benchmarks measure the model in isolation. Omega measures the model in the context of your verification infrastructure. Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.
This points at something the industry hasn't quite internalized yet. We have extensive benchmarks for models. We have none for pipelines. We measure how good the engine is, but not how good the car is.
Omega is a step toward benchmarking the pipeline itself. It doesn't tell you everything. It can't see errors that escape entirely, and it can't tell you whether your gates are catching the right things. But it tells you whether your verification infrastructure is doing independent work or just running the same check multiple times. And it gives you a concrete number to track as you iterate.
You can track omega over time as you improve your models. If you swap to a stronger model and omega drops, the new model is failing in more specific ways. Your verification infrastructure is doing more differentiated work. That's a signal that both the model and the pipeline are improving together.