Overlap Ratio

X-Raying Your AI Pipeline

Don't just benchmark the model. Benchmark the pipeline.

Model benchmarks measure models in isolation, but we deploy models in pipelines. The overlap ratio measures whether your verification gates are doing independent work or just running the same check twice.

0.118 software pipeline omega  ·  3 medical imaging models  ·  4 gate types

The Narrative Arc

This presentation introduces the overlap ratio: a single number that tells you whether your AI verification pipeline is doing independent work or wasting effort on redundant checks. It draws on empirical data from a software delivery pipeline and cross-validates against medical imaging experiments.

The argument proceeds in four stages:

  1. The problem: Model benchmarks measure the wrong thing. We benchmark models in isolation but deploy them in pipelines. The quality of the final artifact depends on the pipeline, not just the model.
  2. The concept: Two gates checking the same thing is waste. The overlap ratio quantifies how much of your verification is redundant.
  3. The evidence: Same gates, three models, completely different omegas. Cross-domain validation from software delivery and medical imaging.
  4. The practice: You can compute omega today. Three signals tell you where your pipeline is weak.

Prior Work

This presentation builds on three prior publications:

The overlap ratio operationalizes one of the four properties from Trust Topology, making it concrete and measurable for practitioners.

About

Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.

The Problem

Model benchmarks measure
the wrong thing.

We benchmark models in isolation but deploy them in pipelines. The quality of the final artifact depends on the arrangement, not just the engine.

What We Measure

Model accuracy on benchmarks. Output quality in isolation. Single-turn correctness. The engine.

What We Deploy

Multi-stage pipelines. Verification gates. Human escalation. Orchestrated workflows. The car.

"We measure engine quality but not the quality of the car."

The Gap Between Benchmarks and Deployment

Model performance matters. The existing benchmarks measure model output quality in a variety of ways. However, when we deploy models, we orchestrate them into a workflow. The quality of intermediate artifacts is relevant, but the thing we really care about is the quality of the final artifact. A better measure for practitioners would focus on how the agent's work composes and contributes to that end result.

Stages and Gates

In practice, work flows through a series of stages that produce intermediate results. Every stage is verified by gates that either pass, require a retry, or escalate to a human for a decision. Each gate sees the work as it progresses and applies specific, stage-appropriate checks before errors propagate and get magnified by the rest of the chain.

Decomposing work into incremental stages with checkpoints is generally a good practice—we've been doing this long before agents. But how do we know that adding a new check actually adds value? How do we decide what needs more coverage versus what's good enough?

The Missing Metric

Intuitively, it's pretty clear: we don't need a new check that only verifies the same thing as an existing one. We can talk about defense in depth, where we verify something from multiple angles. But if two tests are identical, they are wasteful redundancy. What's missing is a way to measure this: a number that tells you whether your verification infrastructure is doing independent work or repeating itself.

The Intuition

Two guards checking IDs
is one guard.

The value of a second check depends entirely on whether it catches something the first one doesn't.

High overlap. Both guards check the same IDs. The second guard adds nothing. You're paying for the same check twice.

Low overlap. One checks IDs, one checks packages. They reject different things. Each guard earns their place.

"You need a different kind of gate, not more passes through the same one."

The Security Guard Problem

Imagine two security guards at a building. If they both check IDs, that second guard isn't adding anything. But if one checks IDs and one checks packages, they have no overlap. They catch different things.

This is intuitively clear, but we can also put a number on this and reason about it empirically. Two checks that reject identical things have 100% overlap, or a score of 1. Two checks that reject entirely different things have 0.

In practice, complex gates might reject something for a blend of tests that trigger past a threshold. Two gates end up with some overlap: they reject some of the same things, perhaps for different reasons. The overlap can be any value from 0 to 1.

We'll talk about this quite a bit later, so instead of constantly saying "overlap ratio," we'll just call it omega, or ω.

[Diagram: High overlap (both guards check IDs): Gate A and Gate B share catches, ω ≈ 1; redundant, the second gate adds nothing. Low overlap (one checks IDs, one checks packages): Gate A and Gate B reject different things, ω ≈ 0; complementary, each gate catches unique errors.]

Connection to Inference Scaling

The inference-scaling literature hits a version of this problem. Brown et al. [1] found that common methods for picking correct solutions from many samples, like majority voting and reward models, plateau beyond several hundred samples. The framework predicts this. When your verification signals have high overlap, more samples cannot help. You need a different kind of gate, not more passes through the same one.

[1] Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," 2024. arxiv.org/abs/2407.21787

The Metric

One number.
How redundant are your gates?

ω = shared catches / total unique catches

ω = 0

Perfectly complementary
Every gate catches unique errors

ω = 1

Perfectly redundant
Every gate catches the same errors

Defining Omega

Omega is the count of things caught by multiple gates divided by the total things caught by all gates. If every gate catches the same things, omega is 1. If each gate catches something unique, omega is 0.
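
To make this concrete, here is a minimal sketch in Python, assuming you have recorded each gate's catches as a set of error identifiers (the gate names and error IDs below are illustrative, not from the study):

```python
from typing import Dict, Set

def omega(catches: Dict[str, Set[str]]) -> float:
    """Overlap ratio: errors caught by more than one gate,
    divided by total unique errors caught by any gate."""
    all_errors = set().union(*catches.values())
    if not all_errors:
        return 0.0
    shared = {e for e in all_errors
              if sum(e in caught for caught in catches.values()) > 1}
    return len(shared) / len(all_errors)

# Every gate catches the same errors -> omega = 1
print(omega({"lint": {"e1", "e2"}, "review": {"e1", "e2"}}))  # 1.0
# Each gate catches something unique -> omega = 0
print(omega({"lint": {"e1"}, "review": {"e2"}}))              # 0.0
```
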

You can measure this two ways: you can either consider all the catches across all the gates in your entire pipeline, or you can describe it for a subset. It can work for just two gates if you want to zoom in on one particular part of your pipeline.
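
When you zoom in on just two gates, the same definition reduces to shared catches over the union of catches (a Jaccard-style overlap). A sketch, with made-up guard catches:

```python
def pairwise_omega(a: set, b: set) -> float:
    """Omega for exactly two gates: shared catches / union of catches."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

ids      = {"no_badge", "expired_badge"}   # what guard A rejects
packages = {"no_badge", "oversize_box"}    # what guard B rejects
print(pairwise_omega(ids, packages))  # 0.333... (one shared catch of three unique)
```
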

This lets you objectively describe how much overlap your pipeline has overall. Or you can see how much one section has. This informs where you might place a new gate because there is too much overlap.

As you'd expect, omega changes with the choice of gates. But here's something interesting: it also changes with the choice of model.

Omega as a Model Diagnostic

A pipeline's omega is defined by what its gates catch. A system designer can judge gate overlap as they chain things together. But what the gates actually catch in practice is also driven by model output.

Given constant gates, you can use omega as a model benchmark. A strong model performs well across a variety of challenges and fails narrowly. Gates pick up specific edge cases. A weak model fails in ways that are broadly wrong and are observable from a variety of perspectives, seen by many gates.

A weak model makes the same fundamental mistakes everywhere. A strong model fails in isolated ways that only certain gates can see.

The Practical Range

Is omega = 0 the ideal? Not necessarily. Some overlap provides a safety net. If a stochastic verifier misses something it usually catches, a second gate covering the same ground catches it instead. The question is how much overlap is useful redundancy versus waste. In my own practice, anything above 0.3 suggests that you're paying for coverage you might already have.

The Evidence

Same gates, three models.
Completely different omegas.

Constant verification. Variable model.
Omega changes with model strength.

Bosma22b 0.125  ·  MONAI 0.312  ·  TotalSegmentator 0.767
"A strong model fails in narrow ways. A weak model makes the same mistakes everywhere."

Medical Imaging Experiment

The experiment ran the same verification gates across three models of various strength on radiological images. The results:

Model              Omega   Story
Bosma22b           0.125   Each gate catches almost entirely unique errors
MONAI              0.312   Some overlap, but each gate still earns its place
TotalSegmentator   0.767   Gates are mostly seeing the same failures

The Shotgun Metaphor

Model performance is like a shotgun. A strong model concentrates its output in a small area, and the gates pick up specific edge cases. A weak model sprays output and issues across a larger cone, hitting most of the gates.

[Diagram: Strong model fails specifically (low ω): across gates A through D, each gate catches something unique, ω = 0.125. Weak model fails globally (high ω): every gate catches the same spray, ω = 0.767.]

Bosma22b is the strongest model here, well suited to the task. The pipeline shows low omega because each gate is doing unique work. TotalSegmentator shows high omega because the gates are catching the same issues over and over.

The implication: given constant gates, omega doubles as a model benchmark. It tells you not just how often a model fails, but how it fails, either in narrow, specific ways (low omega) or in broad, repeated ways (high omega).

Gate Architecture

Two kinds of verification.
One gate.

Each gate has a deterministic side and a stochastic side. Together they form the tri-state gate: pass, fail, or human.

Deterministic

Hard guarantees. Does it compile? Is the detected mass inside the organ boundary? Do the references exist? Pass or fail.

Stochastic

LLM opinion. Judgement calls with no definitive answer. Is the code quality acceptable? Is the segmentation plausible? Are the references relevant? Pass, fail, or escalate.

Pass  →  proceed   ·   Fail  →  retry   ·   Escalate  →  human decides

Deterministic Verifiers

The deterministic verifiers check things we know are true: does it pass lint rules? Is the detected mass inside the organ boundary? These are hard guarantees about the output. They either hold or they don't.

But some deterministic checks use empirical thresholds. Is the organ volume between 10 and 150cc? Is the centroid trajectory smooth enough? The gate itself is a pure function (same input, same output), but cases near the threshold are ambiguous. The centroid moved more than typical, but not enough to be implausible. What do you do with that?

This is where deterministic and stochastic verifiers connect. An inconclusive deterministic result is a natural handoff to the stochastic verifier. The deterministic gate narrowed the field—it's not definitely bad—and the LLM reviewer makes the judgment call on whether it's good enough.
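
A minimal sketch of such a threshold check with an explicit gray zone, using the organ-volume bounds from the text. The function name, the 10% margin, and the tri-state encoding are my own assumptions, not the study's implementation:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNCERTAIN = "uncertain"  # hand off to the stochastic reviewer

def check_organ_volume(volume_cc: float,
                       lo: float = 10.0, hi: float = 150.0,
                       margin: float = 0.10) -> Verdict:
    """Pure threshold check with an explicit gray zone near the bounds."""
    if lo <= volume_cc <= hi:
        return Verdict.PASS
    # Within 10% outside the range: atypical, but not definitely bad.
    if lo * (1 - margin) <= volume_cc <= hi * (1 + margin):
        return Verdict.UNCERTAIN
    return Verdict.FAIL

print(check_organ_volume(80))   # Verdict.PASS
print(check_organ_volume(160))  # Verdict.UNCERTAIN (just outside 150cc)
print(check_organ_volume(300))  # Verdict.FAIL
```

The gate stays a pure function: same input, same output. Only the interpretation of the gray zone is delegated.
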

Stochastic Verifiers

LLMs are imperfect verifiers. They may produce false positives, rejecting things that are actually correct. But they see things deterministic checks can't: whether the code is well-structured, whether the approach makes sense, whether the output matches the intent.

Some pipelines can tolerate a certain amount of uncertainty. A flag for repeated code is not a showstopper, but too much repetition indicates a real problem. An organ detected slightly outside anatomical norms is exceptional but not impossible.

Like the deterministic verifier, the LLM also has an escalation path if it is uncertain: the human.

[Diagram: The artifact first hits the deterministic check (lint · compile · schema): PASS continues down the pipeline, FAIL is rejected, and uncertain results hand off to the stochastic LLM reviewer, which returns PASS, FAIL, or escalates to a human.]

The Tri-State Gate

People normally assume gates are pass/fail, but the real benefit comes from pass/fail/human. Deterministic tests provide hard guarantees and lead to hard fail. Stochastic gates are by definition based on probability, so uncertain decisions can be passed up to a human for final judgment.

The benefit of this third state is that you can handle uncertain cases in two ways: escalate them to a human, or loosen the gate and let downstream gates catch the edge cases.

However, the real value comes from the filtering effect: the pipeline handles clear cases automatically. The human only sees things where they add unique expertise.
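
One way the tri-state flow could be wired up, as a sketch; the callback names and string verdicts are assumptions for illustration, not the author's implementation:

```python
def run_gate(artifact, deterministic, stochastic, regenerate, max_retries=3):
    """Tri-state gate: deterministic check first, stochastic reviewer on
    uncertainty. Pass proceeds, fail retries, escalation goes to a human."""
    for _ in range(max_retries):
        verdict = deterministic(artifact)
        if verdict == "uncertain":
            verdict = stochastic(artifact)   # LLM judgment call
        if verdict == "pass":
            return ("accepted", artifact)
        if verdict == "escalate":
            return ("human", artifact)       # queue for human decision
        artifact = regenerate(artifact)      # "fail": retry the stage
    return ("human", artifact)               # retries exhausted: escalate

# Toy run: accept anything longer than 3 chars, regenerate by appending.
result = run_gate("ab",
                  deterministic=lambda a: "pass" if len(a) > 3 else "fail",
                  stochastic=lambda a: "pass",
                  regenerate=lambda a: a + "x")
print(result)  # ('accepted', 'abxx')
```

Note the filtering effect in code form: the human branch is only reached on explicit escalation or exhausted retries, never for clear passes or fails.
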

Diagnostics

Three signals your pipeline
needs attention.

High omega. Gates catching the same things. Add a different kind of check.

Escaped errors. Not enough coverage. Check the failure modes you're seeing in the final artifact.

Frequent escalations. A gate at its edge. Clustered escalations point at a high-value gate.

Where to Add the Next Gate

There's an old saying in engineering: you can have something high quality, built quickly, and within budget. Choose two. This applies to gate design as well. If we have to choose which gate to add to fit within our constraints, what gets the best return? Omega tells us: wherever your existing gates have the least coverage.

But omega only tells us about redundancy between existing gates. It says nothing about errors that escape the entire pipeline. How do we detect those?

[Dashboard: three gauges. Omega (gate redundancy): 0.118 on a 0 to 1.0 scale, review above 0.3 · Escapes (pipeline escapes): 0 to 100%, review above 10% · Escalations (gate uncertainty): 0 to 100%, review above 20%.]

Signal 1: High Omega

High omega means your gates are catching the same failures. Either the gates are too similar (design problem) or the model is weak enough that it fails broadly (model problem). Both are actionable: redesign the gate or swap the model.

Signal 2: Escaped Errors

Models generally fail in three ways: omissions, incorrectness, or incoherence. You can look at the type of failures appearing in the final product and reason about what gates might catch that kind of error. If you discover the model is consistently missing something, the obvious fix is to add a check for that thing.

A specific example: the Medical Imaging study mentioned earlier has one gate that checks centroid smoothness. It is the only test for that type of violation. Remove it, and a specific kind of error spikes.

Signal 3: Frequent Escalations

A large number of escalations from one gate indicates that it is on the edge of its ability. If those escalations are consistently about the same thing, they are a signpost pointing at a high-value gate. This is the tri-state gate doing its job: flagging exactly where you need to invest.

Getting Started

You already have gates.
Start measuring.

However you prompt an agent, you have some form of stages and verification. Even ad hoc corrections in chat logs count.

Explicit gates: Code compiled? Lint passed? Review approved? These are gates. Record what each one catches.

Implicit gates: You told the LLM to fix something it made? That's a failed gate. Use an LLM to classify your chat logs. Every correction is data.

End-user gates: Reader says "stop sending LLM slop"? The final verification step just failed. That's an escaped error.

"Add new coverage, check your omega, repeat."

Computing Omega on Your Own Pipeline

You can calculate omega on your pipeline today. However you prompt an agent, you have some form of stages and gates. Your pipeline begins with a job to do: you provide a specification and tell it to build, or you provide an image and ask it to classify, or you explain your task and have it write a document.

Some pipelines have explicit intermediate artifacts—and I'd encourage you to do that—but even the simplest has some kind of final verification step. You tell it to draft an email, you send it, the reader asks you to stop sending LLM slop: verification failed. (And, really, you should also check this yourself before you click send.)

Code compiled? Verification passed. You try the code and it corrupts your data? Failed. It tells you an image of a prostate is a kidney? Failed.

Count the failures, calculate omega.

Mining Chat Logs

While structured gates are ideal, even ad hoc corrections count. Use an LLM to examine chat logs and classify each prompt, looking for prompts that ask the LLM to fix something it made. A correction like this is a gate that failed.
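
As a crude stand-in for that LLM classifier, a keyword heuristic can illustrate the data shape. The cue phrases and the example log below are illustrative only; a real pipeline would classify with an LLM as described above:

```python
# Flag prompts that ask the assistant to fix its own output.
# Each flagged prompt is an implicit gate that failed.
CORRECTION_CUES = ("fix", "that's wrong", "you broke", "undo",
                   "that doesn't compile", "try again")

def is_correction(prompt: str) -> bool:
    p = prompt.lower()
    return any(cue in p for cue in CORRECTION_CUES)

log = [
    "write a function that parses the config",
    "that doesn't compile, fix the import",
    "now add tests",
]
failed_gates = [p for p in log if is_correction(p)]
print(len(failed_gates))  # 1 -- one implicit gate failure in this log
```
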

I've provided an open source tool that analyzes Claude Code session logs. It uses LLM classification (with your API key) to discover review gates automatically, classify their outcomes, and categorize the error types. Then it calculates your omega.

The Iteration Loop

More gates are better, though at some point you get diminishing returns. The important refinement is to add new coverage, check your omega to validate, and repeat. If omega goes up as you add gates, your new gates are redundant. If it goes down, they're catching new things.
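
A toy run of that loop, with omega computed as shared catches over total unique catches; the gate names and error IDs are invented for illustration:

```python
def omega(catches):
    """Shared catches (caught by more than one gate) / total unique catches."""
    errors = set().union(*catches.values())
    shared = {e for e in errors
              if sum(e in c for c in catches.values()) > 1}
    return len(shared) / len(errors) if errors else 0.0

gates = {"lint": {"e1", "e2"}, "review": {"e3"}}
print(round(omega(gates), 2))                              # 0.0, fully complementary

# Redundant addition: re-catches what lint already catches -> omega rises.
print(round(omega({**gates, "lint2": {"e1", "e2"}}), 2))   # 0.67
# Complementary addition: catches something new -> omega stays put.
print(round(omega({**gates, "security": {"e4"}}), 2))      # 0.0
```
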

Stabilization

How many tasks do you need before omega stabilizes? In practice, the pairwise numbers are consistent after about 100 rejected tasks. With fewer than that, individual rejections swing the ratio significantly. If you're just starting out, don't over-index on the exact number; look at the trend instead.

The Data

Global omega: 0.118.
Nearly zero true redundancy.

389 rejected tasks. 4 gate types. 3 out of 702 classified rejections were truly redundant.

[Heatmap: pairwise gate overlap, software pipeline, 389 rejected tasks. Code ↔ Design: 0.007 (nearly zero) · Code ↔ Plan: 0.226 (moderate) · Design ↔ Plan: 0.026 (very low). Color intensity = overlap; white = independent, teal = shared catches.]

Software pipeline 0.118  ≈  Bosma22b 0.125  ·  Cross-domain, same pattern.

The Software Pipeline

The analysis ran across a software delivery pipeline with four gate types: plan review, design review, agentic cross-artifact code review, and single file code review. These are spread across distinct stages, so a plan review gate and a code review gate are looking at completely different artifacts.

The global omega across 389 rejected tasks: 0.118. Very low. The gates are doing almost entirely independent work. (In the spirit of full disclosure: most of this work is done with Opus 4.5 or 4.6, so model strength also contributes to low omega.)

Pairwise Breakdown

Gate Pair                      Shared   Union   Omega
code review ↔ plan review        36      159    0.226
design review ↔ plan review      10      384    0.026
code review ↔ design review       2      281    0.007

Code review and plan review have the most overlap at 0.226. This makes sense: sometimes a plan-level issue resurfaces at the code level because the fix didn't fully address it. But even here, the overlap is modest. Design review is nearly independent from everything else: 0.026 and 0.007.
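
As a quick sanity check, each row's omega is just the shared count divided by the union:

```python
# Pairwise omega = shared catches / union of catches, using the table's counts.
pairs = {
    "code ↔ plan":   (36, 159),
    "design ↔ plan": (10, 384),
    "code ↔ design": (2, 281),
}
for name, (shared, union) in pairs.items():
    print(f"{name}: {shared / union:.3f}")
# code ↔ plan: 0.226
# design ↔ plan: 0.026
# code ↔ design: 0.007
```
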

True Redundancy

The most telling result: when looking at the error classes of shared rejections—cases where two gates rejected the same task for the same type of error—there were exactly 3 instances out of 702 classified rejections. Almost zero true redundancy. The rest were complementary: same task, different problems caught by each gate.

Cross-Domain Comparison

The software pipeline omega (0.118) and the strong prostate model (0.125) are nearly identical despite being completely different domains. The weak multi-organ model (0.767) is six times higher. Same verification architecture, same pattern: the topology determines coverage quality, and model quality determines where on the spectrum you land.

The Punchline

Don't just benchmark the model.
Benchmark the pipeline.

Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.

Model benchmarks measure the model in isolation.
Omega measures the model in the context of your verification infrastructure.

1. Identify your gates, both explicit and implicit.
2. Record what each gate catches.
3. Compute omega. Add coverage where overlap is low. Repeat.
"We measure how good the engine is, but not how good the car is."

The Asymmetry

There's an asymmetry worth making explicit: model benchmarks measure the model in isolation. Omega measures the model in the context of your verification infrastructure. Two models can have identical accuracy scores but very different omegas, because one fails in ways your gates can see and the other fails in ways they can't.

What the Industry is Missing

This points at something the industry hasn't quite internalized yet. We have extensive benchmarks for models. We have none for pipelines. We measure how good the engine is, but not how good the car is.

Omega is a step toward benchmarking the pipeline itself. It doesn't tell you everything. It can't see errors that escape entirely, and it can't tell you whether your gates are catching the right things. But it tells you whether your verification infrastructure is doing independent work or just running the same check multiple times. And it gives you a concrete number to track as you iterate.

Tracking Over Time

You can track omega over time as you improve your models. If you swap to a stronger model and omega drops, the new model is failing in more specific ways. Your verification infrastructure is doing more differentiated work. That's a signal that both the model and the pipeline are improving together.

Questions Worth Sitting With

  • Is omega = 0 the ideal? Not necessarily. Some overlap provides a safety net. In my data, everything below 0.312 had gates delivering value. At 0.767, the gates were clearly redundant. The value flips somewhere between those numbers, though I don't have the data to narrow it down.
  • How many tasks before omega stabilizes? I saw it get to within about 10% of its final value after about 75 rejected tasks. The first 30 were too noisy to be meaningful; after that, look at the trend, not the exact number.
  • Can you compare omega across teams? Yes: if you hold gates constant, omega becomes a model diagnostic. If you hold models constant, it becomes a pipeline diagnostic.