Trust Topology
A practitioner's framework for engineering trust from unreliable agents.
Alignment is undecidable. Trust is measurable. Over 97 days I accumulated 5,109 cross-model review gate checks and classified every rejection. The arrangement of those gates determines system reliability more than the capability of any model inside them.
5,109 gate checks · 1,450 genuine rejections · 97 days · 8 projects
This presentation introduces Trust Topology: a design calculus for reasoning about verification pipelines in AI agent systems. It draws on extensive field data from autonomous AI agents shipping production code, and connects that data to the inference-scaling literature, distributed systems theory, and computability theory.
The argument proceeds in four stages:
This presentation builds on two prior publications:
Trust Topology is the theoretical framework that explains why the patterns in those studies work.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
The Problem
Automated pipelines assume components fail in predictable ways. LLM output violates that assumption. It can be incomplete, fabricated, or plausible yet wrong in wildly different ways.
Incomplete. Requirements left out, components not implemented, edge cases ignored.
Fabricated. Plausible code that compiles, passes linting, and does the wrong thing.
Contradictory. Internally inconsistent, self-defeating logic that looks coherent on the surface.
"Except, it turns out that most of the failures aren't unpredictable at all. They have structure, and systems can exploit that structure."
Human time is more important than machine time. When we deliver software, we embrace this concept through automation: if a task can be done successfully, without supervision, by a machine, then it should be done by a machine. The entire practice of CI/CD rests on this idea. Automation gives us an additional benefit: it is repeatable, which means it is predictable, which means it is trustable.
But what happens when the thing being automated is itself unpredictable? We know how to handle components that fail in conventional ways. LLM output is different: dealing with output that looks right but fails unpredictably is fundamentally harder than anything CI/CD has solved so far.
Given that we cannot treat model output as reliable, how do we engineer systems that are reliable anyway?
This is not a new question. It's the same question that distributed systems researchers answered decades ago, and it has the same answer: make reliability a property of the protocol, not the nodes.
This question applies wherever AI agents turn human intent into concrete artifacts. I use software development as the illustrative domain because it is both my area of expertise and it offers the richest existing verification infrastructure, but the framework is domain-general.
Distributed Systems
AI agents are just another unreliable component.
| Component Approach | System Approach |
|---|---|
| How many samples should I draw? | How should gates be arranged? |
| How large should my verifier be? | What makes one topology better? |
| How do I allocate compute? | Where does verification necessarily fail? |
"Most of the literature still treats verification as a component choice, not a pipeline topology problem."
The inference-scaling research community is converging on a version of this insight, but from the component side:
These are important results. But they mostly operate within the same frame: one model, one verifier, one stage. The unit of analysis is the component.
A single model call is bounded computation. But once you wrap a model in an agentic loop with tool calls and persistent state, you've built a program. The model proposes; the loop decides what to do next. These systems can, in principle, simulate arbitrary computation, and Melo et al. show that alignment for such systems is formally undecidable. You can't prove an agentic system will always do the right thing.
This doesn't diminish model-level alignment research. Better models make every downstream system better. But model-level alignment alone cannot guarantee system-level correctness. Practitioners don't solve alignment—they engineer trust.
Intent is unobservable. A user's goal exists in their head. Every artifact the system produces, such as a spec, a plan, a design, or code, is a projection of that intent into a lower-fidelity representation. None of them fully captures the original; the full depth of intent never leaves the user's head. Each stage decompresses one compact representation into a more elaborate one: a one-sentence goal becomes a plan, the plan becomes a design, the design becomes code.
Verification, then, is not checking whether an artifact is "correct" in some absolute sense. It is checking whether each projection is consistent with the projections that came before it. Gates verify consistency across projections, never correspondence to intent itself. This distinction matters because it bounds what any verification pipeline can achieve, no matter how many gates you add. It is also why the system needs an escalation path.
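This projection view can be sketched as a tiny data model. Everything here is illustrative, not the author's implementation: a gate is just a predicate over two adjacent projections, and the toy consistency check is a crude omission detector.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Projection:
    """One stage's artifact: a lossy re-expression of upstream intent."""
    stage: str      # e.g. "goal", "plan", "design", "code"
    content: str

# A gate is a predicate over adjacent projections: it can check that the
# downstream artifact is consistent with the upstream one, but it never
# sees the original intent in the user's head.
Gate = Callable[[Projection, Projection], bool]

def covers_upstream_terms(upstream: Projection, downstream: Projection) -> bool:
    """Toy consistency check: every word of the upstream artifact
    appears somewhere downstream (a crude omission detector)."""
    return all(term in downstream.content for term in upstream.content.split())

omission_gate: Gate = covers_upstream_terms

goal = Projection("goal", "retry failed uploads")
plan = Projection("plan", "add retry loop for failed uploads with backoff")
assert omission_gate(goal, plan)  # the plan is consistent with the goal
```

Note what the gate cannot do: if "retry failed uploads" was itself a poor projection of what the user wanted, every check still passes.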
A trust topology is the arrangement of generators, verifiers, and context boundaries that determines which errors are observable, which are catchable, and which are recoverable. It has three design levers:
These levers shape the topology. Four diagnostic properties determine whether it actually works. The next slides unpack them.
What I Found
The majority of agent failures are mundane. Things left out, or things done consistently wrong.
A gate can only catch what it can see. File-scoped code review:
0% incoherence detection.
Over 97 days, I ran autonomous AI agents that ship production code. Four mandatory review gates stand between generated output and a shipped release, combining stochastic verifiers (an LLM judging whether work meets requirements) with deterministic checks (linting, tests, structural validation). The four gates are plan review, design review, file-scoped code review, and full-context code review.
I accumulated 5,109 gate checks across 8 projects and classified every one of the 1,450 genuine rejections. The full empirical analysis is published at michael.roth.rocks/research/gate-analysis/.
| Finding | Detail |
|---|---|
| Error taxonomy | Only 12.7% incoherent. 49% omissions, 38% systematic. The failures have structure. |
| Decomposition | 11-hour release arcs show 10% incoherence. Feature arcs (longer chains) show 19.8%. The system-level property beats the component-level property. |
| The trade-off | Decomposition converts incoherence into omission. Bounded contexts mean bounded awareness. Agents mostly forget rather than contradict. Omissions are the easiest error class to catch at a gate. |
| Gate specificity | Plan gates catch omissions (54%). Design gates catch systematic errors (48%). File-scoped code review catches 0% incoherence because its window is too narrow. |
The trade-off is directional. Decomposition converts incoherence into omission. Bounded contexts mean bounded awareness: an agent working within a single task cannot see decisions made in a sibling context. It can still contradict them by coincidence, but it cannot sustain the kind of compounding contradiction that emerges from extended reasoning over a drifting context. It forgets rather than contradicts. The gated workflow shows 49% omission versus 36% in unstructured population data (664 public sessions; Rothrock, 2026, slide 11), almost perfectly mirroring the incoherence reduction. This is a favorable trade. Omissions are the easiest error class to catch at a gate because a checklist can surface a missing test or an unhandled edge case. Incoherence requires the reviewer to hold two contradictory states in mind and recognize the conflict.
| Gate | Checks | Rejection Rate | Top Error Type | Incoherent |
|---|---|---|---|---|
| Plan | 1,193 | 61% | Omission (54%) | 10.5% |
| Design | 1,491 | 37% | Systematic (48%) | 15.6% |
| Code (file) | 340 | 40% | Systematic (56%) | 0% |
| Code (system) | 2,085 | 28% | Omission (55%) | 16% |
Trust Topology
Two properties determine whether gates compose or merely repeat.
Overlap ratio. If two gates reject the same artifacts 80% of the time, you don't have two gates. You have one gate that runs twice. The lower the overlap, the more each successive gate contributes.
Verification amplification. Upstream gates constrain what downstream gates must check. A weak upstream gate is the most expensive gap, because it passes flawed artifacts that waste cycles everywhere downstream.
"You need a different kind of gate, not more passes through the same one."
Each gate filters out incorrect artifacts, but if two gates catch the same errors, the second one contributes nothing. The overlap between gates is measurable: take the union of their rejection sets and compare it to the sum. If two gates reject the same artifacts 80% of the time, you don't have two gates. You have one gate that runs twice. The lower the overlap, the more each successive gate contributes to reducing the remaining error set.
The inference-scaling literature hits a version of this problem: Brown et al. found that common methods for picking correct solutions from many samples, like majority voting and reward models, plateau beyond several hundred samples. The framework predicts this. When your verification signals have high overlap, more samples cannot help. You need a different kind of gate, not more passes through the same one.
Error budget burns down.
Grey pillars show remaining errors · floating blocks show what each gate caught
The ghosted strikethrough at gate 3 shows the missing incoherent catch. Gate 4's red block compensates.
Upstream gates constrain the input to downstream gates. The plan gate has the highest rejection rate (61%) because it operates on the broadest scope: the first artifact created from expressed human intent. By the time work reaches the code review gate, the input has already been validated for intent, structure, and design. Each upstream gate reduces the burden on every gate that follows, because a downstream verifier checking against a well-formed plan can apply more specific predicates than one checking against a vague plan.
A weak upstream gate is the most expensive place to have a gap, because it passes flawed artifacts that waste cycles everywhere downstream. This is asymmetric and only flows forward.
Verification amplification explains why the process reward model literature (Lightman et al.) consistently finds that step-level verification outperforms outcome-only verification. Gating intermediate representations constrains what downstream steps can produce. This is process supervision at the system level: heterogeneous verifiers applied to pipeline stages rather than a learned reward model applied to reasoning steps.
Closing in on correctness.
Four concentric gates narrow the space of acceptable output
The broken third ring is the diagram's thesis: a gate can only catch what it can see.
Overlap ratio and verification amplification operate at different scales. Overlap ratio is a within-stage property: multiple checks on the same artifact, in the same representation space. You can directly compare what each check catches. Verification amplification is a between-stage property: plan gates filter plans, design gates filter designs, code gates filter code. These are different spaces. You cannot simply add up what they catch as if it were a single pool.
The way between-stage gates help each other is not by removing errors from a shared set, but by shaping what the next stage receives. A good plan gate means the design stage starts with better input, which means the design gate can check more specific things. Each gate improves the odds for the gates that follow.
Trust Topology
Two more properties bound what verification can and cannot achieve.
The deterministic ceiling. Deterministic checks provide hard guarantees. But structural correctness is not semantic correctness. Code can compile, pass all linting, conform to the schema, and still do the wrong thing. No amount of deterministic gating closes this gap.
The liveness constraint. Each gate narrows the space of acceptable output. If the gates collectively eliminate 99% of LLM output, the system will be stuck in retries.
"No amount of repeated sampling or verifier compute can push past the deterministic ceiling if the gates cannot observe the property you care about."
Every gate has an observability limit. Deterministic checks (valid JSON, compilable code, schema conformance) provide hard guarantees: if the output fails, it is provably wrong. But structural correctness is not semantic correctness. Code can compile, pass all linting checks, conform to the schema, and still do the wrong thing. The gap between structural and semantic verification is where the hardest residual errors live, and no amount of deterministic gating closes it.
An LLM verifier covers some of this gap because it can judge whether code does the right thing, not just whether it compiles. But it does so without formal guarantees. So reliability splits into two layers: a deterministic floor (provable—tests either pass or they don't) and a stochastic uplift (estimated—the LLM reviewer's judgment). The boundary between them is sharp and knowable.
This is the framework's hardest boundary. No amount of repeated sampling or verifier compute can push past the deterministic ceiling if the gates can't observe the property you care about. The ceiling is structural, not statistical. It is also what prevents the framework from claiming to solve alignment—it explicitly states the limits of what verification can achieve.
The ceiling splits verification.
What passes sharp, what scatters, and what's invisible
The deterministic beam passes sharp. The stochastic cone scatters. The unobservable zone is void.
There is a boundary even deeper than the deterministic ceiling. No downstream processing can recover more about intent than the specification contains. If the spec is ambiguous, incomplete, or wrong, perfect gates still produce wrong output with perfect consistency. The pipeline guarantees fidelity to specification, not fidelity to intent.
This is why oracle routing matters beyond operational convenience. Escalation to the human is the only path to get direct information about intent from the actual source. Each oracle response improves the fidelity of the specification to the actual intent by asking the person directly. Every other stage can only lose information about intent; escalation is the one mechanism that can recover it.
Each gate narrows the space of acceptable output. If the gates collectively eliminate 99% of potential LLM output, the system will be stuck in retries. You cannot achieve perfect reliability by adding gates. There is a practical limit, and finding it is an engineering problem.
The 55% first-pass approval rate in the data suggests the system is already operating with moderately tight acceptance sets. Adding another gate would increase correctness guarantees but decrease throughput. The design question is always: does this gate catch enough new errors to justify the liveness cost?
Correctness costs throughput.
The trade-off between verification strictness and system liveness
Correctness and throughput are opposing forces. The 4-gate topology is the operating point; one more gate tips toward retry storms.
| Property | Question It Answers |
|---|---|
| Overlap ratio | Are my gates catching different errors? |
| Verification amplification | Are upstream gates reducing downstream burden? |
| Deterministic ceiling | What can my gates actually prove? |
| Liveness constraint | Can the system still produce output? |
The Dynamics
Topologies evolve. The system learns without changing any model.
Oracle routing. The stochastic gate doesn't just verify; it triages. Issues classified as auto-fixable never reach the human. This is the mechanism that makes the three-tier architecture practical.
"What required LLM judgment last month becomes a regex this month."
Every gate operates in one of three regimes:
The tiers are not parallel tracks—the stochastic gate actively routes between them. The LLM reviewer classifies each issue it finds as either auto-fixable or requiring a human decision, and escalates accordingly. This is oracle routing: the engineering pattern that makes the three-tier architecture practical. Without it, the human would have to review everything. The four properties above are diagnostic: they tell you how to evaluate a topology. Oracle routing is prescriptive: it tells you how to build one that works.
The boundaries between these regimes are not fixed. When a semantic gate repeatedly rejects the same class of error, that pattern can be codified into a deterministic check. When a human repeatedly makes the same architectural decision through the oracle tier, that decision can be encoded as a semantic rule the LLM verifier applies automatically.
That migration is the central dynamic of the system. Reliability improves even if the models stay the same, because the verification topology is learning from operational experience. The trust surface grows over time.
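A minimal sketch of the migration mechanic, in the spirit of "LLM judgment becomes a regex." The threshold, tags, and patterns are all hypothetical, not taken from the author's system.

```python
import re
from collections import Counter

PROMOTION_THRESHOLD = 5   # hypothetical: promote after 5 identical rejections

semantic_rejections = Counter()   # rejection tag -> times seen
deterministic_checks = {}         # rejection tag -> compiled regex

def record_semantic_rejection(tag: str, pattern: str) -> None:
    """When the LLM reviewer keeps rejecting the same class of error,
    codify it as a deterministic check so that class of error is caught
    earlier, cheaper, and with a hard guarantee."""
    semantic_rejections[tag] += 1
    if semantic_rejections[tag] >= PROMOTION_THRESHOLD:
        deterministic_checks[tag] = re.compile(pattern)

# The reviewer flags f-string SQL five times; the pattern gets promoted:
for _ in range(5):
    record_semantic_rejection("raw-sql", r"cursor\.execute\(f[\"']")

assert "raw-sql" in deterministic_checks
assert deterministic_checks["raw-sql"].search('cursor.execute(f"SELECT {x}")')
```

The reverse direction (a lapsed check demoting back to the semantic tier) is just the absence of this promotion, which is why the boundary must be actively maintained.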
Boundaries migrate. Humans retreat.
How verification responsibility shifts between tiers over 17 weeks
The deterministic ceiling (teal) rises as tests accumulate. The stochastic ceiling (red curve) rises as the knowledge base grows. New features temporarily notch the deterministic ceiling.
This makes the framework predictive, not just descriptive. Given a topology, you can ask:
If a gate is removed or its enforcement lapses, the boundary contracts. Errors that were caught deterministically leak to the semantic tier, where they are caught probabilistically, at higher cost, and later in the pipeline. The boundary migrates in both directions. It must be actively maintained.
Architecture
Model size is primarily a liveness parameter. Once your gates are sound, bigger generators mostly buy throughput.
Pay large-model prices for a few hundred tokens of evaluation, not thousands of tokens of generation.
"This architecture inverts the industry's current scaling strategy. Everyone is building bigger generators. The framework says to build bigger verifiers instead."
A corollary of the framework is that model size is primarily a liveness parameter. Once your gates are sound for the properties they check, bigger generators mostly buy you throughput: a higher chance that a proposal clears the gates on the first try.
This has a precondition: the model must be capable of producing a good solution at least some of the time. When it can't, retries don't converge, they just burn time. When it can, the picture is simple: the gates define what "good enough" means, and the model keeps proposing until something passes.
The practical consequence is architectural. On tasks where a small model has a non-trivial chance of proposing a passing solution, put your compute budget in the verifier, not the generator:
Run the cheap generators in parallel. Different initializations produce diverse candidates. The deterministic gates filter most of them before the expensive verifier ever sees them. Wall-clock time drops. Cost drops.
Cross-family verification research confirms that using a different model family for verification produces better results than self-verification: correlated failure modes between generator and verifier are the enemy.
Correctness comes from what the gates can actually certify. Deterministic gates can prove structural facts: "valid JSON," "typechecks," "tests pass," "matches the schema." Past the deterministic ceiling, you are back in the world of judgment rather than guarantees, and model capacity becomes a correctness lever again because the system cannot fully observe the property you care about.
Apply This
The goal is not the perfect gate. It is a pipeline where the composition catches what individual gates cannot.
1. Identify your gates.
Plan review, design review, tests, linters. You likely already have them. Name them.
2. Split deterministic from stochastic.
For each gate, identify what can be checked mechanically (structure, syntax, schema) vs. what requires judgment (intent, quality, coherence). Automate the deterministic checks first.
3. Measure overlap.
Adjust until each gate catches errors the others miss. If two gates reject the same things, you have redundancy, not depth.
Start with the review steps you already have. If you review plans before writing code, that is a gate. If you review designs before implementing them, that is a gate. If you run tests and linters, those are deterministic sub-verifiers.
Formalize them. For each gate, identify what can be checked deterministically (structure, completeness, syntax, schema conformance) and what requires judgment (intent alignment, design quality, architectural coherence). Automate the deterministic checks first. Then add a stochastic verifier for the judgment calls. Measure the overlap ratio between gates and adjust until each one is catching errors the others miss.
You don't need data to start. The four properties work as design heuristics before they become measurements. You can reason structurally that a linter and a type checker have high overlap, while a plan review and a code review have low overlap because they see different artifact types. A legal document pipeline might have a citation validator (deterministic) and a reasoning-quality reviewer (stochastic). The same overlap question applies. Design the topology; measurement refines it once the system is running.
For the stochastic tier, classify what your LLM reviewer catches by error type. The gap between what deterministic gates prove and what gets escalated to the oracle is the stochastic verifier's contribution.
One frontier remains open: the revision cycle. When an agent's work fails a gate, the next attempt passes only 31% of the time. Agents generate well but revise poorly.
The framework assumes each attempt is independent. But after rejection, the agent tries again conditioned on feedback. That second attempt is not independent of the first. The feedback often steers the agent sideways rather than toward correctness. Whether the answer is a different agent, a different way of decomposing the feedback, or something else entirely is an open question. This is the weakest link in the pipeline and the most promising area for improvement.
A second frontier is training. The same verification topology that filters outputs at inference time can shape weight updates at training time. Reward models in RLHF are gates. Constitutional AI uses a stochastic sub-verifier during training. Process reward models gate intermediate reasoning steps. The formal vocabulary of overlap ratios, verification amplification, and the deterministic ceiling applies in both regimes. Whether the gate arrangement matters more than when you apply it is a question worth answering.
A third frontier is domain generalization. This framework was developed in software engineering, where verification infrastructure is mature. The abstract structure—intent projected through stages into artifacts, with gates checking consistency between projections—should apply wherever AI agents produce concrete output. Testing it in domains like legal reasoning, scientific analysis, or operational decision-making would validate or refine the four properties.
Reproducibility
The empirical analysis, methodology, and analysis tools are published alongside this framework.
| Resource | What It Contains |
|---|---|
| Gate Analysis | 5,109 checks, error taxonomy, decomposition data |
| 543 Hours | Workflow methodology, operational patterns |
| gate_analyzer.py | Run this on your own Claude Code logs to replicate |
```shell
pip install google-genai
export GEMINI_API_KEY=<key>

# Analyze your own Claude Code logs
python gate_analyzer.py discover         # Auto-discover gate tools
python gate_analyzer.py extract          # Extract gate checks
python gate_analyzer.py classify         # Classify decisions
python gate_analyzer.py classify-errors  # Classify error types
python gate_analyzer.py stats            # Summary statistics
```
I design, build, and deploy high-autonomy AI agent systems. This research comes from that practice. If you have interesting problems, I'd love to hear about them.