The Hot Mess
Field data from 5,109 cross-model review gates.
Anthropic's "Hot Mess of AI" paper argues that frontier AI failures are dominated by incoherence: random, contradictory errors rather than systematic mistakes. 97 days of field data from autonomous AI development tells a different story.
5,109 gate checks · 1,450 genuine rejections · 97 days
In 2023, Sohl-Dickstein proposed the "hot mess theory of AI misalignment": more intelligent agents behave less coherently. In 2026, Hägele et al. (Anthropic) operationalized this empirically, finding that extended reasoning increases incoherence: variance-driven, unpredictable failures rather than systematic pursuit of wrong goals.
Both frame this as a model problem. I provide a practitioner's response: it's a system problem, and practitioners have been engineering trustworthy systems from unreliable components for decades.
For 97 days (October 2025 – January 2026), I operated an autonomous AI development system: not a model in isolation, but an orchestrated pipeline in which Claude generates work, Gemini validates it through four mandatory review gates, and task decomposition ensures no single agent ever faces a long reasoning chain.
This gave me something the paper didn't have: structured feedback on every error an AI agent made in production work, within a system designed for reliability.
| Metric | Value |
|---|---|
| Study period | Oct 2, 2025 – Jan 2026 (97 days) |
| Total gate checks | 5,109 |
| Genuine rejections | 1,450 |
| Concurrent projects | 8 |
| Autonomous hours | 543 |
| Shipped releases | 165 |
I classified every one of those 1,450 rejections and correlated them with work complexity, gate type, and recovery outcomes. The results reframe the question from model alignment to system trust, and point to a practical solution.
This research extends the 543 Hours study. See that presentation for the full autonomous workflow methodology.
The Claim
The paper decomposes AI errors into bias and variance. Their finding: variance dominates.
Bias: Wrong but consistent. The model reliably pursues the wrong approach. Predictable. Fixable with better training.
Variance: Wrong and random. The model contradicts itself, handles things inconsistently. Unpredictable. The "hot mess."
"Scale alone will not solve reliability." Implication of Sohl-Dickstein (2023) and Hägele et al. (2026)
Sohl-Dickstein's proposed mechanism: as capability increases, agents explore more of the solution space, and that exploration produces incoherence. Hägele et al. (2026) operationalized this empirically, testing models on benchmarks.
If the paper is right, autonomous AI agents face a fundamental reliability ceiling. As tasks get harder and require more reasoning, the agent becomes increasingly unpredictable. You can't just use a better model because the problem is structural.
Both the theory and the empirical work frame the problem at the model level, as a property of the agent itself. This is the researcher's frame. The practitioner's frame is different.
Practitioners don't deploy models. They deploy systems: orchestration layers, review gates, task queues, process documentation, bounded contexts. The model is one component. The question isn't "is this model aligned?" Instead, it is "can I trust this system?"
Guaranteeing a model will always behave correctly is undecidable, in the same way you can't prove an arbitrary program bug-free. Practitioners have always known this. You don't solve it; you engineer around it with review, testing, and bounded blast radius. The question: does system-level trust engineering break the link between task complexity and incoherence?
Sohl-Dickstein, J. (2023). "The hot mess theory of AI misalignment." Blog post.
Hägele, A., et al. (2026). "The Hot Mess of AI." Anthropic Fellows Program. ICLR 2026.
The Dataset
Not two models, but one system. Four gates. Every review outcome recorded.
System, not model: Cross-model review is one component. Task decomposition, bounded contexts, and process documentation are the others.
| Gate | When | What It Verifies | Checks |
|---|---|---|---|
| review_plan | Before implementation | Approach aligns with project goals, no gaps | 1,193 |
| review_design | After design, before code | Design satisfies task acceptance criteria | 1,491 |
| codereview | During implementation | File-scoped code quality review (no project context) | 340 |
| review_code | After all work complete | Agentic verification against project plan and requirements—can search repo, pull in other files | 2,085 |
| Decision | Count | % |
|---|---|---|
| APPROVED | 2,817 | 55.1% |
| NEEDS_REVISION | 1,918 | 37.5% |
| ESCALATE | 201 | 3.9% |
| UNKNOWN | 173 | 3.4% |
I extracted review gate data from 3,119 Claude Code session JSONL files by matching tool_use blocks to their tool_result responses via tool_use_id. Gate tools were auto-discovered using Gemini to classify tool names from session metadata.
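The pairing step can be sketched roughly like this. It is a minimal illustration, not the actual gate_analyzer.py implementation: the real Claude Code session schema nests content differently, and the field names here are simplified assumptions.

```python
import json

def extract_gate_checks(jsonl_lines, gate_tools):
    """Pair each gate tool_use block with its tool_result via tool_use_id.

    jsonl_lines: iterable of raw JSONL record strings from a session file.
    gate_tools: set of tool names previously identified as review gates.
    """
    pending = {}  # tool_use_id -> gate tool name, awaiting its result
    checks = []   # (gate_name, result_text) pairs
    for line in jsonl_lines:
        record = json.loads(line)
        for block in record.get("content", []):
            if block.get("type") == "tool_use" and block.get("name") in gate_tools:
                pending[block["id"]] = block["name"]
            elif block.get("type") == "tool_result" and block.get("tool_use_id") in pending:
                checks.append((pending.pop(block["tool_use_id"]), block.get("content", "")))
    return checks
```

Streaming line by line keeps memory flat even across thousands of session files.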
Of the 1,918 NEEDS_REVISION decisions, 1,450 were genuine quality rejections; the remainder were infrastructure failures (e.g. API errors) rather than substantive feedback.
Most of the 201 ESCALATE decisions were also infrastructure failures rather than genuine quality escalations.
Decision classification used a two-pass approach: regex pattern matching (2,920 classified), then Gemini semantic evaluation (422 remaining). This reduced UNKNOWN from initial extraction to 3.4%.
The paper calls ensembling "impractical for irreversible agentic tasks." Cross-model review solves that: review the output with a different model, no re-execution needed. But ensembling alone is not the main finding. The system also decomposes work into bounded tasks, externalizes state into task queues and process docs, and enforces contracts at each gate. Reliability emerges from the system architecture, not from any single component.
Every one of these 5,109 gate checks is an entry in a trust ledger: empirical evidence about what works, what fails, and where. This isn't an alignment proof. It's the kind of engineering evidence practitioners use to answer: "can I trust this system?"
How I Classified
Three error types. Each with a different cause. And a different fix.
Systematic: Wrong but internally consistent. Misunderstood requirements, chose the wrong architecture, applied a pattern incorrectly. The work is coherent but incorrect.
Incoherent: Internally inconsistent. Handled something correctly in one place but not another. Contradicted its own plan. Random quality variation. The "hot mess."
Omission: Simply left out. Not wrong, not contradictory, just missing. A requirement was skipped or a component was not implemented.
| My Category | Paper's Term | Nature |
|---|---|---|
| Systematic | Bias | Predictable, consistent, fixable with better context |
| Incoherent | Variance | Unpredictable, contradictory, the "hot mess" |
| Omission | — | Neither bias nor variance, simply incomplete |
The paper's framework has two categories. I added a third, omission, because my data showed a large cluster of errors that were neither wrong nor contradictory, just incomplete.
Systematic errors are wrong, but the agent's work is internally consistent: it just chose the wrong approach.
Incoherent errors mean the agent knew the right thing, and did it elsewhere, but failed to apply it consistently.
Omissions leave nothing wrong with what was built; there's just a gap.
All 1,450 genuine rejections were classified by Gemini Flash Lite with a structured prompt requiring exactly one label: SYSTEMATIC, INCOHERENT, or OMISSION. Spot-checking showed reasonable agreement with human judgment.
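The classification step can be sketched as prompt construction plus strict label parsing. The prompt wording below is illustrative, not the exact prompt used; the actual call goes to Gemini Flash Lite via the google-genai SDK, which is stubbed out here.

```python
ERROR_LABELS = ("SYSTEMATIC", "INCOHERENT", "OMISSION")

def build_prompt(rejection_text):
    """Structured prompt demanding exactly one taxonomy label (illustrative wording)."""
    return (
        "Classify this review rejection with exactly one label.\n"
        "SYSTEMATIC: wrong but internally consistent (wrong approach, coherently executed).\n"
        "INCOHERENT: internally inconsistent (contradicts its own plan or other parts).\n"
        "OMISSION: something required was simply left out.\n"
        f"Rejection:\n{rejection_text}\n"
        "Answer with one word: SYSTEMATIC, INCOHERENT, or OMISSION."
    )

def parse_label(model_reply):
    """Accept only an exact label; anything else is flagged for re-query or manual review."""
    word = model_reply.strip().upper().rstrip(".")
    return word if word in ERROR_LABELS else None
```

Strict parsing matters: a free-form reply that merely mentions a label is rejected rather than guessed at, which keeps the taxonomy counts honest.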
The Results
The dominant failure mode isn't the "hot mess." It's forgetting things entirely.
"Agents don't usually produce wrong or contradictory work. They produce incomplete work."
| Error Type | Count | % |
|---|---|---|
| Omission | 716 | 49.4% |
| Systematic | 550 | 37.9% |
| Incoherent | 184 | 12.7% |
These 1,450 rejections form a trust ledger. Each is empirical evidence about what goes wrong when AI agents do real work within an engineered system. The data answers a practitioner's question: "can I trust this system?", not a researcher's question: "is this model aligned?"
The theory's predictions hold for models in isolation. Within an engineered system, the picture is different:
Nearly half of all errors are omissions where the agent simply left something out. A requirement was skipped, a component wasn't implemented, etc. This is a solvable problem: checklists and review gates catch omissions reliably.
Another 38% are systematic — wrong approach, coherently executed. Together, 87% of errors are the kind you can catch with structured process.
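Omission-catching is mechanical once requirements are explicit. A minimal sketch of the idea (the real gates do this through review prompts, not literal set arithmetic):

```python
def check_completeness(plan_items, completed_items):
    """Compare a plan's checklist against reported completions.

    Omissions are not wrong or contradictory work; they are plan items
    with no corresponding completion at all.
    """
    completed = {item.strip().lower() for item in completed_items}
    missing = [item for item in plan_items if item.strip().lower() not in completed]
    return {"complete": not missing, "missing": missing}
```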
| Claim | Paper Predicts | My Data Shows |
|---|---|---|
| Dominant error type | Incoherence (variance) | Omission (49%) |
| Incoherence prevalence | Majority of errors | 12.7% of errors |
| Error predictability | Low (random failures) | High (87.3% predictable) |
Decomposition vs Extended Reasoning
The paper predicts 11-hour release arcs should show the highest incoherence. They show among the lowest, because no single agent ever reasons for 11 hours.
"Decomposition breaks the link between task complexity and reasoning chain length, the underlying mechanism the paper identifies as driving incoherence."
Sohl-Dickstein's mechanism: extended reasoning → more self-contradiction → incoherence scales with chain length. System decomposition changes the operating regime.
Release arcs are the longest by far, averaging 11 hours of total work. The theory predicts they should show the highest incoherence. They show among the lowest (10.0%). Build arcs (9.4%) are similarly low. Feature arcs, where agents work in longer, less decomposed chains, show the highest (19.8%).
The key: no single agent in a release arc reasons for 11 hours. An orchestrator decomposes the work into bounded tasks, each handled by a fresh agent with a scoped context. The "long run" is emergent from many short runs.
Feature arcs involve novel functionality such as new APIs, new UI components, or new integrations. They are the least decomposed work type: a single agent often carries a feature from start to finish, reasoning over a longer chain in unfamiliar territory. This is the exact scenario the paper predicts will produce the most incoherence, and it does.
Build and release arcs use the "burn down" pattern: an orchestrator works through a task queue, dispatching each bounded task to a fresh agent. That decomposition is what achieves the low incoherence.
The system prevents any single model from accumulating state across an 11-hour reasoning chain because the system never asks it to. Incoherence is a model property; decomposition is a system property. The system sidesteps the model's limitation.
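The burn-down shape can be sketched as an orchestrator loop: each bounded task gets a fresh agent call with only its scoped context, so no agent's reasoning chain spans the whole arc. Agent and reviewer internals are stubbed as callables here; this is a structural sketch, not the production orchestrator.

```python
def run_arc(tasks, run_agent, review, max_revisions=2):
    """Work through a task queue; each bounded task gets a fresh agent call.

    The long arc is emergent from many short runs: no single agent
    accumulates state across the whole chain.
    """
    shipped = []
    for task in tasks:
        work = run_agent(task)  # fresh, scoped context per task
        for _ in range(max_revisions):
            verdict = review(work)
            if verdict == "APPROVED":
                break
            # Revision stays inside the bounded context of this one task
            work = run_agent(task, feedback=verdict)
        shipped.append(work)
    return shipped
```

The key design choice: state lives in the task queue, not in any agent's context window.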
| Agents | Incoherent | Systematic | Omission | n |
|---|---|---|---|---|
| 0–2 | 12.6% | 34.3% | 53.1% | 983 |
| 3–5 | 16.3% | 40.4% | 43.3% | 104 |
| 6–10 | 13.5% | 44.2% | 42.3% | 52 |
| 11–20 | 14.5% | 30.6% | 54.8% | 62 |
| 20+ | 10.2% | 40.8% | 49.0% | 49 |
Incoherence rates are roughly flat at 10–16% regardless of how many agents are involved. The paper's concern that multi-agent systems compound incoherence is not supported, though neither is a strong claim that more agents help.
| Arc Type | Rejection Rate | n |
|---|---|---|
| Interactive | 36.5% | 3,249 |
| Build | 42.2% | 455 |
| Quick | 42.9% | 233 |
| Feature | 49.3% | 294 |
| Release | 52.4% | 126 |
Release arcs have the highest rejection rate (52.4%) despite the lowest incoherence. The gates are doing real work on the hardest tasks, catching systematic and omission errors before they compound.
Where Errors Live
The plan gate catches the most errors. By the time work reaches code review, most issues are gone.
The shift-left effect: the plan gate does the heavy lifting, rejecting 61% of plans before any implementation work begins.
Each gate catches a different error profile:
Plans have the lowest incoherence rate (10.5%) while code reviews have the highest (16.0%). The dominant failure mode across all gates is omission, where agents forget things rather than contradict themselves.
The two code review tools have fundamentally different scope:
Incoherence is a cross-context problem: implementation contradicts the plan, or file A handles something differently than file B. A file-scoped review literally cannot see these contradictions. It can only catch systematic errors (bad patterns within a file) and omissions (missing pieces within a file). Detecting incoherence requires system-level observability: what you can see depends on where you sit in the system, and a component with a narrow interface can only catch errors within its bounded context.
| Gate | First-Pass Approval | Rejection Rate |
|---|---|---|
| review_plan | 39.2% | 60.8% |
| review_design | 62.6% | 37.4% |
| review_code | 60.5% | 39.5% |
| codereview | 72.3% | 27.7% |
The plan gate catches the most errors—60.8% rejection rate—filtering problems before they reach downstream gates. Design and code reviews see progressively lower rejection rates as upstream gates have already caught the worst issues.
If you only have budget for one review gate, make it plan review. It catches the most errors at the lowest cost, before any code is written. The 61% rejection rate at the plan stage is a positive finding, because this is the least expensive place to catch bugs.
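The shift-left numbers are simple frequencies over the ledger. A sketch of the computation, assuming (gate, decision) pairs as the ledger rows:

```python
from collections import Counter

def gate_rejection_rates(checks):
    """checks: iterable of (gate_name, decision) pairs from the trust ledger."""
    totals, rejected = Counter(), Counter()
    for gate, decision in checks:
        totals[gate] += 1
        if decision == "NEEDS_REVISION":
            rejected[gate] += 1
    return {gate: rejected[gate] / totals[gate] for gate in totals}
```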
Each rejection is an entry in the trust ledger, providing empirical evidence about where this system catches errors and what kinds it catches. Instead of trying to prove alignment, the system is accumulating evidence of trustworthiness, gate by gate.
The Real Hot Mess
The genuine incoherence signal doesn't appear in the initial work; it manifests in the process that follows rejection.
31.5% recovery rate after rejection
| After NEEDS_REVISION... | % |
|---|---|
| Then APPROVED | 31.5% |
| Then REJECTED AGAIN | 54.8% |
| Then ESCALATED / OTHER | 13.7% |
When an agent's work fails review, the next attempt passes only 31.5% of the time. 55% of the time, it fails again. Recovery varies by error type, though the overall rate is low:
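Recovery is measured over ordered per-task decision sequences: after each NEEDS_REVISION, look at what the next decision was. A minimal sketch:

```python
def recovery_rate(decisions):
    """decisions: one task's gate decisions in chronological order.

    Returns the fraction of NEEDS_REVISION decisions whose immediate
    follow-up was APPROVED, or None if there were no retries to measure.
    """
    retries = approvals = 0
    for prev, nxt in zip(decisions, decisions[1:]):
        if prev == "NEEDS_REVISION":
            retries += 1
            approvals += (nxt == "APPROVED")
    return approvals / retries if retries else None
```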
| Error Type | Recovery Rate | n |
|---|---|---|
| Incoherent | 45.0% | 20 |
| Omission | 36.5% | 85 |
| API Error | 37.3% | 67 |
| Systematic | 31.0% | 100 |
Incoherent errors have the highest recovery rate (45%, though n=20 is small), possibly because the agent already knows the right approach but applied it inconsistently. Systematic errors are hardest to recover from (31%), since they require a fundamentally different approach.
The initial work is reasonably good and the system keeps incoherence to 12.7%. But the revision cycle is where the system's weakest interface lives:
This is a state handoff failure at a system boundary. The system decomposes initial work well, but the revision interface, where review feedback must cross into the agent's next attempt, is the weakest link in the pipeline.
Three Takeaways
The theory is right: extended reasoning produces incoherence. But the answer may lie not in fixing the model, but in building systems that avoid the failure mode.
1. The theory is right—for models. Build systems.
11-hour release arcs show 10% incoherence — because no single agent reasons for 11 hours. Decomposition sidesteps the mechanism.
2. Engineer trust, not alignment.
Alignment is undecidable. Trust is measurable. 5,109 gate checks are a trust ledger providing empirical evidence the system works.
3. Check for completeness, not just coherence.
49% of errors are omissions where the model simply left things out. Checklists and review gates catch the biggest error class.
Sohl-Dickstein proposed the theory. Hägele et al. measured it on benchmarks. My data shows what happens when you deploy models within an engineered system.
| Paper's Setup | My Setup |
|---|---|
| Single model, single task | Multiple models, cross-validation |
| Extended reasoning in one context | Decomposed into bounded tasks with fresh contexts |
| No external structure | Task queues with dependencies |
| No process documentation | Agents read process docs first |
| No review gates | Four mandatory review gates |
| Free-form reasoning | Scoped, restricted agents |
The hot mess theory is right about its core mechanism: extended reasoning produces incoherence. But this is a model property, and practitioners don't deploy models, they deploy systems.
Guaranteeing a model will always behave correctly is undecidable in the same way you can't prove a program is bug-free (the halting problem). Framed this way, the alignment problem is unsolvable. Practitioners already know this. You don't solve it. You engineer trust.
Apollo 11 didn't work because every component was formally verified. It worked because NASA engineered a system where failures were caught, contained, and recoverable. The astronauts trusted the system, not the code. The 5,109 gate checks are a trust ledger, the same kind of engineering evidence. Not a proof of alignment, but empirical evidence that the system works.
If you're deploying AI agents, invest in trust engineering: cross-model review gates (plan review first), task decomposition into bounded contexts, completeness checks for the dominant omission class, and a recorded ledger of every review outcome.
Reproducibility
The analysis tool is generic so you can run it on your own Claude Code session data to replicate the study.
| Component | Tool |
|---|---|
| Gate extraction | gate_analyzer.py: streaming JSONL parser |
| Decision classification | Regex patterns + Gemini Flash Lite |
| Error classification | Gemini Flash Lite (structured prompt) |
| Correlation analysis | SQLite joins across databases |
| Arc classification | arc_analytics.db (13,049 arcs from 45 sessions) |
Foundation study: 543 Hours of Autonomous Work in 97 Days
```sh
pip install google-genai
export GEMINI_API_KEY=<key>

# Analyze your own Claude Code logs (defaults to ~/.claude/projects)
python gate_analyzer.py discover        # Auto-discover gate tools
python gate_analyzer.py extract         # Extract gate checks
python gate_analyzer.py classify        # Classify decisions
python gate_analyzer.py classify-errors # Classify error types
python gate_analyzer.py stats           # Summary statistics
python gate_analyzer.py error-analysis  # Arc correlations

# Or point at a specific directory
python gate_analyzer.py extract --source-dir /path/to/jsonl/logs
```
| Source | Size | Content |
|---|---|---|
| Session JSONL files | 3,119 files | Not public (contains proprietary project content) |
| gate_analytics.db | ~5 MB | 5,109 extracted gate checks |
| arc_analytics.db | ~8 MB | 13,049 classified work arcs |
The raw session data is not publicly available, but the tool is generic so anyone using Claude Code with review gate MCP tools can reproduce this analysis on their own logs.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
Addendum: Population Test
I tested the taxonomy against DataClaw's public dataset. Unstructured work confirms the hot mess hypothesis.
| Error Type | Population | Gated | Delta |
|---|---|---|---|
| Systematic | 39.3% | 37.9% | +1.4pp |
| Incoherent | 25.2% | 12.7% | +12.5pp |
| Omission | 35.5% | 49.4% | -13.9pp |
Incoherence scales with length in unstructured work (23% → 27%), but decomposition flattens the slope (2.7pp vs 8.1pp increase).
664 public AI coding sessions from the DataClaw project (Pete O'Mallet). Sessions span Dec 2025 – Feb 2026, three contributors, 85% Opus-class models with additional coverage of GPT-5, Kimi, MiniMax, and GLM. Each session includes timestamped messages and tool call sequences. Errors classified by Gemini using the same SYSTEMATIC/INCOHERENT/OMISSION taxonomy.
| Session Length | Errors | Incoherent | Rate |
|---|---|---|---|
| 10-29 turns | 172 | 40 | 23.3% |
| 30-99 turns | 451 | 107 | 23.7% |
| 100-299 turns | 842 | 210 | 24.9% |
| 300+ turns | 819 | 219 | 26.7% |
| Length | Unstructured | Decomposed | Δ |
|---|---|---|---|
| Short (< 30) | 23.0% | 22.0% | +1.0pp |
| Medium (30-99) | 25.3% | 22.5% | +2.8pp |
| Long (100+) | 31.1% | 24.7% | +6.4pp |
Unstructured sessions see an 8.1pp increase from short to long. Decomposed sessions see 2.7pp. The gap widens with length because task decomposition moderates the relationship between duration and incoherence.
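The moderation claim reduces to a difference of short-to-long increases, using the rates from the table above:

```python
def short_to_long_increase(rates_by_length):
    """rates_by_length: incoherence rates (%) keyed by length bucket."""
    return rates_by_length["long"] - rates_by_length["short"]

# Rates taken from the length-bucket table above
unstructured = {"short": 23.0, "medium": 25.3, "long": 31.1}
decomposed = {"short": 22.0, "medium": 22.5, "long": 24.7}
# Unstructured rises 8.1pp, decomposed 2.7pp: decomposition flattens the slope
```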
| Error Type | Population | Gate-Mediated |
|---|---|---|
| Systematic | 57.2% | 31.0% |
| Incoherent | 56.9% | 45.0% |
| Omission | 43.3% | 36.5% |
Population recovery is higher because humans correct obvious errors. Gates catch cross-context issues that humans don't surface, likely a harder class of errors.
The single-operator limitation is partially addressed. The same directional findings appear in independent data from different practitioners: incoherence is elevated without structure, and decomposition moderates the length-incoherence relationship. The gated workflow doesn't just reduce errors, it specifically suppresses the incoherent ones.