The Hot Mess

Anthropic says AI agents are incoherent.
I engineered around it.

Field data from 5,109 cross-model review gates.

Anthropic's "Hot Mess of AI" paper argues that frontier AI failures are dominated by incoherence: random, contradictory errors rather than systematic mistakes. 97 days of field data from autonomous AI development tells a different story.

5,109 gate checks  ·  1,450 genuine rejections  ·  97 days

The Context

In 2023, Sohl-Dickstein proposed the "hot mess theory of AI misalignment": more intelligent agents behave less coherently. In 2026, Hägele et al. (Anthropic) operationalized this empirically, finding that extended reasoning increases incoherence: variance-driven, unpredictable failures rather than systematic pursuit of wrong goals.

Both frame this as a model problem. I provide a practitioner's response: it's a system problem, and practitioners have been engineering trustworthy systems from unreliable components for decades.

Why I Could Test This

For 97 days (October 2025 – January 2026), I operated an autonomous AI development system. Not just a model in isolation, but an orchestrated pipeline: Claude generates work, Gemini validates it through four mandatory review gates, and task decomposition ensures no single agent ever faces a long reasoning chain.

This gave me something the paper didn't have: structured feedback on every error an AI agent made in production work, within a system designed for reliability.

Metric | Value
Study period | Oct 2, 2025 – Jan 2026 (97 days)
Total gate checks | 5,109
Genuine rejections | 1,450
Concurrent projects | 8
Autonomous hours | 543
Shipped releases | 165

This Presentation

I classified every one of those 1,450 rejections and correlated them with work complexity, gate type, and recovery outcomes. The results reframe the question from model alignment to system trust, and point to a practical solution.

This research extends the 543 Hours study. See that presentation for the full autonomous workflow methodology.

The Claim

As AI reasons longer, errors get
random, not wrong.

The paper decomposes AI errors into bias and variance. Their finding: variance dominates.

Bias (Systematic)

Wrong but consistent. The model reliably pursues the wrong approach. Predictable. Fixable with better training.

Variance (Incoherent)

Wrong and random. The model contradicts itself, handles things inconsistently. Unpredictable. The "hot mess."

"Scale alone will not solve reliability." Implication of Sohl-Dickstein (2023) and Hägele et al. (2026)

The Theory

In 2023, Sohl-Dickstein proposed the "hot mess theory of AI misalignment": more intelligent agents behave less coherently. As capability increases, agents explore more of the solution space, and that exploration produces incoherence. Hägele et al. (2026) operationalized this empirically, testing models on benchmarks:

  1. Extended reasoning increases incoherence. As models think longer, they don't just get more wrong, they get more randomly wrong.
  2. Model scale has a complex relationship with coherence. Bigger models aren't simply more coherent.
  3. Natural "overthinking" spikes incoherence dramatically. When models ruminate, quality degrades unpredictably.
  4. Ensembling mitigates incoherence, but the paper calls it "impractical for irreversible agentic tasks."

Why This Matters for Agents

If the paper is right, autonomous AI agents face a fundamental reliability ceiling. As tasks get harder and require more reasoning, the agent becomes increasingly unpredictable. You can't just use a better model because the problem is structural.

The Gap: Models vs Systems

Both the theory and the empirical work frame the problem at the model level, as a property of the agent itself. This is the researcher's frame. The practitioner's frame is different.

Practitioners don't deploy models. They deploy systems: orchestration layers, review gates, task queues, process documentation, bounded contexts. The model is one component. The question isn't "is this model aligned?" Instead, it is "can I trust this system?"

Guaranteeing that a model will always behave correctly is undecidable, in the same way you can't prove an arbitrary program bug-free. Practitioners have always known this. You don't solve it; you engineer around it with review, testing, and bounded blast radius. The question: does system-level trust engineering break the link between task complexity and incoherence?

References

Sohl-Dickstein, J. (2023). "The hot mess theory of AI misalignment." Blog post.
Hägele, A., et al. (2026). "The Hot Mess of AI." Anthropic Fellows Program. ICLR 2026.

The Dataset

5,109 quality checks.
Two models. Four gates.

Not two models, but one system. Four gates. Every review outcome recorded.

5,109 total gate checks
55% approved first pass
1,450 genuine rejections
8 concurrent projects

System, not model: Cross-model review is one component. Task decomposition, bounded contexts, and process documentation are the others.

The Four Review Gates

Gate | When | What It Verifies | Checks
review_plan | Before implementation | Approach aligns with project goals, no gaps | 1,193
review_design | After design, before code | Design satisfies task acceptance criteria | 1,491
codereview | During implementation | File-scoped code quality review (no project context) | 340
review_code | After all work complete | Agentic verification against project plan and requirements; can search repo, pull in other files | 2,085

Decision Distribution

Decision | Count | %
APPROVED | 2,817 | 55.1%
NEEDS_REVISION | 1,918 | 37.5%
ESCALATE | 201 | 3.9%
UNKNOWN | 173 | 3.4%

Data Extraction

I extracted review gate data from 3,119 Claude Code session JSONL files by matching tool_use blocks to their tool_result responses via tool_use_id. Gate tools were auto-discovered using Gemini to classify tool names from session metadata.
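The pairing step can be sketched as follows. This is a minimal illustration, not the study's gate_analyzer.py; the field names (type, message.content, tool_use_id) follow the Claude Code session layout this study assumes and may need adjusting for other log formats.

```python
import json

def extract_gate_checks(jsonl_path, gate_tools):
    """Pair tool_use blocks with their tool_result responses via tool_use_id.

    Assumed layout: each JSONL line has a "message" whose "content" is a
    list of blocks; gate invocations are tool_use blocks, responses are
    tool_result blocks carrying the matching tool_use_id.
    """
    pending = {}  # tool_use_id -> gate tool_use block awaiting its result
    checks = []
    with open(jsonl_path) as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or corrupt lines
            content = (record.get("message") or {}).get("content")
            if not isinstance(content, list):
                continue
            for block in content:
                if block.get("type") == "tool_use" and block.get("name") in gate_tools:
                    pending[block["id"]] = block
                elif block.get("type") == "tool_result":
                    use = pending.pop(block.get("tool_use_id"), None)
                    if use is not None:
                        checks.append({"gate": use["name"],
                                       "input": use.get("input"),
                                       "result": block.get("content")})
    return checks
```

Unmatched tool_use blocks (no result in the same file) are simply left in `pending` and dropped, which is one source of the UNKNOWN decisions.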

Data Quality

Of the 1,918 NEEDS_REVISION decisions:

  • 456 were API errors or infrastructure failures (classified as API_ERROR by Gemini)
  • 12 could not be classified
  • Net genuine rejections: 1,450

Most of the 201 ESCALATE decisions were also infrastructure failures rather than genuine quality escalations.

Decision classification used a two-pass approach: regex pattern matching (2,920 classified), then Gemini semantic evaluation (422 remaining). This reduced UNKNOWN from initial extraction to 3.4%.
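A minimal sketch of the two-pass structure, with illustrative patterns only (the study's actual regexes are not published) and the semantic pass injected as a callable so any model client can back it:

```python
import re

# Illustrative patterns; real extraction would cover more phrasings.
DECISION_PATTERNS = [
    (re.compile(r"\bNEEDS[_ ]REVISION\b", re.I), "NEEDS_REVISION"),
    (re.compile(r"\bESCALATE\b", re.I), "ESCALATE"),
    (re.compile(r"\bAPPROVED\b", re.I), "APPROVED"),
]

def classify_decision(result_text, llm_classify=None):
    """Pass 1: regex match. Pass 2: semantic fallback (e.g. a Gemini call)."""
    for pattern, label in DECISION_PATTERNS:
        if pattern.search(result_text or ""):
            return label
    if llm_classify is not None:
        return llm_classify(result_text)  # expected to return a label or "UNKNOWN"
    return "UNKNOWN"
```

Running the cheap regex pass first is what keeps the LLM pass down to the 422 hard cases.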

Why a System, Not Just Cross-Model?

The paper calls ensembling "impractical for irreversible agentic tasks." Cross-model review solves that: a different model reviews the output, with no re-execution needed. But ensembling alone is not the whole system. The system also decomposes work into bounded tasks, externalizes state into task queues and process docs, and enforces contracts at each gate. Reliability emerges from the system architecture, not from any single component.

Every one of these 5,109 gate checks is an entry in a trust ledger: empirical evidence about what works, what fails, and where. This isn't an alignment proof. It's the kind of engineering evidence practitioners use to answer: "can I trust this system?"

How I Classified

Every rejection tells you
what went wrong.

Three error types. Each with a different cause. And a different fix.

Systematic

Wrong but internally consistent. Misunderstood requirements, chose the wrong architecture, applied a pattern incorrectly. The work is coherent but incorrect.

Incoherent

Internally inconsistent. Handled something correctly in one place but not another. Contradicted its own plan. Random quality variation. The "hot mess."

Omission

Simply left out. Not wrong, not contradictory, just missing. A requirement was skipped or a component was not implemented.

Mapping to the Paper's Framework

My Category | Paper's Term | Nature
Systematic | Bias | Predictable, consistent, fixable with better context
Incoherent | Variance | Unpredictable, contradictory, the "hot mess"
Omission | (none) | Neither bias nor variance, simply incomplete

The paper's framework has two categories. I added a third, omission, because my data showed a large cluster of errors that were neither wrong nor contradictory, just incomplete.

Real Examples from Feedback

Systematic

  • "Critical Architectural Mismatch between Go/TypeScript and Firestore indexes"
  • "Testing with in-memory fake doesn't verify GCS SDK interaction"
  • "Using polling when the API supports webhooks"

Each is wrong, but the agent's work is internally consistent. It just chose the wrong approach.

Incoherent

  • Error handling present in one endpoint but missing in an identical one
  • Plan says "use Redis" but design specifies in-memory cache
  • Consistent logging in 3 of 4 services, none in the 4th

The agent knew the right thing and did it elsewhere, but failed to apply it consistently.

Omission

  • "Missing IP extraction strategy"
  • "No monitoring section in the design"
  • "Test cases for edge conditions not implemented"

Nothing wrong with what was built, there's just a gap.

Classification Method

Each genuine rejection was classified by Gemini Flash Lite with a structured prompt requiring exactly one label: SYSTEMATIC, INCOHERENT, or OMISSION. I classified all 1,450 genuine rejections. Spot-checking showed reasonable agreement with human judgment.
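A sketch of what such a classifier can look like. The prompt wording is mine, not the study's, and the model call is injected as a callable; in practice a thin wrapper around google-genai's `client.models.generate_content` would fill that slot.

```python
ERROR_LABELS = ("SYSTEMATIC", "INCOHERENT", "OMISSION")

# Hypothetical prompt; the study's actual structured prompt is not published.
PROMPT_TEMPLATE = """You are classifying a review-gate rejection.
Reply with exactly one label: SYSTEMATIC, INCOHERENT, or OMISSION.

SYSTEMATIC: wrong but internally consistent (wrong approach, wrong pattern).
INCOHERENT: internally inconsistent (contradicts its own plan, uneven quality).
OMISSION: something required is simply missing.

Rejection feedback:
{feedback}
"""

def classify_error(feedback, call_model):
    """call_model: any text-in/text-out function (the study used Gemini Flash Lite)."""
    raw = call_model(PROMPT_TEMPLATE.format(feedback=feedback)).strip().upper()
    for label in ERROR_LABELS:          # tolerate verbose model replies
        if label in raw:
            return label
    return "UNKNOWN"
```

Scanning the reply for a known label, rather than requiring an exact match, keeps the classifier robust to models that answer in a sentence.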

The Results

Only 13% are incoherent.
Agents forget, not contradict.

The dominant failure mode isn't the "hot mess." It's forgetting things entirely.

Omission: 49.4% · Systematic: 37.9% · Incoherent: 12.7%

"Agents don't usually produce wrong or contradictory work. They produce incomplete work."

Full Distribution

Error Type | Count | %
Omission | 716 | 49.4%
Systematic | 550 | 37.9%
Incoherent | 184 | 12.7%

What This Means

These 1,450 rejections form a trust ledger. Each is empirical evidence about what goes wrong when AI agents do real work within an engineered system. The data answers a practitioner's question: "can I trust this system?", not a researcher's question: "is this model aligned?"

The theory's predictions hold for models in isolation. Within an engineered system, the picture is different:

  • 87.3% of errors are predictable: either omission (missing component) or systematic (wrong approach)
  • Only 12.7% match the "hot mess" pattern: internal contradictions, inconsistent quality
  • Omissions are 4× more common than incoherence

The Finding: Agents Forget, Not Contradict

Nearly half of all errors are omissions where the agent simply left something out. A requirement was skipped, a component wasn't implemented, etc. This is a solvable problem: checklists and review gates catch omissions reliably.

Another 38% are systematic — wrong approach, coherently executed. Together, 87% of errors are the kind you can catch with structured process.

Comparison to the Paper's Prediction

Claim | Paper Predicts | My Data Shows
Dominant error type | Incoherence (variance) | Omission (49%)
Incoherence prevalence | Majority of errors | 12.7% of errors
Error predictability | Low (random failures) | High (87.3% predictable)

Decomposition vs Extended Reasoning

Don't reason longer.
Decompose instead.

The paper predicts 11-hour release arcs should show the highest incoherence. They show among the lowest, because no single agent ever reasons for 11 hours.

Incoherence rate by arc type (chart, rounded):
  • Feature: 20%
  • Quick: 18%
  • Interactive: 12%
  • Release: 10%
  • Build: 9%
"Decomposition breaks the link between task complexity and reasoning chain length, the underlying mechanism the paper identifies as driving incoherence."

The Headline Finding

Sohl-Dickstein's mechanism: extended reasoning → more self-contradiction → incoherence scales with chain length. System decomposition changes the operating regime.

Release arcs are the longest by far, averaging 11 hours of total work. The theory predicts they should show the highest incoherence. They show among the lowest (10.0%). Build arcs (9.4%) are similarly low. Feature arcs, where agents work in longer, less decomposed chains, show the highest (19.8%).

The key: no single agent in a release arc reasons for 11 hours. An orchestrator decomposes the work into bounded tasks, each handled by a fresh agent with a scoped context. The "long run" is emergent from many short runs.

Why Feature Arcs Are the Worst

Feature arcs involve novel functionality such as new APIs, new UI components, or new integrations. They are the least decomposed work type: a single agent often carries a feature from start to finish, reasoning over a longer chain in unfamiliar territory. This is the exact scenario the paper predicts will produce the most incoherence, and it does.

The Decomposition Mechanism

Build and release arcs use the "burn down" pattern, which achieves low incoherence through decomposition:

  • Task queue with explicit dependencies: complex work broken into bounded units
  • Fresh agent per task: each worker starts with a clean context, no accumulated confusion
  • Process documentation: agents read methodology before starting, not after struggling
  • Four mandatory review gates: catch errors between tasks, before they compound

The system prevents any single model from accumulating state across an 11-hour reasoning chain because the system never asks it to. Incoherence is a model property; decomposition is a system property. The system sidesteps the model's limitation.
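The burn-down pattern above can be sketched as a small orchestration loop. This is a structural illustration under stated assumptions, not the production orchestrator: `run_agent` and `review_gate` stand in for a fresh agent invocation and a cross-model review.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)  # names of prerequisite tasks
    done: bool = False

def run_decomposed(tasks, run_agent, review_gate, max_attempts=3):
    """Burn-down loop: each ready task gets a fresh agent call (no
    accumulated context) and must pass a review gate before the queue
    advances. Failures escalate instead of compounding."""
    by_name = {t.name: t for t in tasks}
    while not all(t.done for t in tasks):
        ready = [t for t in tasks if not t.done
                 and all(by_name[d].done for d in t.deps)]
        if not ready:
            raise RuntimeError("dependency cycle or unsatisfiable deps")
        for task in ready:
            for _ in range(max_attempts):
                output = run_agent(task)       # fresh, bounded context
                if review_gate(task, output):  # gate between tasks
                    task.done = True
                    break
            else:
                raise RuntimeError(f"escalate: {task.name} failed review")
    return tasks
```

The key property is that no reasoning chain outlives one task: the "11-hour run" is the loop's emergent behavior, never any single `run_agent` call.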

Agent Count: No Meaningful Effect

Agents | Incoherent | Systematic | Omission | n
0–2 | 12.6% | 34.3% | 53.1% | 983
3–5 | 16.3% | 40.4% | 43.3% | 104
6–10 | 13.5% | 44.2% | 42.3% | 52
11–20 | 14.5% | 30.6% | 54.8% | 62
20+ | 10.2% | 40.8% | 49.0% | 49

Incoherence rates are roughly flat at 10–16% regardless of how many agents are involved. The paper's concern that multi-agent systems compound incoherence is not supported, though neither is a strong claim that more agents help.

Rejection Rate by Complexity

Arc Type | Rejection Rate | n
Interactive | 36.5% | 3,249
Build | 42.2% | 455
Quick | 42.9% | 233
Feature | 49.3% | 294
Release | 52.4% | 126

Release arcs have the highest rejection rate (52.4%) despite the lowest incoherence. The gates are doing real work on the hardest tasks, catching systematic and omission errors before they compound.

Where Errors Live

Plans fail by omission.
Catch it before code exists.

The plan gate catches the most errors. By the time work reaches code review, most issues are gone.

Gate | First-pass approval rate
Plan Review | 39%
Code Review | 61%
Design Review | 63%
Codereview | 72%

The shift-left effect: the plan gate rejects 61% of plans, doing the heavy lifting before any code exists.

Error Patterns by Gate

Each gate catches a different error profile:

  • Plans: 54% omission, 35% systematic, 10.5% incoherent: plans mostly miss things
  • Design: 48% systematic, 36% omission, 16% incoherent: designs pick wrong approaches
  • Code: 55% omission, 29% systematic, 16% incoherent: code leaves things out
  • Codereview: 56% systematic, 44% omission, 0% incoherent: but not because the code is perfectly coherent (see below)

Plans have the lowest incoherence rate (10.5%) while code reviews have the highest (16.0%). The dominant failure mode across all gates is omission, where agents forget things rather than contradict themselves.

Why Codereview Shows 0% Incoherence

The two code review tools have fundamentally different scope:

  • review_code is fully agentic. It has access to the project database, task requirements, and can search and pull in other files from the repo. It reviews artifacts against the plan. It finds 16% incoherence.
  • codereview is file-scoped. It reviews only the specific files it's given, with no project context. It finds 0% incoherence.

Incoherence is a cross-context problem: implementation contradicts the plan, or file A handles something differently than file B. A file-scoped review literally cannot see these contradictions. It can only catch systematic errors (bad patterns within a file) and omissions (missing pieces within a file). Detecting incoherence requires system-level observability: what you can see depends on where you sit in the system, and a component with a narrow interface can only catch errors within its bounded context.

Gate Pipeline Effectiveness

Gate | First-Pass Approval | Rejection Rate
review_plan | 39.2% | 60.8%
review_design | 62.6% | 37.4%
review_code | 60.5% | 39.5%
codereview | 72.3% | 27.7%

The plan gate catches the most errors (a 60.8% rejection rate), filtering problems before they reach downstream gates. The later gates see lower rejection rates because upstream gates have already caught the worst issues.

What Each Gate Catches

  • review_plan: Missing requirements, incomplete scope (54% omission, 35% systematic)
  • review_design: Architecture mismatches, wrong approach (48% systematic, 36% omission)
  • review_code: Missing components, incomplete implementation (55% omission, 29% systematic)
  • codereview: Wrong patterns, missing pieces (56% systematic, 44% omission)

The Implication

If you only have budget for one review gate, make it plan review. It catches the most errors at the lowest cost, before any code is written. The 61% rejection rate at the plan stage is a positive finding, because this is the least expensive place to catch bugs.

Each rejection is an entry in the trust ledger, providing empirical evidence about where this system catches errors and what kinds it catches. Instead of trying to prove alignment, the system is accumulating evidence of trustworthiness, gate by gate.

The Real Hot Mess

Agents generate well.
They revise poorly.

The genuine incoherence signal doesn't appear in the initial work; it manifests in the process that follows rejection.

31% recovery rate after rejection

After NEEDS_REVISION... | %
Then APPROVED | 31.5%
Then REJECTED AGAIN | 54.8%
Then ESCALATED / OTHER | 13.7%

The Pattern

When an agent's work fails review, the next attempt passes only 31.5% of the time. 55% of the time, it fails again. Recovery varies by error type, though the overall rate is low:

Error Type | Recovery Rate | n
Incoherent | 45.0% | 20
Omission | 36.5% | 85
API Error | 37.3% | 67
Systematic | 31.0% | 100

Incoherent errors have the highest recovery rate (45%), possibly because the agent already knows the right approach but applied it inconsistently. Systematic errors are hardest to recover from (31%), requiring a fundamentally different approach.
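The recovery numbers come from looking at what follows each rejection in a gate's decision stream. A minimal sketch of that computation (the study computed this per gate sequence, then pooled):

```python
from collections import Counter

def recovery_stats(decisions):
    """decisions: ordered gate decisions for one task/gate stream.

    Returns the distribution of what immediately follows each
    NEEDS_REVISION, as fractions."""
    outcomes = Counter()
    for prev, nxt in zip(decisions, decisions[1:]):
        if prev == "NEEDS_REVISION":
            outcomes[nxt] += 1
    total = sum(outcomes.values())
    return {k: v / total for k, v in outcomes.items()} if total else {}
```

The "Then APPROVED" share from this transition count is the recovery rate.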

Why This Is the Real "Hot Mess"

The initial work is reasonably good and the system keeps incoherence to 12.7%. But the revision cycle is where the system's weakest interface lives:

  • Agents fail to incorporate specific feedback: the handoff from review output to revision input loses information
  • They sometimes fix one issue while introducing another, corrupting state across the feedback boundary
  • They lose context about what the reviewer actually wanted: the interface contract between "feedback" and "next attempt" is underspecified

This is a state handoff failure at a system boundary. The system decomposes initial work well, but the revision interface, where review feedback must cross into the agent's next attempt, is the weakest link in the pipeline.

Implications

  • Plan for multiple review cycles, not one. Budget for 2-3 rounds.
  • Consider different agents for revision. A fresh agent may be more effective than having the same agent retry.
  • Front-load quality. Getting it right the first time is far more efficient than revision (only 31% of revisions recover).

Three Takeaways

Decompose, don't scale.
Engineer trust, not alignment.

The theory is right: extended reasoning produces incoherence. But the answer may lie not in the model but in building systems that avoid it.

1. The theory is right for models. Build systems.
11-hour release arcs show 10% incoherence because no single agent reasons for 11 hours. Decomposition sidesteps the mechanism.

2. Engineer trust, not alignment.
Alignment is undecidable. Trust is measurable. 5,109 gate checks are a trust ledger providing empirical evidence the system works.

3. Check for completeness, not just coherence.
49% of errors are omissions where the model simply left things out. Checklists and review gates catch the biggest error class.

Theory → Measurement → Practice

Sohl-Dickstein proposed the theory. Hägele et al. measured it on benchmarks. My data shows what happens when you deploy models within an engineered system.

Paper's Setup | My Setup
Single model, single task | Multiple models, cross-validation
Extended reasoning in one context | Decomposed into bounded tasks with fresh contexts
No external structure | Task queues with dependencies
No process documentation | Agents read process docs first
No review gates | Four mandatory review gates
Free-form reasoning | Scoped, restricted agents

The Broader Point: Trust, Not Alignment

The hot mess theory is right about its core mechanism: extended reasoning produces incoherence. But this is a model property, and practitioners don't deploy models, they deploy systems.

Guaranteeing a model will always behave correctly is undecidable in the same way you can't prove a program is bug-free (the halting problem). Framed this way, the alignment problem is unsolvable. Practitioners already know this. You don't solve it. You engineer trust.

Apollo 11 didn't work because every component was formally verified. It worked because NASA engineered a system where failures were caught, contained, and recoverable. The astronauts trusted the system, not the code. The 5,109 gate checks are a trust ledger, the same kind of engineering evidence. Not a proof of alignment, but empirical evidence that the system works.

Implications for AI Safety

  1. Engineer trust, not alignment. Alignment is undecidable; trust is measurable. Invest in orchestration, bounded contexts, and review gates. These are the same tools you'd use for any system built from unreliable components.
  2. Cross-model review is practical ensembling for agents. The paper calls ensembling impractical for irreversible agentic tasks. My data shows cross-model review achieves the same variance reduction without re-executing the task.
  3. The revision cycle is the real safety frontier. Models are reasonably good at first-pass work in decomposed environments. The 31% recovery rate after rejection shows the real incoherence problem: agents struggle to incorporate feedback, not to generate initial work.

What to Build Next

If you're deploying AI agents, invest in trust engineering:

  1. Task decomposition and orchestration: break complex work into bounded tasks; each agent gets a fresh context and a narrow scope
  2. Review gates as trust evidence: each gate check is a ledger entry; accumulated evidence answers "can I trust this system?"
  3. Completeness checks: 49% of errors are omissions, so checklists catch the biggest error class
  4. Revision strategies: the 31% recovery rate means current retry approaches are inefficient; this is the real frontier

Reproducibility

Open tools.
Bring your own logs.

The analysis tool is generic so you can run it on your own Claude Code session data to replicate the study.

Component | Tool
Gate extraction | gate_analyzer.py: streaming JSONL parser
Decision classification | Regex patterns + Gemini Flash Lite
Error classification | Gemini Flash Lite (structured prompt)
Correlation analysis | SQLite joins across databases
Arc classification | arc_analytics.db (13,049 arcs from 45 sessions)

Run It Yourself

pip install google-genai
export GEMINI_API_KEY=<key>

# Analyze your own Claude Code logs (defaults to ~/.claude/projects)
python gate_analyzer.py discover        # Auto-discover gate tools
python gate_analyzer.py extract         # Extract gate checks
python gate_analyzer.py classify        # Classify decisions
python gate_analyzer.py classify-errors # Classify error types
python gate_analyzer.py stats           # Summary statistics
python gate_analyzer.py error-analysis  # Arc correlations

# Or point at a specific directory
python gate_analyzer.py extract --source-dir /path/to/jsonl/logs

The Dataset

Source | Size | Content
Session JSONL files | 3,119 files | Not public (contains proprietary project content)
gate_analytics.db | ~5 MB | 5,109 extracted gate checks
arc_analytics.db | ~8 MB | 13,049 classified work arcs

The raw session data is not publicly available, but the tool is generic so anyone using Claude Code with review gate MCP tools can reproduce this analysis on their own logs.

Limitations

  • Single operator: All data comes from one person's workflow, which may not generalize
  • Classification accuracy: Error type labels assigned by Gemini: potential systematic bias
  • Arc join coverage: Arc correlation joins gate checks to arcs via session-file identity and timestamp boundaries. Covers 85.3% of checks (4,357/5,109). The remaining 15% occur in spawned agent sessions.
  • Sample sizes: Some subcategories have small n. Error-type breakdowns for release arcs (n=50 classified rejections) and agent-count bins (n=49–52) differ from total gate-check counts (e.g., release arcs n=126 total checks).
  • Confounding variables: Release arcs may have lower incoherence because they operate on more mature codebases

References

  • Sohl-Dickstein, J. (2023). "The hot mess theory of AI misalignment." Blog post.
  • Hägele, A., et al. (2026). "The Hot Mess of AI." Anthropic Fellows Program. ICLR 2026.
  • Naur, P. (1985). "Programming as Theory Building." Microprocessing and Microprogramming.
  • Rothrock, M. (2025-2026). "543 Hours of Autonomous AI." michael.roth.rocks

About

Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.

LinkedIn  ·  GitHub  ·  michael.roth.rocks

Addendum: Population Test

664 public sessions.
The pattern holds.

I tested the taxonomy against DataClaw's public dataset. Unstructured work confirms the hot mess hypothesis.

Error Type | Population | Gated | Delta
Systematic | 39.3% | 37.9% | +1.4pp
Incoherent | 25.2% | 12.7% | +12.5pp
Omission | 35.5% | 49.4% | -13.9pp

Incoherence scales with length in unstructured work (23% → 27%), but decomposition flattens the slope (2.7pp vs 8.1pp increase).

Data Source

664 public AI coding sessions from the DataClaw project (Pete O'Mallet). Sessions span Dec 2025 – Feb 2026, three contributors, 85% Opus-class models with additional coverage of GPT-5, Kimi, MiniMax, and GLM. Each session includes timestamped messages and tool call sequences. Errors classified by Gemini using the same SYSTEMATIC/INCOHERENT/OMISSION taxonomy.

Incoherence by Session Length

Session Length | Errors | Incoherent | Rate
10-29 turns | 172 | 40 | 23.3%
30-99 turns | 451 | 107 | 23.7%
100-299 turns | 842 | 210 | 24.9%
300+ turns | 819 | 219 | 26.7%

The Key Test: Length × Decomposition

Length | Unstructured | Decomposed | Δ
Short (< 30) | 23.0% | 22.0% | +1.0pp
Medium (30-99) | 25.3% | 22.5% | +2.8pp
Long (100+) | 31.1% | 24.7% | +6.4pp

Unstructured sessions see an 8.1pp increase from short to long. Decomposed sessions see 2.7pp. The gap widens with length because task decomposition moderates the relationship between duration and incoherence.
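The moderation claim reduces to simple arithmetic on the table's rates, which can be checked directly:

```python
def length_slope(rates):
    """Incoherence increase from short to long sessions, in percentage points."""
    return round((rates["long"] - rates["short"]) * 100, 1)

# Rates from the length x decomposition table above.
unstructured = {"short": 0.230, "medium": 0.253, "long": 0.311}
decomposed   = {"short": 0.220, "medium": 0.225, "long": 0.247}
```

The unstructured slope (8.1pp) versus the decomposed slope (2.7pp) is the flattening effect.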

Recovery Without Gates

Error Type | Population | Gate-Mediated
Systematic | 57.2% | 31.0%
Incoherent | 56.9% | 45.0%
Omission | 43.3% | 36.5%

Population recovery is higher because humans correct obvious errors. Gates catch cross-context issues that humans don't surface, likely a harder class of errors.

What This Means

The single-operator limitation is partially addressed. The same directional findings appear in independent data from different practitioners: incoherence is elevated without structure, and decomposition moderates the length-incoherence relationship. The gated workflow doesn't just reduce errors, it specifically suppresses the incoherent ones.