The Hot Mess
Field data from 5,109 cross-model review gates.
Anthropic's "Hot Mess of AI" paper argues that frontier AI failures are dominated by incoherence: random, contradictory errors rather than systematic mistakes. 97 days of field data from autonomous AI development tells a different story.
5,109 gate checks · 1,450 genuine rejections · 97 days
In 2023, Sohl-Dickstein proposed the "hot mess theory of AI misalignment": more intelligent agents behave less coherently. In 2026, Hägele et al. (Anthropic) operationalized this empirically, finding that extended reasoning increases incoherence: variance-driven, unpredictable failures rather than systematic pursuit of wrong goals.
Both frame this as a model problem. I provide a practitioner's response: it's a system problem, and practitioners have been engineering trustworthy systems from unreliable components for decades.
For 97 days (October 2025 – January 2026), I operated an autonomous AI development system: not a model in isolation, but an orchestrated pipeline in which Claude generates work, Gemini validates it through four mandatory review gates, and task decomposition ensures no single agent ever faces a long reasoning chain.
This gave me something the paper didn't have: structured feedback on every error an AI agent made in production work, within a system designed for reliability.
| Metric | Value |
|---|---|
| Study period | Oct 2, 2025 – Jan 2026 (97 days) |
| Total gate checks | 5,109 |
| Genuine rejections | 1,450 |
| Concurrent projects | 8 |
| Autonomous hours | 543 |
| Shipped releases | 165 |
I classified every one of those 1,450 rejections and correlated them with work complexity, gate type, and recovery outcomes. The results reframe the question from model alignment to system trust, and point to a practical solution.
This research extends the 543 Hours study. See that presentation for the full autonomous workflow methodology.
The Claim
The paper decomposes AI errors into bias and variance. Their finding: variance dominates.
Bias: Wrong but consistent. The model reliably pursues the wrong approach. Predictable. Fixable with better training.
Variance: Wrong and random. The model contradicts itself, handles things inconsistently. Unpredictable. The "hot mess."
"Scale alone will not solve reliability." Implication of Sohl-Dickstein (2023) and Hägele et al. (2026)
Sohl-Dickstein's proposed mechanism: as capability increases, agents explore more of the solution space, and that exploration produces incoherence. Hägele et al. (2026) operationalized this empirically, testing models on benchmarks.
If the paper is right, autonomous AI agents face a fundamental reliability ceiling. As tasks get harder and require more reasoning, the agent becomes increasingly unpredictable. You can't just use a better model because the problem is structural.
Both the theory and the empirical work frame the problem at the model level, as a property of the agent itself. This is the researcher's frame. The practitioner's frame is different.
Practitioners don't deploy models. They deploy systems: orchestration layers, review gates, task queues, process documentation, bounded contexts. The model is one component. The question isn't "is this model aligned?" Instead, it is "can I trust this system?"
Guaranteeing a model will always behave correctly is undecidable, in the same way you can't prove an arbitrary program bug-free. Practitioners have always known this. You don't solve it; you engineer around it with review, testing, and bounded blast radius. The question: does system-level trust engineering break the link between task complexity and incoherence?
Sohl-Dickstein, J. (2023). "The hot mess theory of AI misalignment." Blog post.
Hägele, A., et al. (2026). "The Hot Mess of AI." Anthropic Fellows Program. ICLR 2026.
The Dataset
Not two models, but one system. Four gates. Every review outcome recorded.
System, not model: Cross-model review is one component. Task decomposition, bounded contexts, and process documentation are the others.
| Gate | When | What It Verifies | Checks |
|---|---|---|---|
| review_plan | Before implementation | Approach aligns with project goals, no gaps | 1,193 |
| review_design | After design, before code | Design satisfies task acceptance criteria | 1,491 |
| codereview | During implementation | File-scoped code quality review (no project context) | 340 |
| review_code | After all work complete | Agentic verification against project plan and requirements—can search repo, pull in other files | 2,085 |
| Decision | Count | % |
|---|---|---|
| APPROVED | 2,817 | 55.1% |
| NEEDS_REVISION | 1,918 | 37.5% |
| ESCALATE | 201 | 3.9% |
| UNKNOWN | 173 | 3.4% |
I extracted review gate data from 3,119 Claude Code session JSONL files by matching tool_use blocks to their tool_result responses via tool_use_id. Gate tools were auto-discovered using Gemini to classify tool names from session metadata.
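The pairing step can be sketched roughly like this. It is a minimal illustration, not the actual gate_analyzer.py implementation: the real Claude Code session schema nests content differently, and the field names here are simplified assumptions.

```python
import json

def extract_gate_checks(jsonl_lines, gate_tools):
    """Pair each gate tool_use block with its tool_result via tool_use_id.

    jsonl_lines: iterable of raw JSONL record strings from a session file.
    gate_tools: set of tool names previously identified as review gates.
    """
    pending = {}  # tool_use_id -> gate tool name, awaiting its result
    checks = []   # (gate_name, result_text) pairs
    for line in jsonl_lines:
        record = json.loads(line)
        for block in record.get("content", []):
            if block.get("type") == "tool_use" and block.get("name") in gate_tools:
                pending[block["id"]] = block["name"]
            elif block.get("type") == "tool_result" and block.get("tool_use_id") in pending:
                checks.append((pending.pop(block["tool_use_id"]), block.get("content", "")))
    return checks
```

Streaming line by line keeps memory flat even across thousands of session files.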
Of the 1,918 NEEDS_REVISION decisions, 1,450 were genuine quality rejections; the remainder were infrastructure failures (e.g. API errors) rather than substantive feedback.
Most of the 201 ESCALATE decisions were also infrastructure failures rather than genuine quality escalations.
Decision classification used a two-pass approach: regex pattern matching (2,920 classified), then Gemini semantic evaluation (422 remaining). This reduced UNKNOWN from initial extraction to 3.4%.
The paper calls ensembling "impractical for irreversible agentic tasks." Cross-model review solves that: review the output with a different model, no re-execution needed. But ensembling alone is not the main finding. The system also decomposes work into bounded tasks, externalizes state into task queues and process docs, and enforces contracts at each gate. Reliability emerges from the system architecture, not from any single component.
Every one of these 5,109 gate checks is an entry in a trust ledger: empirical evidence about what works, what fails, and where. This isn't an alignment proof. It's the kind of engineering evidence practitioners use to answer: "can I trust this system?"
How I Classified
Three error types. Each with a different cause. And a different fix.
Systematic: Wrong but internally consistent. Misunderstood requirements, chose the wrong architecture, applied a pattern incorrectly. The work is coherent but incorrect.
Incoherent: Internally inconsistent. Handled something correctly in one place but not another. Contradicted its own plan. Random quality variation. The "hot mess."
Omission: Simply left out. Not wrong, not contradictory, just missing. A requirement was skipped or a component was not implemented.
| My Category | Paper's Term | Nature |
|---|---|---|
| Systematic | Bias | Predictable, consistent, fixable with better context |
| Incoherent | Variance | Unpredictable, contradictory, the "hot mess" |
| Omission | — | Neither bias nor variance, simply incomplete |
The paper's framework has two categories. I added a third, omission, because my data showed a large cluster of errors that were neither wrong nor contradictory, just incomplete.
Systematic errors are wrong, but the agent's work is internally consistent: it just chose the wrong approach.
Incoherent errors mean the agent knew the right thing, and did it elsewhere, but failed to apply it consistently.
Omissions leave nothing wrong with what was built; there's just a gap.
All 1,450 genuine rejections were classified by Gemini Flash Lite with a structured prompt requiring exactly one label: SYSTEMATIC, INCOHERENT, or OMISSION. Spot-checking showed reasonable agreement with human judgment.
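The classification step can be sketched as prompt construction plus strict label parsing. The prompt wording below is illustrative, not the exact prompt used; the actual call goes to Gemini Flash Lite via the google-genai SDK, which is stubbed out here.

```python
ERROR_LABELS = ("SYSTEMATIC", "INCOHERENT", "OMISSION")

def build_prompt(rejection_text):
    """Structured prompt demanding exactly one taxonomy label (illustrative wording)."""
    return (
        "Classify this review rejection with exactly one label.\n"
        "SYSTEMATIC: wrong but internally consistent (wrong approach, coherently executed).\n"
        "INCOHERENT: internally inconsistent (contradicts its own plan or other parts).\n"
        "OMISSION: something required was simply left out.\n"
        f"Rejection:\n{rejection_text}\n"
        "Answer with one word: SYSTEMATIC, INCOHERENT, or OMISSION."
    )

def parse_label(model_reply):
    """Accept only an exact label; anything else is flagged for re-query or manual review."""
    word = model_reply.strip().upper().rstrip(".")
    return word if word in ERROR_LABELS else None
```

Strict parsing matters: a free-form reply that merely mentions a label is rejected rather than guessed at, which keeps the taxonomy counts honest.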
The Results
The dominant failure mode isn't the "hot mess." It's forgetting things entirely.
"Agents don't usually produce wrong or contradictory work. They produce incomplete work."
| Error Type | Count | % |
|---|---|---|
| Omission | 716 | 49.4% |
| Systematic | 550 | 37.9% |
| Incoherent | 184 | 12.7% |
These 1,450 rejections form a trust ledger. Each is empirical evidence about what goes wrong when AI agents do real work within an engineered system. The data answers a practitioner's question: "can I trust this system?", not a researcher's question: "is this model aligned?"
The theory's predictions hold for models in isolation. Within an engineered system, the picture is different:
Nearly half of all errors are omissions where the agent simply left something out. A requirement was skipped, a component wasn't implemented, etc. This is a solvable problem: checklists and review gates catch omissions reliably.
Another 38% are systematic — wrong approach, coherently executed. Together, 87% of errors are the kind you can catch with structured process.
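Omission-catching is mechanical once requirements are explicit. A minimal sketch of the idea (the real gates do this through review prompts, not literal set arithmetic):

```python
def check_completeness(plan_items, completed_items):
    """Compare a plan's checklist against reported completions.

    Omissions are not wrong or contradictory work; they are plan items
    with no corresponding completion at all.
    """
    completed = {item.strip().lower() for item in completed_items}
    missing = [item for item in plan_items if item.strip().lower() not in completed]
    return {"complete": not missing, "missing": missing}
```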
| Claim | Paper Predicts | My Data Shows |
|---|---|---|
| Dominant error type | Incoherence (variance) | Omission (49%) |
| Incoherence prevalence | Majority of errors | 12.7% of errors |
| Error predictability | Low (random failures) | High (87.3% predictable) |
Decomposition vs Extended Reasoning
The paper predicts 11-hour release arcs should show the highest incoherence. They show among the lowest, because no single agent ever reasons for 11 hours.
"Decomposition breaks the link between task complexity and reasoning chain length, the underlying mechanism the paper identifies as driving incoherence."
Sohl-Dickstein's mechanism: extended reasoning → more self-contradiction → incoherence scales with chain length. System decomposition changes the operating regime.
Release arcs are the longest by far, averaging 11 hours of total work. The theory predicts they should show the highest incoherence. They show among the lowest (10.0%). Build arcs (9.4%) are similarly low. Feature arcs, where agents work in longer, less decomposed chains, show the highest (19.8%).
The key: no single agent in a release arc reasons for 11 hours. An orchestrator decomposes the work into bounded tasks, each handled by a fresh agent with a scoped context. The "long run" is emergent from many short runs.
Feature arcs involve novel functionality such as new APIs, new UI components, or new integrations. They are the least decomposed work type: a single agent often carries a feature from start to finish, reasoning over a longer chain in unfamiliar territory. This is the exact scenario the paper predicts will produce the most incoherence, and it does.
Build and release arcs use the "burn down" pattern: an orchestrator works through a task queue, dispatching each bounded task to a fresh agent. That decomposition is what achieves the low incoherence.
The system prevents any single model from accumulating state across an 11-hour reasoning chain because the system never asks it to. Incoherence is a model property; decomposition is a system property. The system sidesteps the model's limitation.
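The burn-down shape can be sketched as an orchestrator loop: each bounded task gets a fresh agent call with only its scoped context, so no agent's reasoning chain spans the whole arc. Agent and reviewer internals are stubbed as callables here; this is a structural sketch, not the production orchestrator.

```python
def run_arc(tasks, run_agent, review, max_revisions=2):
    """Work through a task queue; each bounded task gets a fresh agent call.

    The long arc is emergent from many short runs: no single agent
    accumulates state across the whole chain.
    """
    shipped = []
    for task in tasks:
        work = run_agent(task)  # fresh, scoped context per task
        for _ in range(max_revisions):
            verdict = review(work)
            if verdict == "APPROVED":
                break
            # Revision stays inside the bounded context of this one task
            work = run_agent(task, feedback=verdict)
        shipped.append(work)
    return shipped
```

The key design choice: state lives in the task queue, not in any agent's context window.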
| Agents | Incoherent | Systematic | Omission | n |
|---|---|---|---|---|
| 0–2 | 12.6% | 34.3% | 53.1% | 983 |
| 3–5 | 16.3% | 40.4% | 43.3% | 104 |
| 6–10 | 13.5% | 44.2% | 42.3% | 52 |
| 11–20 | 14.5% | 30.6% | 54.8% | 62 |
| 20+ | 10.2% | 40.8% | 49.0% | 49 |
Incoherence rates are roughly flat at 10–16% regardless of how many agents are involved. The paper's concern that multi-agent systems compound incoherence is not supported, though neither is a strong claim that more agents help.
| Arc Type | Rejection Rate | n |
|---|---|---|
| Interactive | 36.5% | 3,249 |
| Build | 42.2% | 455 |
| Quick | 42.9% | 233 |
| Feature | 49.3% | 294 |
| Release | 52.4% | 126 |
Release arcs have the highest rejection rate (52.4%) despite the lowest incoherence. The gates are doing real work on the hardest tasks, catching systematic and omission errors before they compound.
Where Errors Live
The plan gate catches the most errors. By the time work reaches code review, most issues are gone.
The shift-left effect: the plan gate does the heavy lifting, rejecting 61% of plans before any implementation work begins.
Each gate catches a different error profile:
Plans have the lowest incoherence rate (10.5%) while code reviews have the highest (16.0%). The dominant failure mode across all gates is omission, where agents forget things rather than contradict themselves.
The two code review tools have fundamentally different scope:
Incoherence is a cross-context problem: implementation contradicts the plan, or file A handles something differently than file B. A file-scoped review literally cannot see these contradictions. It can only catch systematic errors (bad patterns within a file) and omissions (missing pieces within a file). Detecting incoherence requires system-level observability: what you can see depends on where you sit in the system, and a component with a narrow interface can only catch errors within its bounded context.
| Gate | First-Pass Approval | Rejection Rate |
|---|---|---|
| review_plan | 39.2% | 60.8% |
| review_design | 62.6% | 37.4% |
| review_code | 60.5% | 39.5% |
| codereview | 72.3% | 27.7% |
The plan gate catches the most errors—60.8% rejection rate—filtering problems before they reach downstream gates. Design and code reviews see progressively lower rejection rates as upstream gates have already caught the worst issues.
If you only have budget for one review gate, make it plan review. It catches the most errors at the lowest cost, before any code is written. The 61% rejection rate at the plan stage is a positive finding, because this is the least expensive place to catch bugs.
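The shift-left numbers are simple frequencies over the ledger. A sketch of the computation, assuming (gate, decision) pairs as the ledger rows:

```python
from collections import Counter

def gate_rejection_rates(checks):
    """checks: iterable of (gate_name, decision) pairs from the trust ledger."""
    totals, rejected = Counter(), Counter()
    for gate, decision in checks:
        totals[gate] += 1
        if decision == "NEEDS_REVISION":
            rejected[gate] += 1
    return {gate: rejected[gate] / totals[gate] for gate in totals}
```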
Each rejection is an entry in the trust ledger, providing empirical evidence about where this system catches errors and what kinds it catches. Instead of trying to prove alignment, the system is accumulating evidence of trustworthiness, gate by gate.
The Real Hot Mess
The genuine incoherence signal doesn't appear in the initial work; it manifests in the process that follows rejection.
31.5% recovery rate after rejection
| After NEEDS_REVISION... | % |
|---|---|
| Then APPROVED | 31.5% |
| Then REJECTED AGAIN | 54.8% |
| Then ESCALATED / OTHER | 13.7% |
When an agent's work fails review, the next attempt passes only 31.5% of the time. 55% of the time, it fails again. Recovery varies by error type, though the overall rate is low:
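Recovery is measured over ordered per-task decision sequences: after each NEEDS_REVISION, look at what the next decision was. A minimal sketch:

```python
def recovery_rate(decisions):
    """decisions: one task's gate decisions in chronological order.

    Returns the fraction of NEEDS_REVISION decisions whose immediate
    follow-up was APPROVED, or None if there were no retries to measure.
    """
    retries = approvals = 0
    for prev, nxt in zip(decisions, decisions[1:]):
        if prev == "NEEDS_REVISION":
            retries += 1
            approvals += (nxt == "APPROVED")
    return approvals / retries if retries else None
```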
| Error Type | Recovery Rate | n |
|---|---|---|
| Incoherent | 45.0% | 20 |
| Omission | 36.5% | 85 |
| API Error | 37.3% | 67 |
| Systematic | 31.0% | 100 |
Incoherent errors have the highest recovery rate (45%, though n=20 is small), possibly because the agent already knows the right approach but applied it inconsistently. Systematic errors are hardest to recover from (31%), since they require a fundamentally different approach.
The initial work is reasonably good and the system keeps incoherence to 12.7%. But the revision cycle is where the system's weakest interface lives:
This is a state handoff failure at a system boundary. The system decomposes initial work well, but the revision interface, where review feedback must cross into the agent's next attempt, is the weakest link in the pipeline.
Three Takeaways
The theory is right: extended reasoning produces incoherence. But the answer may lie not in fixing the model, but in building systems that avoid the failure mode.
1. The theory is right—for models. Build systems.
11-hour release arcs show 10% incoherence — because no single agent reasons for 11 hours. Decomposition sidesteps the mechanism.
2. Engineer trust, not alignment.
Alignment is undecidable. Trust is measurable. 5,109 gate checks are a trust ledger providing empirical evidence the system works.
3. Check for completeness, not just coherence.
49% of errors are omissions where the model simply left things out. Checklists and review gates catch the biggest error class.
Sohl-Dickstein proposed the theory. Hägele et al. measured it on benchmarks. My data shows what happens when you deploy models within an engineered system.
| Paper's Setup | My Setup |
|---|---|
| Single model, single task | Multiple models, cross-validation |
| Extended reasoning in one context | Decomposed into bounded tasks with fresh contexts |
| No external structure | Task queues with dependencies |
| No process documentation | Agents read process docs first |
| No review gates | Four mandatory review gates |
| Free-form reasoning | Scoped, restricted agents |
The hot mess theory is right about its core mechanism: extended reasoning produces incoherence. But this is a model property, and practitioners don't deploy models, they deploy systems.
Guaranteeing a model will always behave correctly is undecidable in the same way you can't prove a program is bug-free (the halting problem). Framed this way, the alignment problem is unsolvable. Practitioners already know this. You don't solve it. You engineer trust.
Apollo 11 didn't work because every component was formally verified. It worked because NASA engineered a system where failures were caught, contained, and recoverable. The astronauts trusted the system, not the code. The 5,109 gate checks are a trust ledger, the same kind of engineering evidence. Not a proof of alignment, but empirical evidence that the system works.
If you're deploying AI agents, invest in trust engineering: cross-model review gates (plan review first), task decomposition into bounded contexts, completeness checks for the dominant omission class, and a recorded ledger of every review outcome.
Reproducibility
The analysis tool is generic so you can run it on your own Claude Code session data to replicate the study.
| Component | Tool |
|---|---|
| Gate extraction | gate_analyzer.py: streaming JSONL parser |
| Decision classification | Regex patterns + Gemini Flash Lite |
| Error classification | Gemini Flash Lite (structured prompt) |
| Correlation analysis | SQLite joins across databases |
| Arc classification | arc_analytics.db (13,049 arcs from 45 sessions) |
Foundation study: 543 Hours of Autonomous Work in 97 Days
```sh
pip install google-genai
export GEMINI_API_KEY=<key>

# Analyze your own Claude Code logs (defaults to ~/.claude/projects)
python gate_analyzer.py discover        # Auto-discover gate tools
python gate_analyzer.py extract         # Extract gate checks
python gate_analyzer.py classify        # Classify decisions
python gate_analyzer.py classify-errors # Classify error types
python gate_analyzer.py stats           # Summary statistics
python gate_analyzer.py error-analysis  # Arc correlations

# Or point at a specific directory
python gate_analyzer.py extract --source-dir /path/to/jsonl/logs
```
| Source | Size | Content |
|---|---|---|
| Session JSONL files | 3,119 files | Not public (contains proprietary project content) |
| gate_analytics.db | ~5 MB | 5,109 extracted gate checks |
| arc_analytics.db | ~8 MB | 13,049 classified work arcs |
The raw session data is not publicly available, but the tool is generic so anyone using Claude Code with review gate MCP tools can reproduce this analysis on their own logs.
Michael Rothrock is a software engineering leader with 35 years of experience building trusted systems. This research documents patterns discovered through daily use of autonomous AI agents across 8 concurrent projects.
Addendum: Population Test
I tested the taxonomy against DataClaw's public dataset. Unstructured work confirms the hot mess hypothesis.
| Error Type | Population | Gated | Delta |
|---|---|---|---|
| Systematic | 39.3% | 37.9% | +1.4pp |
| Incoherent | 25.2% | 12.7% | +12.5pp |
| Omission | 35.5% | 49.4% | -13.9pp |
Incoherence scales with length in unstructured work (23% → 27%), but decomposition flattens the slope (2.7pp vs 8.1pp increase).
664 public AI coding sessions from the DataClaw project (Pete O'Mallet). Sessions span Dec 2025 – Feb 2026, three contributors, 85% Opus-class models with additional coverage of GPT-5, Kimi, MiniMax, and GLM. Each session includes timestamped messages and tool call sequences. Errors classified by Gemini using the same SYSTEMATIC/INCOHERENT/OMISSION taxonomy.
| Session Length | Errors | Incoherent | Rate |
|---|---|---|---|
| 10-29 turns | 172 | 40 | 23.3% |
| 30-99 turns | 451 | 107 | 23.7% |
| 100-299 turns | 842 | 210 | 24.9% |
| 300+ turns | 819 | 219 | 26.7% |
| Length | Unstructured | Decomposed | Δ |
|---|---|---|---|
| Short (< 30) | 23.0% | 22.0% | +1.0pp |
| Medium (30-99) | 25.3% | 22.5% | +2.8pp |
| Long (100+) | 31.1% | 24.7% | +6.4pp |
Unstructured sessions see an 8.1pp increase from short to long. Decomposed sessions see 2.7pp. The gap widens with length because task decomposition moderates the relationship between duration and incoherence.
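The moderation claim reduces to a difference of short-to-long increases, using the rates from the table above:

```python
def short_to_long_increase(rates_by_length):
    """rates_by_length: incoherence rates (%) keyed by length bucket."""
    return rates_by_length["long"] - rates_by_length["short"]

# Rates taken from the length-bucket table above
unstructured = {"short": 23.0, "medium": 25.3, "long": 31.1}
decomposed = {"short": 22.0, "medium": 22.5, "long": 24.7}
# Unstructured rises 8.1pp, decomposed 2.7pp: decomposition flattens the slope
```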
| Error Type | Population | Gate-Mediated |
|---|---|---|
| Systematic | 57.2% | 31.0% |
| Incoherent | 56.9% | 45.0% |
| Omission | 43.3% | 36.5% |
Population recovery is higher because humans correct obvious errors. Gates catch cross-context issues that humans don't surface, likely a harder class of errors.
The single-operator limitation is partially addressed. The same directional findings appear in independent data from different practitioners: incoherence is elevated without structure, and decomposition moderates the length-incoherence relationship. The gated workflow doesn't just reduce errors, it specifically suppresses the incoherent ones.