97 Days of Logs

What happens when AI
runs while you sleep

543 hours of work I didn't do myself

One person. Six concurrent projects. 165 shipped releases. The infrastructure that made it possible—and the receipts to prove it.

Introduction

There is a great divide in the conversation about applying AI to coding. Some people report massive personal productivity gains; many others don't share that experience. It's tempting to write the gap off as a difference in skill, but what, concretely, do the top practitioners do that produces such outsized results?
This document answers that question by examining the actual Claude Code logs of one high-performing user. It shows the general workflow and the mechanisms used to produce large amounts of production-ready output across multiple concurrent projects.
The user has 35 years of professional experience in SaaS and software engineering, spanning both implementation and management. The processes and supporting tools encode engineering-management best practices and lessons learned over that career.

How to Read This Document

This presentation has three viewing modes. Press D to cycle between them:

Mode | Purpose | Best For
Presentation | Full-screen slides with key insights | Talks, quick overview
Detail | Slide + expanded analysis below | Deep reading, exploration
Document | Scrollable long-form with all content | Reference, printing

The slides tell the story. The detail sections provide evidence, methodology, and nuance. You're currently in Detail or Document mode—that's why you can see this text.

What You'll Learn

This document answers one question: How does a user get AI coding assistants to work autonomously over long periods to produce complex deliverables at an acceptable level of quality?

This is shown through a forensic examination of chat logs with notes about why the user made the relevant decisions. It is taken from 97 days of real logs from one developer running six concurrent projects with Claude Code.

The Journey

Section | What It Covers
The Hook | A single prompt that triggered 13 hours of autonomous work
The Numbers | 543 hours, roughly $500/month, and what the math actually looks like
The Pyramid | Where human time goes when AI handles execution
The Infrastructure | Four pillars that make long-range autonomy possible
The Economics | Why quality gates pay for themselves
Start Monday | Concrete steps to begin building your own scaffolding
Appendices | Methodology, cost breakdown, code quality evidence

The Core Claim

With the right infrastructure, one person can produce the output of multiple engineering teams in parallel—not by working harder, but by building scaffolding that lets AI magnify human strengths.

The rest of this document shows exactly how it works, with receipts.

The Data

Everything here came from
one developer's chat logs.

Not a team. Not a survey. Parsed directly from 97 days of Claude Code sessions— six concurrent projects, 2,314 agent sessions, all verifiable.

14,926 prompts  ·  2,314 agent sessions  ·  543 autonomous hours

Why This Matters

When you see aggregate statistics—thousands of prompts, hundreds of hours—the natural assumption is "a team did this." That assumption makes the data feel distant, organizational, not personally achievable.

The truth: this is what happens when you build the right infrastructure—and let it run.

The Leverage Equation

Traditional scaling requires hiring. You need more people to do more work. The constraint is headcount, budget, coordination overhead.

With the right AI scaffolding, one person can:

  • Run 6 projects concurrently (each receiving 5-7 person-equivalent output)
  • Execute 543 hours of autonomous work over 97 days
  • Ship 165 releases that passed CI/CD
  • Run a 4-layer automated verification pipeline (unit → E2E → Playwright → visual)

Infrastructure, Not Talent

This output isn't the result of working 16-hour days. It's the result of:

  • Scaffolding: 60 MCP tools that encode workflow knowledge
  • Process: Four-phase workflow (exploration → planning → implementation → review)
  • Review gates: 2,974 quality checks—Gemini validates Claude's work against WAF pillars
  • Cost prevention: Bad ideas rejected at $1 (design phase), not $100 (production)

The rest of this presentation shows exactly how it works—and how you can replicate it.

The Hook

1 prompt → complete feature release

"Kicked it off. Came back 13 hours later. Done."

Analyze: 40 min · Design: 40 min · Implement: 10+ hours · Ship: 2 hours
What shipped: Backend services + event pipeline + data layer + frontend components + Terraform + integration tests


"I typed one sentence. Went to bed. Woke up to a deployable release."

What "One Prompt" Actually Produced

This isn't the legacy process re-created by spawning agents to play the roles humans fill today. The orchestrator executed a well-structured plan—reading dependency chains from the database, sequencing work accordingly, monitoring progress, and adapting.

The Four Phases That Made This Possible

That 13-hour autonomous run was part of a larger workflow:

Phase | Mode | What Happens
1. Exploration | Interactive | Human + AI discover what to build. Back-and-forth discussion, research, prototyping ideas.
2. Planning | Largely autonomous | AI creates tasks, captures dependency chains, structures the release. Gemini validates via review_plan.
3. Implementation | Fully autonomous | Orchestrator reads the plan from the database and executes it. This is the "13 hours" part.
4. Review | Autonomous loop | Gemini runs review_code on output. Agent fixes issues until it passes. 867 review calls total.

Key insight: The orchestrator didn't "figure out" dependencies—it read dependency chains that were captured during planning. The leverage is in the setup, not just the execution.

13 hours autonomous
4 waves of agents
1,212 files touched
6,670 code edits

What Actually Shipped

Backend (Go)
  • Domain state machine (3 states, transition rules)
  • New data model + repository layer
  • Core service with business logic
  • Pub/Sub event pipeline (3 event types)
  • Upsert operations with conflict resolution
Frontend + Infra
  • TypeScript interfaces + display components
  • Reusable UI component library additions
  • Terraform configs for new resources
  • Cloud Build pipeline updates
  • Integration tests (full coverage)

Autonomous Project Management

After receiving the prompt, the orchestrator's first action was to analyze the work:

"Current v2.5 Status: 60 total tasks (13 cancelled). 40 tasks need design (must go through design_documents → review_design before implementation). 7 tasks in todo (can be claimed directly)."

It identified that most tasks were blocked by dependencies. Its strategic response:

  1. Spawn a Backend Design Agent to work through the design tasks (unblocking the chain)
  2. Spawn a Frontend Agent to work on ready-to-go tasks in parallel

The Orchestration Loop

After each wave completed, the orchestrator checked status and decided what to do next:

Time | Progress | Orchestrator Decision | Agents
+0.0h | 0% | Analyze dependencies → spawn design + frontend agents | 2
+0.6h | 18% | Design done → spawn implementation agents for Phases 1-3 | 3
+11.0h | 58% | Core complete → spawn agents for remaining work | 3
+12.3h | 68% | Almost done → spawn final frontend + test agents | 2
+12.9h | 100% | "v2.5: Event Processing Pipeline - COMPLETE!" |

Why This Matters

The orchestrator did the work of a project manager executing a plan:

  • Dependency reading: Queried the database for pre-captured dependency chains
  • Strategic sequencing: Started with design to unblock implementation (following the graph)
  • Progress monitoring: Checked status after each wave
  • Adaptive execution: Adjusted agent focus based on what remained
  • Completion verification: Confirmed all tasks done before reporting complete

The heavy lifting happened earlier: during exploration (figuring out what to build) and planning (capturing the structure and acceptance criteria). The autonomous execution during implementation is return on that up-front investment.

Scale of the Work

Analysis of the actual session logs shows what those 10 agents did:

Metric | Count | What It Means
Shell commands | 21,464 | Tests run, builds triggered, git ops
File reads | 8,474 | Understanding existing code
Code edits | 6,670 | Modifications to source files
Files created | 983 | New source files written
Source files touched | 1,212 | .go, .ts, .tf, .proto, etc.
Total tool calls | 66,325 | Autonomous actions taken
Note: These numbers are from parsing the actual JSONL chat logs, not estimates.

The Actual Prompt (Implementation Phase)

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release v2.5].

That's it. 47 words. But this prompt only works because of what it points to.

What Made This Possible: The Resource Documents

The key phrase is "review the docs/prompts exposed as resources." Those resources define the entire workflow:

Resource | What It Teaches the AI
development-workflow-sequence | Complete methodology: walking skeleton approach, 4-phase workflow, quality gates
determine-needed-agents | Decision matrix: when to spawn agents vs. do work yourself, capacity guidelines
escalation-decision-matrix | 5-level escalation framework with specific triggers for each level
coordinator-troubleshooting-guide | Real-world patterns: "agent claims are 50% accurate—always verify with database"
These aren't vague guidelines—they're operational playbooks with decision trees, copy-paste commands, and hard-won lessons from production use. The AI reads them, internalizes the process, then executes it.

The workflow lives in the resources, not in the prompt.

The Other Enabler: Structured Data

The resources teach the process. The database provides the data:

  • Task graph: 60 tasks with dependency chains captured during planning
  • Status tracking: Which tasks are ready, blocked, or complete
  • Release scope: What belongs to this release vs. future work
SELECT * FROM tasks WHERE release_id = 'v2.5'
  AND status = 'ready'
  AND all_dependencies_complete = true

Resources define how to work. The database defines what to work on. Together, they enable a 47-word prompt to trigger 13 hours of coherent execution.

Terminology

Term | Definition
Arc | A period of autonomous work triggered by one user prompt. Can last minutes (quick arc) or hours (release arc).
Orchestrator | The main AI session that manages the workflow—analyzing tasks, spawning agents, monitoring progress.
Work item | A discrete unit of work tracked in the project database—a feature, bug fix, test, or refactor.
Wave | A batch of agents spawned to work in parallel, followed by a status check before the next wave.

The Question

Chat is not a workflow.

Most AI adoption stops at "type a prompt, get a response." That's using a power tool as a hammer. What happens when you build around it instead?

"The gains don't come from better prompts. They come from different infrastructure."

Two Modes of AI Use

Most people use AI in one mode:

  1. Interactive: Human types, AI responds, human reads, repeat. The AI accelerates individual tasks but doesn't change the workflow structure.

The second mode is different:

  1. Autonomous: Human defines the work, AI executes without supervision, human reviews the result. The workflow runs while you're away.

What Makes Autonomous Mode Possible

You can't just "tell the AI to do more." You need infrastructure:

  • Tools: APIs and commands the AI can call without human mediation
  • State: Persistent context so the AI knows where it is in a workflow
  • Gates: Automated checks that catch errors before they compound

The rest of this presentation shows what that infrastructure looks like—and how it produces 543 hours of work from one person.

Related Reading

Ivan Zhao's essay "Steam, Steel, and Infinite Minds" (December 2025) explores similar themes—how AI changes knowledge work at an organizational level.

The Pattern

29 release arcs.
Same structure every time.

97 days of release arcs
"If v2.5 happened once, you'd call it a demo. A fluke."
[Chart: 29 release arcs sorted by duration, from 4.1 to 17.3 hours, with v2.5 highlighted at 12.9 hours | Oct 2025 – Jan 2026]
29 RELEASE ARCS  |  298 of 543 HOURS  |  1 PROMPT


"The 13-hour run wasn't an outlier—it was just Tuesday."

The Boring Miracle

If the v2.5 story impressed you, you should be skeptical. One impressive demo proves nothing.

But over 97 days, we ran the exact same pattern 29 times. That's not luck—it's a stable orbit.

The Numbers

29 release arcs
298.5 of 543 total hours
10.3 avg hours/arc
253 agents spawned

The Boring Part

Here's the most boring part of this story: the prompt never changed.

For all 29 release arcs—spanning 298 of the 543 total autonomous hours—the trigger was identical:

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release].

Stable input. Stable system. Stable output.

Duration Distribution

The 29 release arcs ranged from 4.1 hours to 17.3 hours:

Duration | Count | Examples
4-6 hours | 6 | Smaller releases, single-domain work
6-10 hours | 5 | Medium complexity, multiple services
10-14 hours | 14 | Full feature releases like v2.5
14+ hours | 4 | Large multi-system integrations

Why It's Reproducible

The pattern works because the scaffolding is consistent:

  • Process docs as resources: Every orchestrator reads the same playbook
  • Task database: Structured work queue with dependencies
  • Review gates: Quality checkpoints at each stage
  • Restricted agents: Sandboxed execution with clear scope

The orchestrator isn't doing magic. It's following a well-documented process—the same way a good PM would.

The Investment Behind Reproducibility

This scaffolding didn't appear overnight. It evolved over 5 months:

  • ~June 2025: Philosophy codified in ~700 LOC of shell scripts (review gates)
  • Aug 2025: MCP initial commit—philosophy embedded in Go + SQLite
  • Aug-Jan: 525 commits, 2 → 60 tools, single → multi-model

What the 29 release arcs demonstrate is the payoff—a mature system in operation. The scaffolding was built incrementally, one tool at a time.

v2.5 In Context

The v2.5 release (highlighted in teal above) was our 21st release arc. At 12.9 hours, it was actually slightly above average—not exceptional.

What made it a good example for this presentation:

  • Recent enough to have detailed logs
  • Complex enough to show multi-wave orchestration
  • Clean execution with no interruptions

But any of the 29 would tell a similar story.

The Constraint Changed

Not time-limited anymore.
Token-limited.

Before
██░░░░░░░░░░░░░░
↓ wait
░░░░██░░░░░░░░░░
↓ wait
░░░░░░░░██░░░░░░
Serial: One thing at a time
Now
project-mcp   ████████░░
saas-platform       ██████░░░░
web-portal    █████████░
workflow-platform  ███████░░░
api-v2       ████░░░░░░
cloud-finops  ██████░░░░
Parallel: 6 teams running concurrently

543 hours of agent work in 3 months—across 6 parallel projects.
Scaling = more projects. Bottleneck = your bandwidth to steer them.

The Bottleneck Shifted

Era | Bottleneck | Scaling Strategy
Traditional | Human hours | Hire more people
Copilot | Human attention | AI assists, human still bottleneck
Autonomous | API tokens | Run more agents in parallel

Why This Matters

Human time is finite and non-purchasable. API tokens are purchasable. This changes the economics:

  • Old question: "Do I have time to do this?"
  • New question: "Is this worth the tokens?"

In Practice

I hit Claude Code's token limits when trying to scale up. The constraint is no longer my calendar—it's my API budget. That's a good problem to have.

"The goal isn't to make AI faster. It's to remove you from the critical path."

The Math

543 hours agent work ÷ 90 days = 6.0 hours/day
But I worked ~2 hours/day on this project
→ 3× parallelism achieved (limited by token caps)

Organization-Level Output

This isn't just volume—it's multiple engineering teams running in parallel:

Project | Sessions | Domain
project-mcp | 774 | The MCP powering this workflow
saas-platform | 756 | SaaS product
web-portal | 182 | Web platform
workflow-platform | 141 | Legacy polyglot event-driven microservices monorepo
api-v2 | ~200+ | Monorepo SaaS
cloud-finops | 305 | AI-powered FinOps intelligence platform

Each project gets the equivalent of a 5-7 person team. One person + AI scaffolding = output of multiple engineering teams in parallel.

The Numbers

The math that makes
hands-off work.

543 autonomous hours
$500 per month
$2.76 per hour
25-40× leverage
5% of arcs → 48% of autonomous hours

Power law: 29 release arcs (avg 10h each) delivered nearly half of all autonomous work.

Cost Breakdown

Item | Monthly Cost
Claude Max+ (2 accounts, rotating) | $400
OpenAI/Google API (Codex, Gemini) | ~$100
Total | ~$500/month

The Mindset Shift

"I used to try to conserve tokens. That's the wrong mental model. Tokens = work getting done. The more tokens I'm using, the more the AI is doing for me. The flat rate helped me get over that mental barrier."

The Leverage Math

543 hours of autonomous work ÷ ~3 months = ~6 hours/day of Claude working independently.

At $100-150/hour for an engineer, those 543 hours represent roughly $54,000-81,000 of equivalent labor, delivered for about $500/month in subscriptions and API fees.

Note: This is the payoff from 5 months of scaffolding development—525 commits building from shell scripts to 60 MCP tools.

But Is It Slop?

No. Here's why:

Anti-Slop Indicator | Evidence
Verification exists | 4-layer automated pipeline (unit → E2E → Playwright → visual)
It deploys | 165 releases passed CI/CD pipelines
It was planned | 356 design docs (PRDs, specs, architecture)
Infra is real | 1,956 Terraform + CI/CD file modifications
It's documented | 2,891 markdown file modifications
It's been reviewed | 2,974 quality gate checks (WAF pillars, pattern enforcement)
Slop is code-only, test-free, and doesn't ship. This is PRD → Design → Code → Test → Deploy → Document. Full vertical stack.

More importantly: quality is enforced at the $1 phase (design review), not discovered at the $100 phase (production). The 2,974 gate checks aren't overhead—they're the reason this work has value.

The Key Discovery

46% stays interactive.
That's the point.

Value Delivery: 18% of arcs → 68% of hours
Momentum: 37% of arcs → 10% of hours
Steering & Alignment: 46% of arcs → 22% of hours
"The 46% is where decisions get made. That's the work."

The Three Tiers

Tier | Arc Types | % of Arcs | % of Hours | Purpose
Steering & Alignment | interactive, review | 46% | 22% | Human judgment, decisions, quality gates
Momentum | quick, build | 37% | 10% | Routine tasks, keep progress moving
Value Delivery | feature, release, debug | 18% | 68% | Major work, overnight capability

Five Key Lessons from 650 Arcs

  1. Autonomy is a spectrum, not a switch. "Maximize autonomous time" is the wrong goal. "Right autonomy level for right work" is the goal.
  2. Power law: 5% of arcs = 48% of hours. Don't measure success by "% autonomous." Measure by impact of autonomous work.
  3. The burn down pattern is THE unlock. 80.6% of arcs have "release" intent. This single pattern enables overnight autonomous releases.
  4. Interactive work is where value gets directed. Review arcs prove guardrails work. Interactive arcs prove humans stay in control.
  5. Momentum + Power Moves = Velocity. Quick arcs maintain momentum. Release arcs deliver value. The MIX creates sustainable velocity.

Arc Type Breakdown (650 arcs analyzed)

Type | % | Avg Duration | Agents | Description
review | 24.9% | 23 min | 0 | Human-driven code/design review
quick | 22.0% | 5 min | 2.7 | Fast single tasks
interactive | 20.9% | 33 min | 0 | Direct conversation, Q&A
build | 14.5% | 33 min | 4.4 | Test/build cycles
feature | 11.8% | 112 min | 9.4 | Multi-task implementation
release | 4.5% | 618 min | 8.7 | Full release burn down
debug | 1.4% | 118 min | 1.2 | Investigation cycles

The Problem

Chat is not autonomy.

Conversational AI (Today)

  • Human feeds context
  • AI generates code
  • Human copy-pastes
  • Human runs tests
  • Human interprets results

You're the glue code.

Orchestrated Autonomy (Goal)

  • Human sets goal
  • System reads task queue
  • System spawns agents
  • Agents run tools, update state
  • Human reviews completed work

The system is the API.

Two Modes of Interaction

Analysis reveals 42% of prompts are repetitive commands (structured), while 58% are adaptive collaboration (context-specific steering).

Structured Commands (42%)

Pattern | Count | Example
Context compaction | 703 | /compact
Delegation template | 403 | "Please review docs/prompts..."
Confirmations | 376 | "Yes please"
Review tools | 324 | "Run review_code on R70"

Adaptive Collaboration (58%)

The "noise" represents real-time steering: bug investigation, architecture decisions, UI feedback, cross-session coordination. This is the human-in-the-loop providing context templates can't capture.

The Setup

What would you give
a new contractor?

"You wouldn't hand someone a ticket and say 'fix the bug' then disappear. You'd point them to the repo, show them how to run tests, explain the standards, and ask for a PR."

That's the infrastructure this presentation is about.

What You Give a New Contractor

  • Context: Codebase tour, documentation, domain knowledge
  • Tools: How to run tests, deploy, access logs
  • Process: PR review, code standards, definition of done
  • Feedback: Reviews, iteration, course correction

The Parallel

AI needs the same scaffolding. The difference: you can codify it once and reuse it forever. The process docs become MCP resources. The review criteria become automated gates. The feedback becomes tool output.

But There's a Deeper Insight...

This metaphor is useful—but it's still task-oriented. You're thinking about how to guide someone through work.

The real unlock is moving from task delegation to outcome delegation. From micromanaging steps to setting boundaries.

We'll return to this after examining the collaboration spectrum.

The Spectrum

Three levels of AI collaboration.

L1: Conversational Copilot

Text-in, text-out. Human is the API. Most teams are here.

L2: Tool-Using Assistant

AI has tools (file read, shell, search). Still reactive—can't plan or chain.

L3: Orchestrated Autonomy

AI operates within a system. Reads queue, spawns workers, enforces standards.

What Moves from Human to System

Responsibility | L1 | L2 | L3
Code generation | AI | AI | AI
File operations | Human | AI | AI
Running tests | Human | AI | AI
Task planning | Human | Human | AI (validated)
Quality review | Human | Human | System (gates)
Progress tracking | Human | Human | System (state)
Error recovery | Human | Human | AI (respawn)

The leap from L2 to L3: The system handles planning, review, and state—not just execution.

The Shift

Delegate outcomes,
not tasks.

Manager: "Write function X, then call it from Y, then update the tests."

Executive: "Complete Release 5.9."

"The infrastructure isn't instructions. It's boundaries.
Inside those boundaries: freedom."

The Kindergarten Teacher Lesson

In US military leadership training, commanders are sent to observe an elementary school playground. They arrive thinking:

"I am going to control every movement of every kid on that playground."

Of course, this doesn't work. No commander—no matter how skilled—can micromanage 30 children at recess.

But the kindergarten teacher succeeds effortlessly. How?

"Set boundaries. Give freedom within them."
  • The fence defines the playground (not the path each child takes)
  • The rules prohibit certain behaviors (not prescribe every action)
  • The bell signals transitions (not a step-by-step schedule)

This is exactly how L3 autonomy works.

The Trigger Prompt

Here's the actual prompt that launched 29 release arcs totaling 298 hours:

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release].

Notice what it doesn't specify:

  • Which files to edit
  • Which tools to use
  • What order to work in
  • How to solve problems

It only specifies:

  • Where to find the rules: docs/prompts as MCP resources
  • What constraints exist: restricted agents, foreground visibility
  • What the outcome is: burn down the tasks in Release

Task Delegation vs Outcome Delegation

Aspect | Task Delegation (L2) | Outcome Delegation (L3)
You specify | HOW to do each step | WHAT success looks like
Agent decides | Nothing (follows script) | Implementation details
Control via | Instructions (prescriptive) | Boundaries (prohibitive)
Scales to | Minutes of work | Hours of work
Human role | Operator | Executive

Infrastructure as Boundaries

Each component constrains a failure mode:

Component | Not This (Instructions) | But This (Boundaries)
Task Database | "Work on Task 42" | "Here's the queue; claim what's ready"
Real Tools | "Use grep, then sed" | "Here are your tools; choose wisely"
Process Docs | "Follow steps 1-10" | "Here's the process; adapt to context"
Review Gates | "Format code this way" | "Pass review or revise"

Why This Matters

Task delegation doesn't scale. If you have to specify every action, you become the bottleneck. You've just built a voice-controlled IDE.

Outcome delegation scales. You define success once, encode the boundaries, and let agents find the path. The boundaries scale. You don't have to.

This is why the 13-hour release arc was possible. Not because the AI is smart—but because the boundaries were clear.

The Infrastructure

Four things agents need.

1. A Task Database

Machine-readable queue with status, dependencies, blocking rules.

11,956 tracking calls

2. Real Tools

Same access as human engineers: file system, shell, search, deploy.

56,315 Bash calls

3. Process Docs

Tools report position: "step 5/12, next: run tests." Stateful tools, stateless agents.

13h release arcs

4. Review Gates

Automated checks that catch errors. A different model reviews the work.

2,974 review calls

These define the boundaries. Inside them: freedom.

How the Infrastructure Grew

This infrastructure wasn't built in a week. It emerged over 5 months of iterative development:

Component | Started As | Evolved Into
Task Database | Manual task lists | SQLite task queue with status machine, dependencies, auto-transitions
Real Tools | Basic Bash + Read/Write | 60 MCP tools including knowledge graph, analytics, LLM guidance
Process Docs | Informal patterns | Process docs exposed as MCP resources, codified delegation prompts
Review Gates | 2 shell scripts (~700 LOC) | review_code, review_plan, review_design with multi-model validation

Philosophy First, Tools Second

The guiding principles—walking skeleton, TDD, objective verification—were codified in CLAUDE.md before any tooling existed. The shell scripts implemented review gates. The MCP embedded them into persistent tools. The multi-model orchestration refined them further.

What never changed: The philosophy. What evolved: the implementation.

The Key Architectural Choice: LLMs Evaluating LLMs

Review Gates have a specific mechanism: the model that does the work is not the model that reviews it.

  • Claude implements (writes code, creates designs)
  • Gemini evaluates (enforces criteria encoded by the human)
  • Human legislates (sets the criteria once, applied 2,974 times)

This separation of execution and evaluation is what makes the 543 autonomous hours trustworthy. See the Review Gates deep-dive for the full mechanism.

Infrastructure 1

The Task Database

A SQLite queue with status machine, dependencies, and blocking rules.

work_tracking: 6,847 calls
project_planning: 5,109 calls
project_status: 3,634 calls
task_queue: 2,723 calls

Why This Matters

11,956 calls to tracking/planning tools means agents are constantly asking: "What is my goal? What is the current status? How do I report progress?"

Without structured state, the agent is flying blind. With it, the orchestrator knows what to do next.

Implementation

A project database (SQLite via MCP) with tables for tasks, releases, and status. Agents query it constantly, update it as they work, and the orchestrator uses it to decide what to spawn next.
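For illustration, a minimal sketch of such a queue in SQLite follows. The table and column names are assumptions for this sketch, not the actual project-mcp schema:

import sqlite3

conn = sqlite3.connect("project.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS tasks (
    id         TEXT PRIMARY KEY,
    release_id TEXT NOT NULL,
    title      TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'todo',  -- todo | in_progress | review | done | cancelled
    blocked_by TEXT DEFAULT ''                -- comma-separated ids of prerequisite tasks
);
""")

def ready_tasks(release_id):
    # Tasks whose prerequisites are all done: the "claim what's ready" query.
    done = {row[0] for row in conn.execute("SELECT id FROM tasks WHERE status = 'done'")}
    candidates = conn.execute(
        "SELECT id, title, blocked_by FROM tasks WHERE release_id = ? AND status = 'todo'",
        (release_id,),
    ).fetchall()
    return [(task_id, title) for task_id, title, deps in candidates
            if all(dep in done for dep in deps.split(",") if dep)]

The point isn't the schema itself; it's that agents and the orchestrator share one queryable source of truth for "what is ready now."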

Infrastructure 2

60 MCP Tools

Same access as human engineers: file system, shell, search, deploy.

Bash: 56,315 calls
Read: 28,846 calls
Edit: 20,058 calls
Grep: 9,062 calls

The Ratio Tells a Story

Bash (execute) > Read (understand) > Edit (change) > Grep (search). This is the same ratio you'd see from a productive engineer.

Key Insight

"The bulk of work isn't abstract reasoning—it's concrete, small, verifiable actions."

Practical Tip

Don't build new AI-specific tools. Wrap your existing linter, test runner, and deploy script. The AI can use the same commands you use.
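As a sketch of what that wrapping looks like, here is a test runner returned as structured output an agent can parse. The wrapped command (go test) and the output fields are assumptions for illustration, not the project's actual tooling:

import json
import subprocess

def run_tests(package="./..."):
    # Wrap the existing test command and return machine-readable JSON.
    proc = subprocess.run(["go", "test", package], capture_output=True, text=True)
    return json.dumps({
        "command": f"go test {package}",
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        "output_tail": (proc.stdout + proc.stderr)[-2000:],
    })

if __name__ == "__main__":
    print(run_tests())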

Infrastructure 3

Process Docs

Tools report position: "step 5/12, do this next." Stateful tools, stateless agents.

The Orchestration Loop

1. Read task queue
2. Spawn restricted worker agent
3. Worker does task, updates state
4. Worker completes or fails
5. Orchestrator checks queue
6. If tasks remain → goto 2
7. If empty → report completion
70% quick arcs (<15 min)
30% release arcs (hours)
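A minimal sketch of that loop in code, assuming hypothetical helpers for the task-queue query, the restricted-agent spawn, and the wave wait (none of these names come from the real MCP):

def orchestrate(release_id, ready_tasks, spawn_worker, wait_for_wave, wave_size=4):
    # ready_tasks, spawn_worker, and wait_for_wave are hypothetical stand-ins
    # for the MCP task-queue query, restricted-agent spawn, and completion wait.
    while True:
        tasks = ready_tasks(release_id)                            # 1. read task queue
        if not tasks:
            return "release complete"                              # 7. empty: report completion
        wave = [spawn_worker(task) for task in tasks[:wave_size]]  # 2. spawn restricted workers
        wait_for_wave(wave)                                        # 3-4. workers do tasks, update state
        # 5-6. loop: re-check the queue and spawn the next wave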

The "Burn Down" Pattern

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release]. Remind them to
work with pal as a partner and the tools in the project MCP.

Two Arc Types

Type | Duration | Use Case | Example
Quick arcs | < 15 min | Single tasks, fixes | Run tests, add endpoint
Release arcs | 2-13+ hours | Full features, overnight | v2.5: 47 items, 66K tool calls

The workers are composable units. The orchestration period is the true autonomous window—an orchestrator can spawn dozens of workers over hours.

Worker Duration Distribution

< 1 min        72   4.0%  ██
1-2 min       131   7.3%  ███
2-5 min       343  19.0%  █████████
5-10 min      624  34.6%  █████████████████ ← Most common
10-20 min     407  22.5%  ███████████
20-40 min     122   6.8%  ███
40+ min       106   5.9%  ██

Note: The 40+ min workers include cases like agent-ad7836b which ran for 10h 17m during v2.5, handling complex backend implementation autonomously.

The GPS Navigation Pattern

Here's the secret: agents don't memorize the route. Every tool response includes workflow guidance:

{
  "step_number": 5,
  "total_steps": 12,
  "next_step_required": true,
  "required_actions": ["Run tests", "Update documentation"],
  "guidance": "Implementation complete. Verify tests pass before proceeding.",
  "auto_fix_tasks": [...],
  "human_decisions_needed": [...],
  "escalations": [...]
}

Stateful Tools, Stateless Agents. The agent doesn't need to maintain context for 10 hours. It executes a task, gets a result, and the tool tells it what's next. The long-running arc is managed by the orchestrator and stateful tools, not a single, fragile agent context.

Component | Analogy | What It Does
Agent | Driver | Executes current step
Tool Response | GPS Voice | "In 500m, turn left"
Task Queue | Route Plan | All waypoints to destination
Workflow Guidance | GPS Display | "Step 5/12, next: run tests"

This is why one simple prompt can trigger 13 hours of coherent work. The project-mcp is portable scaffolding—same tools, same workflow guidance, applied to any project.

Infrastructure 4

Review Gates

The model that does the work is not the model that reviews it.

Plan → review_plan · Design → review_design · Code → review_code

  • Process docs first: "review the docs/prompts..."
  • Plan review (Gemini): walking skeleton enforced
  • Design review: scope check vs. task
  • Code review: 867 review_code calls
  • State visibility: 6,847 work_tracking calls

The Core Mechanism: Separation of Execution and Evaluation

The model that does the work is not the model that reviews it.

Role | Who | What They Do
Legislator | Human | Encodes judgment in criteria (CLAUDE.md, prompt templates)
Executive | Claude | Implements—writes code, creates designs, executes tasks
Judiciary | Gemini | Evaluates against criteria, enforces standards

This separation prevents: AI reviewing its own work (conflict of interest), human reviewing everything (doesn't scale), or no review (dangerous).

The Triage System

Every review finding gets categorized:

Category | What Happens | Example
AUTO_FIX | Reflex-level response—agent handles automatically | Formatting, unused imports, simple error handling
HUMAN_DECISION | Requires deliberate attention—escalate with options | Algorithm choice, API design trade-offs
ESCALATE | Stop work, require human review | Architecture changes, security concerns
Result: Human reviews 0% of AUTO_FIX, some HUMAN_DECISION, all ESCALATE. The criteria scale. You don't have to.
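In code terms, the routing might look like the sketch below. The finding fields and handler names are assumptions for illustration; in the real system these results are recorded and acted on via the MCP tools:

def route_finding(finding, apply_auto_fix, queue_for_human, halt_and_escalate):
    # finding: {"triage": ..., "confidence": ..., "rationale": ...} per the review template
    category = finding["triage"]
    if category == "AUTO_FIX":
        apply_auto_fix(finding)        # reflex-level: agent fixes it, no human review
    elif category == "HUMAN_DECISION":
        queue_for_human(finding)       # deliberate: surface options, wait for the call
    else:                              # ESCALATE
        halt_and_escalate(finding)     # stop work, hand over full context and recommendations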

The Key Insight

"I encoded my engineering judgment in 360 lines of shell script. Gemini applied it 2,974 times. The criteria scaled. I didn't have to."

The Artifact Pipeline: What Gets Produced and Evaluated

The MCP tooling shapes what agents produce. Then the eval checks both structure and spirit:

Artifact | Structured Fields | Objective Checks | Subjective Checks
Release | Tasks, phases, dependencies | All tasks have IDs? Dependencies valid? | Does this represent a vertical slice?
Task | Acceptance criteria, priority, estimated hours | Has acceptance criteria? Measurable? | Right size? Not over-engineered?
Plan | 7-section structure | All sections present? Success criteria objectively measurable? | Embodies walking skeleton? WAF pillars addressed?
Design | Problem analysis, proposed solution, risk assessment | Schema complete? Alternatives considered? | Scope matches task? Not gold-plating?
Code | Files, functions, test coverage | Tests exist? Builds pass? | Follows project patterns? WAF security pillar?

Key insight: The MCP tooling ensures agents produce structured artifacts with required fields. The eval then verifies both the structure (objective) and the spirit (subjective).

Example: Plan Review with WAF Pillars

When review_plan runs, it applies Google's Well-Architected Framework pillars in priority order:

  1. Operational Excellence — Monitoring, alerting, rollback strategy
  2. Security — Auth approach, data protection, secrets management
  3. Reliability — Error handling, graceful degradation, testing strategy
  4. Performance — Scalability, bottleneck identification
  5. Cost Optimization — Resource efficiency, cost awareness

This is how "embody the spirit of WAF" becomes a concrete, repeatable evaluation—not a vague guideline.
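One way to picture that: the pillar questions are assembled into the review prompt in priority order, so every plan is graded against the same checklist. The exact prompt text review_plan sends to Gemini isn't shown in the logs; this sketch only illustrates the shape of "criteria encoded once, applied many times":

WAF_PILLARS = [
    ("Operational Excellence", "Is there monitoring, alerting, and a rollback strategy?"),
    ("Security", "Are auth, data protection, and secrets management addressed?"),
    ("Reliability", "Are error handling, graceful degradation, and testing covered?"),
    ("Performance", "Are scalability and likely bottlenecks considered?"),
    ("Cost Optimization", "Is resource usage efficient and cost-aware?"),
]

def build_review_prompt(plan_text):
    checks = "\n".join(f"{i}. {name}: {question}"
                       for i, (name, question) in enumerate(WAF_PILLARS, start=1))
    return ("Review this plan against the following pillars, in priority order.\n"
            f"{checks}\n\nPlan:\n{plan_text}")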

Anti-Patterns That Get Auto-Flagged

Anti-Pattern | What Gemini Flags
TDD Micro-management | Tasks named "RED", "GREEN", "REFACTOR" as separate items
Premature Infrastructure | CI/CD before core functionality works
Horizontal Slicing | "Database Layer", "API Layer" instead of vertical features
Integration Assumptions | External API tasks without de-risking spikes
Over-engineering | Design scope exceeds the specific task

Plan Review Criteria

The review tool requires 7 mandatory sections:

  1. Objective
  2. Current State Assessment
  3. Success Criteria (must be objectively measurable)
  4. Technical Design (brief—one paragraph)
  5. Implementation Steps (each with objective verification)
  6. Monitoring and Alerting
  7. Security Considerations

The Economics

Writing code is cheap.
Owning it is expensive.

20%: Initial Development
80%: Lifecycle Cost

2,974 quality gates aren't bureaucracy.
They're aggressive asset protection.

The 1-10-100 Rule (Boehm's Law)

The cost to fix a defect rises exponentially the later you find it:

Design (review_plan, review_design): $1
Development (review_code): $10
Production (Ops/Support): $100

The Fast Code Trap

Scenario: AI generates 10,000 lines of code in 1 hour.

  • Without gates: You just acquired 10,000 lines of unverified liability. You're in debt.
  • With gates: The system rejects what doesn't pass. You keep what works. The debt never accrues.

AI makes code generation nearly free. That's precisely why quality gates become more important, not less.

Reframing the 2,974 Invocations

Gate Type | Count | Value
review_design | 1,687 | High Leverage: Stopped bad ideas before code was written.
review_code | 867 | Debt Prevention: Enforced patterns, prevented "spaghetti".
review_plan | 420 | Alignment: Ensured we built the right thing.

What the Review Agent Examines

The review_code tool isn't a simple linter. It's a fully agentic process with access to:

  • Task requirements: What was this code supposed to accomplish?
  • Design documents: What was the approved approach?
  • Actual artifacts: The code, tests, and infrastructure files
  • WAF pillars: Security, reliability, operational excellence, performance, cost

It verifies that implementation matches design, code meets task requirements, and all artifacts align. This isn't validation—it's verification.

"The most valuable work the AI did wasn't the code it wrote.
It was the 2,000+ times it told me 'No'."

Smart Triage

Decisions get captured.

Every time you make a call, the system records it. Next time: fewer questions.

  • AUTO_FIX: Reflex—agent handles (formatting, unused vars)
  • HUMAN_DECISION: Deliberate—you choose (design trade-offs)
  • ESCALATE: Full context + recs (architecture changes)
"Simple stuff gets fixed automatically. Hard stuff gets escalated with context."

The Accumulation Effect

Every human decision gets captured to a per-project Knowledge Base. Next time a similar question arises, the system checks existing decisions before asking:

// Agent encounters design trade-off
search_decisions(query: "error handling style")
→ Found: "Use Result types, not exceptions" (decision: abc-123)
// Agent applies consistent pattern automatically

Why This Matters

Concern | How the System Answers It
"How do I trust it?" | Transparent triage with confidence scores + rationale
"Won't I repeat myself?" | KB remembers decisions per-project
"How does it scale?" | Each decision increases future autonomy

Triage in Action

From review_code prompt template:

**Triage Category Guidelines:**
- AUTO_FIX: Simple, mechanical fixes that don't change
  logic or architecture
- HUMAN_DECISION: Issues requiring design choices,
  trade-offs, or architectural decisions
- ESCALATE: Major issues requiring significant
  refactoring or cross-team coordination

**Confidence**: Rate your confidence (0.0-1.0)
**Rationale**: Why this triage category was chosen

The Compound Effect

As the Knowledge Base grows:

  • Month 1: Agent asks about error handling style
  • Month 2: Agent finds prior decision, applies consistently
  • Month 6: New patterns emerge from 50+ captured decisions
  • Month 12: Agent "knows" your project's philosophy

This is institutional knowledge that survives team changes, onboards new agents, and compounds over time.

Leverage

5% steering.
100% implementation.

Human expertise flows into the system in two ways.

Encoded (Passive)

Review criteria, CLAUDE.md, process docs. Applied 2,974 times by evals.

Direct (Active)

Occasional interventions. Short messages that redirect entire arcs.

"One sentence about bounded contexts saved 40 hours of wrong implementation."

Real Examples from Chat Logs

These are actual human interventions from the MCP development logs:

1. Enforcing Domain Boundaries

"Why does the review-service need any DB packages at all? It just uses tools to pull what it needs, it should not do direct DB access because that is not part of its bounded context."

Impact: Prevented tight coupling between services. The review-service became a stateless orchestrator that only uses tool interfaces.

2. Mandating DDD Architecture

"Please examine the knowledge graph functionality... work with pal to come up with the DDD plan and relevant bounded contexts and then create a design for this new, extracted service."

Impact: Kicked off proper architectural thinking. The AI identified 4 bounded contexts in 6,200 lines of code and designed a hexagonal architecture with proper domain layer separation.

3. Correcting Storage Architecture

"Why are we putting this into Firestore instead of BQ?"

Impact: 8-word question that redirected the entire storage strategy. AI had defaulted to Firestore for graph data; corrected to BigQuery for analytical workloads.

4. Diagnosing Multi-Tenancy Flaw

"The deeper issue is that we are supposed to have a dataset per tenant. Right now, it seems like the schema is very wrong, with a dataset per microservice."

Impact: Human diagnosed the root cause of tenant isolation issues. Fundamentally shifted the infrastructure architecture from service-per-dataset to tenant-per-dataset.

The Amplification Effect

Intervention | Human Time | Implementation Time Saved/Redirected
Bounded context correction | ~30 seconds | ~4 hours of wrong coupling
DDD architecture mandate | ~2 minutes | ~20 hours of design + implementation
Storage architecture redirect | ~10 seconds | ~8 hours of wrong storage layer
Multi-tenancy diagnosis | ~1 minute | ~10 hours of debugging

Total: ~4 minutes of human expertise steering ~42 hours of implementation.

The Two Loops of Expertise

Human expertise enters the system through two complementary mechanisms:

  1. Encode Once, Apply Forever: Review criteria, process docs, CLAUDE.md. Your judgment becomes automated evaluation. (Review Gates)
  2. Intervene Occasionally, Redirect Completely: Short messages at key moments. Human pattern recognition catching AI drift.

The first loop scales without limit. The second loop ensures the first loop stays calibrated.

What Makes Good Interventions

Pattern | Example
Socratic questions | "Why are we putting this in Firestore?"
Architectural principles | "not part of its bounded context"
Root cause identification | "The deeper issue is..."
Quality standards | "enterprise-grade domain driven design... elegant"

Notice: None of these tell the AI how to implement. They tell it what principles to apply.

Real Example

The 6h 42m agent.

It came back with news I could act on.

The task

Create smoke test for undo feature. Test failed: unknown message type: <nil>

What it reported back

"The undo code is in the repo but not deployed here. I can't test a feature that isn't running. Deploy it first, then I'll rerun the test."

No invented solution. Just: "here's the wall, here's why I stopped."

Why This Is Success

The alternative failure modes are worse:

  • Hallucinate a fix: Agent invents code changes that "should" work
  • Retry endlessly: Agent keeps running the same failing test
  • Silent failure: Agent marks task complete without verification

This agent did the right thing: investigated, found the real blocker (deployment gap), and reported back with actionable information.

The 413 Messages

This agent sent 413 messages over 6h 42m. Breakdown:

Tool | Count | Purpose
Bash | 84 | Running tests, git ops, deployments
Read | 25 | Understanding existing code
Grep | 16 | Finding implementations
Edit | 13 | Creating/modifying test script
Glob | 6 | Finding files
TodoWrite | 5 | Tracking progress
MCP tools | 4 | Design docs, project status

The Takeaway

You can walk away because when it hits a wall, it tells you which wall and why. Whether the agent finishes the job or stops at a blocker, you get actionable information. What's unacceptable is silent garbage or invented workarounds.

The Philosophy Shift

Brooks said throw one away.
Now you can afford to.

Brooks (1975)

"Plan to throw one away; you will anyway."

Now (2025)

V1 is the question. V2 is the answer.

18,866 lines deleted in one cutover
16,424 "v2" mentions across project

The Rough Draft Method

  1. Write it wrong on purpose
  2. Find the seams from running code
  3. Regenerate in hours

The first version's job is to be wrong in useful ways.

The Economics Inversion

Resource | Traditional | AI-Assisted
Code production | Expensive (human-months) | Cheap (tokens)
Human attention | Available | The bottleneck
Throwaway code | Waste | Investment in understanding
"We wrote 18,866 lines specifically to learn why we shouldn't keep them."

Implementation as Inquiry

If implementation costs tokens (cheap) instead of human-months (expensive), building V1 becomes the optimal requirements gathering technique.

  • English specs: Low-fidelity, ambiguous
  • Running code: High-fidelity, binary (works or doesn't)
  • V1: Executable speculation that exposes unknown unknowns

You're not building to ship. You're building to discover what you should build.

The Phoenix Architecture

A mindset for AI-assisted development:

Phase | Purpose | Outcome
V1 (Probe) | Let AI build the feature | Discover where boundaries should be
Audit | Read the code for structure, not syntax | Identify coupling, duplication, gaps
V2 (Structure) | Regenerate from scratch with lessons | Clean implementation with proper interfaces

Key insight: Don't refactor V1. Delete and regenerate. The cost of fixing AI's "first draft" assumptions often exceeds the cost of a clean V2.

What to Preserve vs. Throw Away

Preserve (Human Attention Artifacts) | Throw Away (Token Artifacts)
Interface contracts, type definitions | Implementation details
Design decisions and rationale | Code that no longer fits
Test suites (the "truth" of the system) | First-draft architectures
Migration learnings | Dead code (causes "context pollution")

Evidence from the Logs

The KGS (Knowledge Graph Service) cutover:

  • V1: Embedded SQLite-based knowledge graph with "rigid query syntax"
  • V2: Separate service with "always-helpful natural language interface"
  • Decision: "Keep as dead code? Rejected. Adds maintenance burden, confuses developers, bloats binary."
  • Action: Delete 41 Go files (~18,866 lines), clean break

The New Engineering Heuristics

  • Build the simplest thing that could teach you something
  • Throw away before you polish
  • Version number = learning iteration count
  • "Working code" is a checkpoint, not a destination
  • If you can't explain a module's responsibility in one sentence, delete and regenerate

The Mindset Shift

"The cost of code is going to zero (in dollars, not time), so I have no ego around just throwing entire systems away once I know how to build things."

This isn't recklessness—it's disciplined iteration. The philosophy and architecture remain constant; only the implementation is fluid.

What Actually Happened

Four months of breaking things.

Before Oct

700 Lines of Hard-Won Bash

Scripts encoding every mistake I kept making. Review gates because I couldn't trust myself.

Oct 2-27

Tools That Remember

MCP let me stop repeating instructions. Zen became my rubber duck. Agents got leashes.

Oct 28 - Nov 25

Throwing Things at the Wall

Design skill, screenshots, Codex, Chrome DevTools. Half of it stuck.

Nov 26+

Models Talking to Models

Clink pipes Claude to other CLIs. I watch the work happen. This is when it got weird.

Era 0: The Shell Script Foundation

Before any MCP infrastructure existed, the core philosophy was codified in two shell scripts:

Script | LOC | What It Did
review-plan.sh | ~360 | Called Gemini to review plans against 7-section template, anti-patterns, walking skeleton methodology
review-artifact.sh | ~355 | Targeted artifact review with plan context, compliance checking

Key insight: The philosophy (TDD, walking skeleton, review gates) predated all tooling. The shell scripts were the MVP implementation. When the MCP was built, it embedded this same philosophy into persistent, stateful tools.

The First "Evals": LLMs Evaluating LLMs

These shell scripts were the first implementation of a key architectural pattern: the model that does the work is not the model that reviews it.

  • Era 0: Claude implements → Gemini reviews (via shell scripts calling Gemini API)
  • Era 1-3: Same pattern, but with triage (AUTO_FIX, HUMAN_DECISION, ESCALATE)
  • Today: Even the evaluation prompts are evaluated by LLMs (meta-level evals)

This separation of execution and evaluation is what enables trusted autonomy at scale.

Evolution: Scripts → 60 MCP Tools

Era | Tooling | State | Models
Era 0 | 2 bash scripts | Stateless | Single Gemini
Era 1-3 | 60 MCP tools | SQLite persistence | Multi-model (Gemini, OpenAI, Claude)

525 commits over 5 months transformed the scaffolding. What stayed constant: the philosophy.

Key Milestones

Date | Milestone | Impact
~June | Shell scripts written | Review gates implemented as bash + Gemini
Aug 2 | MCP initial commit | Philosophy embedded in Go + SQLite
Oct 2 | Log window begins | Project MCP, zen, agent delegation
Oct 28 | Design skill introduced | Screenshot-based UI iteration
Nov 8 | Codex experiments begin | Multi-model debugging discovered
Nov 26 | Clink introduced | Agentic CLI delegation (key unlock)
Dec 13 | Foreground subagents | Real-time visibility, better coordination

The "Stuck → Codex" Pattern

When Claude gets stuck debugging:

  1. Recognize stuckness ("issue persists", "same error")
  2. Escalate: "use clink to ask codex to investigate"
  3. Brief: "give it the full context of what you've tried"
  4. Codex identifies root cause (often in one shot)

Why it works: Model diversity breaks reasoning loops.

Where This Started

700 lines of bash.

1. Codify a Task

Write the steps as a checklist. Your first machine-readable process doc.

2. Expose One Tool

Wrap an existing command with structured output. Your linter, test runner, or deploy script.

3. Build a Linter for Logic

Write a validator for the AI's output, not just a better prompt.

Two Scripts

The 60-tool system that produced 543 autonomous hours began here:

  • review-plan.sh — Gemini API call to validate plans against a checklist
  • review-artifact.sh — Same pattern for code review

~700 lines of bash. No MCP. No SQLite. Just philosophy encoded in prompts. Everything else grew from there.

Step 1: Codify a Task

## Task: Add a dependency
1. Check if dependency already exists
2. Run `bun add <package>`
3. Verify lockfile updated
4. Run tests to catch breaking changes
5. Commit with message: "deps: add <package>"

Step 2: Expose One Tool

#!/bin/bash
# review-lint.sh: wrap an existing linter so its output is JSON an agent can parse.
# eslint exits non-zero when it finds problems but still prints JSON findings,
# so only fall back to the error object when it produced no output at all.
output=$(eslint . --format json 2>/dev/null)
[ -n "$output" ] && echo "$output" || echo '{"error": "lint failed"}'

Step 3: Build a Linter for Logic

#!/bin/bash
# validate-plan.sh <plan-file>: reject the plan unless every required section appears
for section in "Rollback" "Verification" "Security"; do
  grep -q "$section" "$1" || { echo "REJECTED: Missing $section"; exit 1; }
done
echo "APPROVED"

Two-Week Pilot

Week 1: Foundation

  • Choose one recurring task type
  • Write the process doc
  • Wrap 2-3 existing tools
  • Run AI through manually
  • Note where it goes wrong

Week 2: Validation

  • Build one validator script
  • Add a review gate
  • Run 5 tasks through
  • Measure success rate
  • Document what worked

The Split

5% triggers 48%.

Launch Commands

5% of interactions → 48% autonomous work. One sentence, hours of execution.

Steering Time

46% stays interactive. Reviews, decisions, redirects. That's where judgment lives.

"I decide what gets built. The system handles execution, iteration, testing.
The split happens at the right seam."

PRD → Design → Code → Test → Deploy → Document.
165 releases. Six projects. One person.

Why 46% Stayed Manual

The surprising part isn't that agents run 13 hours unsupervised. It's that 46% stays deliberate. That's the design working.

  • Review arcs (25%): Quality gates working as intended
  • Interactive arcs (21%): Deliberate human attention for steering, context, decisions

The system handles reflex-level work automatically. Deliberate work rises to conscious attention. That's the point.

30-Second Redirects

Short interventions that saved hours:

  • "Why does the review-service need any DB packages?" — 30 seconds → saved 4 hours of wrong coupling
  • "The deeper issue is dataset per tenant" — 1 minute → saved 10 hours of debugging
  • "Why Firestore instead of BQ?" — 10 seconds → redirected 8 hours of storage work

This is the 5% that makes the 48% possible. Human pattern recognition catching AI drift, expressed as architectural principles rather than implementation details.

What Made Hands-Off Possible

Enabler | Not This
Task from a queue | Vague goal
Process docs read first | Improvisation
Review gates to pass | Optional checks
State to update | Black box
Standards enforced | YOLO mode
Tools report position | Agent tracks state

The last row is the hidden unlock: stateful tools, stateless agents. Every tool response includes "step 5/12, do this next." The agent doesn't need to remember where it is—the tools tell it.

What Prevents Runaway Agents?

Risk | Constraint
Agent pursues tangent | Task-scoped work from queue
Agent makes breaking changes | review_code gate before completion
Agent can't find answer | Escalation to thinking partner
Agent doesn't report status | Mandatory work_tracking updates
Agent accesses wrong systems | Restricted permission level
Agent's approach is wrong | review_plan gate before implementation

The Operating Principle

"Structure enables autonomy. The 46% interactive is the control surface. The 54% runs because the boundaries are clear."

The Disposable Code Mindset

Fred Brooks said "Plan to throw one away." With AI, this becomes optimal strategy:

  • Code is ephemeral; architecture is enduring
  • V1 isn't failure—it's "executable speculation" that reveals requirements
  • 18,866 lines deleted in one cutover wasn't waste—it was tuition
  • Don't refactor AI's first draft; delete and regenerate with lessons learned

The cost of code is going to zero (in dollars). What remains valuable: interfaces, decisions, tests, and architecture.

The Economic Return on Quality

AI makes the cheap part (code generation) nearly free. But 80% of software cost is maintenance—debugging, rework, technical debt. That's where quality gates pay off:

  • review_design (1,687 calls): Bad ideas rejected at $1 (design phase)
  • review_code (867 calls): Defects caught at $10 (development)
  • review_plan (420 calls): Wrong approaches stopped before implementation

The 2,974 quality gate invocations aren't overhead—they're the mechanism that prevents $100 problems (production bugs, rework, debt). Speed without quality creates liability. Speed with quality creates value.

Playground Supervision

Kindergarten teachers don't control every movement. They set boundaries and give freedom within them.

  • The fence = infrastructure (task queue, tools, docs, gates)
  • The rules = review gates that block bad output
  • The bell = task state transitions
  • The kids = autonomous agents

Leadership shifts from micromanaging tasks to defining outcomes and constraints.

Scaling Further: From Teams to Organization

Once you're running multiple projects in parallel, a new question emerges: how do you add more?

Bottleneck | Solution
Token limits per project | Separate API accounts, budget allocation
Human attention (5% steering) | Portfolio-style time-boxing, async check-ins
MCP per project overhead | Shared tooling, templated scaffolding
Context switching cost | Consistent patterns across projects
Knowledge silos | Cross-project knowledge base, shared decisions

The new constraint: You can't steer infinite projects simultaneously. The 5% expertise amplification that enables each project still requires human attention. Scaling means optimizing attention allocation, not eliminating it.

The unlock: Shared infrastructure. When the components are templated, adding a new project means copying the boundaries—not reinventing them.

Appendix

"Does it produce
usable code?"

It ships to production. Here's what that looks like.

  1. It ships to production. 543 hours → deployed releases.
  2. Review gates enforce quality. 867 review_code calls.
  3. Same standards as humans. Gemini enforces my philosophy.
  4. 20,058 Edit calls. Iterative refinement, not one-shot.

Wrong Question

"Does AI produce usable code?" isn't the right question. "Does your system ensure usable code?" is. The infrastructure is the answer.

Engineering Department Output (98 days)

This isn't just code volume. It's the artifact diversity of a full engineering organization:

Role | Artifact Type | File Modifications
Backend + Frontend Dev | Application Code (.go, .ts, .js) | 64,867
QA Engineer | Test Files (*_test.go, *.test.ts) | 3,038
DevOps/SRE | Terraform, CI/CD, Dockerfiles | 1,956
Tech Writer | Documentation (.md) | 2,891
Architect | API Schemas, Protos, SQL | 504
UI Developer | HTML, CSS, SCSS | 569

Total: 90,356 file modifications across 6 active projects.

Automated Verification Layers

Quality isn't one metric. It's a fully automated verification pipeline:

Layer | What It Checks | How
Unit Tests | Code paths work | 59% coverage across 41 Go packages
E2E Tests | Deployed services work | Tests run against real infrastructure
Playwright | Browser flows work | Automated UI interaction tests
Visual Verification | UI matches design | Gemini compares screenshots to mocks
All automated. All run by agents. The 165 releases passed through this entire pipeline.
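A sketch of what chaining those layers looks like: each layer must pass before the next runs, and the first failure stops the release. The per-layer commands are placeholders (only go test and npx playwright test correspond to real CLIs; the e2e target and visual-diff script are assumptions):

import subprocess

LAYERS = [
    ("unit tests", ["go", "test", "./..."]),
    ("e2e tests", ["make", "e2e"]),                          # placeholder target
    ("playwright", ["npx", "playwright", "test"]),
    ("visual verification", ["python", "visual_diff.py"]),   # placeholder script
]

def run_pipeline():
    for name, cmd in LAYERS:
        if subprocess.run(cmd).returncode != 0:
            return f"FAILED at {name}"    # gate: stop at the first failing layer
    return "all layers passed"

if __name__ == "__main__":
    print(run_pipeline())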

Anti-Slop Indicators

Indicator | Slop | This System
Verification | None | 4-layer automated pipeline (unit → E2E → Playwright → visual)
Deployment | Doesn't ship | 165 releases passed CI/CD
Planning | No design docs | 356 PRDs/specs created
Infrastructure | No infra | 1,956 Terraform/CI changes
Documentation | Missing | 2,891 markdown files

Sample Design Documents

Real planning artifacts produced during the study period:

  • DESIGN-PKCE-ENDPOINTS.md — OAuth PKCE implementation spec
  • PRD-PROJECT-MANAGEMENT-MCP.md — Full product requirements doc
  • TASK-913-EXPORT-SERVICE.md — Service design for export feature
  • agent-supervision-framework.md — Agent orchestration architecture
  • release-management.md — Release workflow documentation

This is the work of a product team, not a code generator.

Appendix

"What does this cost?"

$500 per month
$3.77 per autonomous hour
$17 per day
"Tokens = work getting done.
The more tokens, the more the AI is doing for me."

Appendix

The data source

Every statistic comes from actual Claude Code chat logs.
No surveys. No estimates. Parsed JSON from 97 days.

74 main sessions
2,314 subagent sessions
650 arcs analyzed
97 days of data

Data Source

Claude Code stores full conversation transcripts in ~/.claude/projects/*/ as JSONL files. Each line is a timestamped message with type, content, and tool calls.

  • Main sessions: UUID.jsonl files (can exceed 1GB for long-running sessions)
  • Agent sessions: agent-*.jsonl files (1-50MB per agent)
  • Message format: {"type": "human"|"assistant"|"tool_use", "message": {...}, "timestamp": "..."}

Arc Detection Algorithm

An arc is a period of autonomous work initiated by a single user prompt. We built a Python tool (arc_analyzer.py) that:

  1. Parses JSONL files line-by-line (streaming for memory efficiency)
  2. Detects arc starts via regex patterns:
    • burn\s+down.*(?:release|R\d+|tasks) — Release burn down
    • spawn.*(?:restricted|foreground).*agent — Explicit delegation
    • please.*review.*docs\/prompts.*spawn — Full delegation template
  3. Tracks agents via Task tool calls and their completions
  4. Detects arc ends: next substantive human prompt or completion message
  5. Classifies arcs by duration and agent count into 7 types
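A sketch of the detection step (step 2), streaming a transcript and flagging arc-start prompts. The regexes are the ones listed above; the record fields follow the message format described under Data Source, while the function name is illustrative rather than the actual arc_analyzer.py code:

import json
import re

ARC_START_PATTERNS = [
    re.compile(r"burn\s+down.*(?:release|R\d+|tasks)", re.IGNORECASE),
    re.compile(r"spawn.*(?:restricted|foreground).*agent", re.IGNORECASE),
    re.compile(r"please.*review.*docs\/prompts.*spawn", re.IGNORECASE),
]

def iter_arc_starts(jsonl_path):
    # Stream line-by-line for memory efficiency; yield (timestamp, prompt) for
    # each human message that matches an arc-start pattern.
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("type") != "human":
                continue
            text = str(record.get("message", ""))
            if any(p.search(text) for p in ARC_START_PATTERNS):
                yield record.get("timestamp"), text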

Classification Logic

def classify_arc(duration_min, agents_spawned, prompt):
    """Classify an arc from its duration (minutes), agent count, and prompt text."""
    if agents_spawned == 0:
        if duration_min < 15 and "review" in prompt:
            return "review"
        return "interactive"
    if duration_min < 15:
        return "quick"
    if duration_min < 60:
        return "build"
    if "debug" in prompt:
        return "debug"
    if duration_min < 240:
        return "feature"
    return "release"

Verification Methods

Claim | Verification
650 arcs detected | arc_analyzer.py stats command output
29 release arcs | arc_analyzer.py list --type release
5% → 48% power law | 29/650 = 4.5%, 299 hrs / ~620 total hrs = 48%
v2.5 example | Manual inspection of session UUID 9081bc23-*
543 autonomous hours | Sum of subagent session durations from timestamps

What This ISN'T

  • NOT an estimate: Exact counts from log files
  • NOT a survey: Machine-parsed from actual conversations
  • NOT cherry-picked: ALL 650 arcs from the 97-day period are included
  • NOT hallucinated: Methodology is documented; same tools can analyze your own logs

Analyze Your Own Logs

The same methodology can be applied to your own Claude Code logs at ~/.claude/projects/:

# Extract arcs from your sessions
python arc_analyzer.py extract

# Show arc statistics
python arc_analyzer.py stats

# List release arcs only
python arc_analyzer.py list --type release

# Generate full report
python arc_analyzer.py report

The tool and methodology are fully documented. The numbers hold up to scrutiny because they're measured from real logs, not modeled or estimated.