97 Days of Logs
543 hours of work I didn't do myself
One person. Six concurrent projects. 165 shipped releases. The infrastructure that made it possible—and the receipts to prove it.
This presentation has three viewing modes. Press D to cycle between them:
| Mode | Purpose | Best For |
|---|---|---|
| Presentation | Full-screen slides with key insights | Talks, quick overview |
| Detail | Slide + expanded analysis below | Deep reading, exploration |
| Document | Scrollable long-form with all content | Reference, printing |
The slides tell the story. The detail sections provide evidence, methodology, and nuance. You're currently in Detail or Document mode—that's why you can see this text.
This document answers one question: How does a user get AI coding assistants to work autonomously over long periods to produce complex deliverables at an acceptable level of quality?
The answer comes from a forensic examination of chat logs, annotated with notes on why the user made each decision—97 days of real logs from one developer running six concurrent projects with Claude Code.
| Section | What It Covers |
|---|---|
| The Hook | A single prompt that triggered 13 hours of autonomous work |
| The Numbers | 543 hours, roughly $500/month, and what the math actually looks like |
| The Pyramid | Where human time goes when AI handles execution |
| The Infrastructure | Four pillars that make long-range autonomy possible |
| The Economics | Why quality gates pay for themselves |
| Start Monday | Concrete steps to begin building your own scaffolding |
| Appendices | Methodology, cost breakdown, code quality evidence |
With the right infrastructure, one person can produce the output of multiple engineering teams in parallel—not by working harder, but by building scaffolding that lets AI magnify human strengths.
The rest of this document shows exactly how it works, with receipts.
The Data
Not a team. Not a survey. Parsed directly from 97 days of Claude Code sessions—six concurrent projects, 2,314 agent sessions, all verifiable.
14,926 prompts · 2,314 agent sessions · 543 autonomous hours
When you see aggregate statistics—thousands of prompts, hundreds of hours—the natural assumption is "a team did this." That assumption makes the data feel distant, organizational, not personally achievable.
The truth: this is what happens when you build the right infrastructure—and let it run.
Traditional scaling requires hiring. You need more people to do more work. The constraint is headcount, budget, coordination overhead.
With the right AI scaffolding, one person can:
This output isn't the result of working 16-hour days. It's the result of:
The rest of this presentation shows exactly how it works—and how you can replicate it.
The Hook
Click to reveal
"I typed one sentence. Went to bed. Woke up to a deployable release."
This isn't a replication of the legacy process, with agents spawned to play the roles we traditionally assign to humans. The orchestrator executed a well-structured plan—reading dependency chains from the database, sequencing work accordingly, monitoring progress, and adapting.
That 13-hour autonomous run was part of a larger workflow:
| Phase | Mode | What Happens |
|---|---|---|
| 1. Exploration | Interactive | Human + AI discover what to build. Back-and-forth discussion, research, prototyping ideas. |
| 2. Planning | Largely autonomous | AI creates tasks, captures dependency chains, structures the release. Gemini validates via review_plan. |
| 3. Implementation | Fully autonomous | Orchestrator reads the plan from the database and executes it. This is the "13 hours" part. |
| 4. Review | Autonomous loop | Gemini runs review_code on output. Agent fixes issues until it passes. 867 review calls total. |
Key insight: The orchestrator didn't "figure out" dependencies—it read dependency chains that were captured during planning. The leverage is in the setup, not just the execution.
After receiving the prompt, the orchestrator's first action was to analyze the work:
"Current v2.5 Status: 60 total tasks (13 cancelled). 40 tasks need design (must go through design_documents → review_design before implementation). 7 tasks in todo (can be claimed directly)."
It identified that most tasks were blocked by dependencies. Its strategic response:
After each wave completed, the orchestrator checked status and decided what to do next:
| Time | Progress | Orchestrator Decision | Agents |
|---|---|---|---|
| +0.0h | 0% | Analyze dependencies → spawn design + frontend agents | 2 |
| +0.6h | 18% | Design done → spawn implementation agents for Phases 1-3 | 3 |
| +11.0h | 58% | Core complete → spawn agents for remaining work | 3 |
| +12.3h | 68% | Almost done → spawn final frontend + test agents | 2 |
| +12.9h | 100% | "v2.5: Event Processing Pipeline - COMPLETE!" | — |
The orchestrator did the work of a project manager executing a plan:
The heavy lifting happened earlier: during exploration (figuring out what to build) and planning (capturing the structure and acceptance criteria). The autonomous execution during implementation is return on that up-front investment.
Analysis of the actual session logs shows what those 10 agents did:
| Metric | Count | What It Means |
|---|---|---|
| Shell commands | 21,464 | Tests run, builds triggered, git ops |
| File reads | 8,474 | Understanding existing code |
| Code edits | 6,670 | Modifications to source files |
| Files created | 983 | New source files written |
| Source files touched | 1,212 | .go, .ts, .tf, .proto, etc. |
| Total tool calls | 66,325 | Autonomous actions taken |
Note: These numbers are from parsing the actual JSONL chat logs, not estimates.
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release v2.5].
That's it. 47 words. But this prompt only works because of what it points to.
The key phrase is "review the docs/prompts exposed as resources." Those resources define the entire workflow:
| Resource | What It Teaches the AI |
|---|---|
| `development-workflow-sequence` | Complete methodology: walking skeleton approach, 4-phase workflow, quality gates |
| `determine-needed-agents` | Decision matrix: when to spawn agents vs. do work yourself, capacity guidelines |
| `escalation-decision-matrix` | 5-level escalation framework with specific triggers for each level |
| `coordinator-troubleshooting-guide` | Real-world patterns: "agent claims are 50% accurate—always verify with database" |
These aren't vague guidelines—they're operational playbooks with decision trees, copy-paste commands, and hard-won lessons from production use. The AI reads them, internalizes the process, then executes it.
The workflow lives in the resources, not in the prompt.
The resources teach the process. The database provides the data:
SELECT * FROM tasks WHERE release_id = 'v2.5' AND status = 'ready' AND all_dependencies_complete = true
Resources define how to work. The database defines what to work on. Together, they enable a 47-word prompt to trigger 13 hours of coherent execution.
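For illustration, the wave pattern can be sketched as a small loop over that query. This is a minimal sketch: `ready_tasks` and `spawn_agent` are hypothetical helpers over the task database, and in practice the orchestrator is a Claude session following the MCP resources, not a script.

def run_release(ready_tasks, spawn_agent, wave_size: int = 3) -> None:
    """Spawn agents in waves until no ready work remains (hypothetical helpers)."""
    while True:
        batch = ready_tasks()[:wave_size]   # only tasks whose dependencies are complete
        if not batch:
            break                           # release finished, or blocked on a human decision
        agents = [spawn_agent(task) for task in batch]
        for agent in agents:
            agent.wait()                    # status check between waves, as in the v2.5 timeline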
| Term | Definition |
|---|---|
| Arc | A period of autonomous work triggered by one user prompt. Can last minutes (quick arc) or hours (release arc). |
| Orchestrator | The main AI session that manages the workflow—analyzing tasks, spawning agents, monitoring progress. |
| Work item | A discrete unit of work tracked in the project database—a feature, bug fix, test, or refactor. |
| Wave | A batch of agents spawned to work in parallel, followed by a status check before the next wave. |
The Question
Most AI adoption stops at "type a prompt, get a response." That's using a power tool as a hammer. What happens when you build around it instead?
"The gains don't come from better prompts. They come from different infrastructure."
Most people use AI in one mode:
The second mode is different:
You can't just "tell the AI to do more." You need infrastructure:
The rest of this presentation shows what that infrastructure looks like—and how it produces 543 hours of work from one person.
Ivan Zhao's essay "Steam, Steel, and Infinite Minds" (December 2025) explores similar themes—how AI changes knowledge work at an organizational level.
The Pattern
Click to reveal the pattern
"The 13-hour run wasn't an outlier—it was just Tuesday."
If the v2.5 story impressed you, you should be skeptical. One impressive demo proves nothing.
But over 97 days, we ran the exact same pattern 29 times. That's not luck—it's a stable orbit.
Here's the most boring part of this story: the prompt never changed.
For all 29 release arcs—spanning 298 of the 543 total autonomous hours—the trigger was identical:
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release].
Stable input. Stable system. Stable output.
The 29 release arcs ranged from 4.1 hours to 17.3 hours:
| Duration | Count | Examples |
|---|---|---|
| 4-6 hours | 6 | Smaller releases, single-domain work |
| 6-10 hours | 5 | Medium complexity, multiple services |
| 10-14 hours | 14 | Full feature releases like v2.5 |
| 14+ hours | 4 | Large multi-system integrations |
The pattern works because the scaffolding is consistent:
The orchestrator isn't doing magic. It's following a well-documented process—the same way a good PM would.
This scaffolding didn't appear overnight. It evolved over 5 months:
What the 29 release arcs demonstrate is the payoff—a mature system in operation. The scaffolding was built incrementally, one tool at a time.
The v2.5 release (highlighted in teal above) was our 21st release arc. At 12.9 hours, it was actually slightly above average—not exceptional.
What made it a good example for this presentation:
But any of the 29 would tell a similar story.
The Constraint Changed
543 hours of agent work in 3 months—across 6 parallel projects.
Scaling = more projects. Bottleneck = your bandwidth to steer them.
| Era | Bottleneck | Scaling Strategy |
|---|---|---|
| Traditional | Human hours | Hire more people |
| Copilot | Human attention | AI assists, human still bottleneck |
| Autonomous | API tokens | Run more agents in parallel |
Human time is finite and non-purchasable. API tokens are purchasable. This changes the economics:
I hit Claude Code's token limits when trying to scale up. The constraint is no longer my calendar—it's my API budget. That's a good problem to have.
"The goal isn't to make AI faster. It's to remove you from the critical path."
This isn't just volume—it's multiple engineering teams running in parallel:
| Project | Sessions | Domain |
|---|---|---|
| project-mcp | 774 | The MCP powering this workflow |
| saas-platform | 756 | SaaS product |
| web-portal | 182 | Web platform |
| workflow-platform | 141 | Legacy polyglot event-driven microservices monorepo |
| api-v2 | ~200+ | Monorepo SaaS |
| cloud-finops | 305 | AI-powered FinOps intelligence platform |
Each project gets the equivalent of a 5-7 person team. One person + AI scaffolding = output of multiple engineering teams in parallel.
The Numbers
Power law: 29 release arcs (avg 10h each) delivered nearly half of all autonomous work.
| Item | Monthly Cost |
|---|---|
| Claude Max+ (2 accounts, rotating) | $400 |
| OpenAI/Google API (Codex, Gemini) | ~$100 |
| Total | ~$500/month |
"I used to try to conserve tokens. That's the wrong mental model. Tokens = work getting done. The more tokens I'm using, the more the AI is doing for me. The flat rate helped me get over that mental barrier."
543 hours of autonomous work ÷ ~3 months = ~6 hours/day of Claude working independently.
At $100-150/hour for an engineer, those 543 hours represent $54,000-81,000 of equivalent labor—roughly $18,000-27,000 per month—for ~$500/month in API costs.
Note: This is the payoff from 5 months of scaffolding development—525 commits building from shell scripts to 60 MCP tools.
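As a quick back-of-envelope check of that arithmetic (the engineer rates and the ~3-month, 30-day-month window are the assumptions stated above):

autonomous_hours = 543
months = 3
monthly_cost = 500                      # Claude Max+ accounts plus API usage
rate_low, rate_high = 100, 150          # assumed engineer rates, $/hour

hours_per_day = autonomous_hours / (months * 30)   # ~6 h/day
labor_low = rate_low * autonomous_hours            # $54,300 over the period
labor_high = rate_high * autonomous_hours          # $81,450 over the period
print(f"~{hours_per_day:.0f} h/day, "
      f"${labor_low // months:,}-${labor_high // months:,}/month equivalent "
      f"vs ${monthly_cost}/month spend")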
No. Here's why:
| Anti-Slop Indicator | Evidence |
|---|---|
| Verification exists | 4-layer automated pipeline (unit → E2E → Playwright → visual) |
| It deploys | 165 releases passed CI/CD pipelines |
| It was planned | 356 design docs (PRDs, specs, architecture) |
| Infra is real | 1,956 Terraform + CI/CD file modifications |
| It's documented | 2,891 markdown file modifications |
| It's been reviewed | 2,974 quality gate checks (WAF pillars, pattern enforcement) |
Slop is code-only, test-free, and doesn't ship. This is PRD → Design → Code → Test → Deploy → Document. Full vertical stack.
More importantly: quality is enforced at the $1 phase (design review), not discovered at the $100 phase (production). The 2,974 gate checks aren't overhead—they're the reason this work has value.
The Key Discovery
"The 46% is where decisions get made. That's the work."
| Tier | Arc Types | % of Arcs | % of Hours | Purpose |
|---|---|---|---|---|
| Steering & Alignment | interactive, review | 46% | 22% | Human judgment, decisions, quality gates |
| Momentum | quick, build | 37% | 10% | Routine tasks, keep progress moving |
| Value Delivery | feature, release, debug | 18% | 68% | Major work, overnight capability |
| Type | % | Avg Duration | Agents | Description |
|---|---|---|---|---|
| review | 24.9% | 23 min | 0 | Human-driven code/design review |
| quick | 22.0% | 5 min | 2.7 | Fast single tasks |
| interactive | 20.9% | 33 min | 0 | Direct conversation, Q&A |
| build | 14.5% | 33 min | 4.4 | Test/build cycles |
| feature | 11.8% | 112 min | 9.4 | Multi-task implementation |
| release | 4.5% | 618 min | 8.7 | Full release burn down |
| debug | 1.4% | 118 min | 1.2 | Investigation cycles |
The Problem
You're the glue code.
The system is the API.
Analysis reveals 42% of prompts are repetitive commands (structured), while 58% are adaptive collaboration (context-specific steering).
| Pattern | Count | Example |
|---|---|---|
| Context compaction | 703 | /compact |
| Delegation template | 403 | "Please review docs/prompts..." |
| Confirmations | 376 | "Yes please" |
| Review tools | 324 | "Run review_code on R70" |
The "noise" represents real-time steering: bug investigation, architecture decisions, UI feedback, cross-session coordination. This is the human-in-the-loop providing context templates can't capture.
The Setup
"You wouldn't hand someone a ticket and say 'fix the bug' then disappear. You'd point them to the repo, show them how to run tests, explain the standards, and ask for a PR."
That's the infrastructure this presentation is about.
AI needs the same scaffolding. The difference: you can codify it once and reuse it forever. The process docs become MCP resources. The review criteria become automated gates. The feedback becomes tool output.
This metaphor is useful—but it's still task-oriented. You're thinking about how to guide someone through work.
The real unlock is moving from task delegation to outcome delegation. From micromanaging steps to setting boundaries.
We'll return to this after examining the collaboration spectrum.
The Spectrum
- L1: Text-in, text-out. Human is the API. Most teams are here.
- L2: AI has tools (file read, shell, search). Still reactive—can't plan or chain.
- L3: AI operates within a system. Reads queue, spawns workers, enforces standards.
| Responsibility | L1 | L2 | L3 |
|---|---|---|---|
| Code generation | AI | AI | AI |
| File operations | Human | AI | AI |
| Running tests | Human | AI | AI |
| Task planning | Human | Human | AI (validated) |
| Quality review | Human | Human | System (gates) |
| Progress tracking | Human | Human | System (state) |
| Error recovery | Human | Human | AI (respawn) |
The leap from L2 to L3: The system handles planning, review, and state—not just execution.
The Shift
"Write function X, then call it from Y, then update the tests."
"Complete Release 5.9."
"The infrastructure isn't instructions. It's boundaries.
Inside those boundaries: freedom."
In US military leadership training, commanders are sent to observe an elementary school playground. They arrive thinking:
"I am going to control every movement of every kid on that playground."
Of course, this doesn't work. No commander—no matter how skilled—can micromanage 30 children at recess.
But the kindergarten teacher succeeds effortlessly. How?
"Set boundaries. Give freedom within them."
This is exactly how L3 autonomy works.
Here's the actual prompt that launched 29 release arcs totaling 298 hours:
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release].
Notice what it doesn't specify:
It only specifies:
| Aspect | Task Delegation (L2) | Outcome Delegation (L3) |
|---|---|---|
| You specify | HOW to do each step | WHAT success looks like |
| Agent decides | Nothing (follows script) | Implementation details |
| Control via | Instructions (prescriptive) | Boundaries (prohibitive) |
| Scales to | Minutes of work | Hours of work |
| Human role | Operator | Executive |
Each component constrains a failure mode:
| Component | Not This (Instructions) | But This (Boundaries) |
|---|---|---|
| Task Database | "Work on Task 42" | "Here's the queue; claim what's ready" |
| Real Tools | "Use grep, then sed" | "Here are your tools; choose wisely" |
| Process Docs | "Follow steps 1-10" | "Here's the process; adapt to context" |
| Review Gates | "Format code this way" | "Pass review or revise" |
Task delegation doesn't scale. If you have to specify every action, you become the bottleneck. You've just built a voice-controlled IDE.
Outcome delegation scales. You define success once, encode the boundaries, and let agents find the path. The boundaries scale. You don't have to.
This is why the 13-hour release arc was possible. Not because the AI is smart—but because the boundaries were clear.
The Infrastructure
| Pillar | What It Provides | Evidence |
|---|---|---|
| Task Database | Machine-readable queue with status, dependencies, blocking rules | 11,956 tracking calls |
| Real Tools | Same access as human engineers: file system, shell, search, deploy | 56,315 Bash calls |
| Workflow Guidance | Tools report position: "step 5/12, next: run tests." Stateful tools, stateless agents | 13h release arcs |
| Review Gates | Automated checks that catch errors. A different model reviews the work | 2,974 review calls |

These define the boundaries. Inside them: freedom.
This infrastructure wasn't built in a week. It emerged over 5 months of iterative development:
| Component | Started As | Evolved Into |
|---|---|---|
| Task Database | Manual task lists | SQLite task queue with status machine, dependencies, auto-transitions |
| Real Tools | Basic Bash + Read/Write | 60 MCP tools including knowledge graph, analytics, LLM guidance |
| Process Docs | Informal patterns | Process docs exposed as MCP resources, codified delegation prompts |
| Review Gates | 2 shell scripts (~700 LOC) | review_code, review_plan, review_design with multi-model validation |
The guiding principles—walking skeleton, TDD, objective verification—were codified in CLAUDE.md before any tooling existed. The shell scripts implemented review gates. The MCP embedded them into persistent tools. The multi-model orchestration refined them further.
What never changed: The philosophy. What evolved: the implementation.
Review Gates have a specific mechanism: the model that does the work is not the model that reviews it.
This separation of execution and evaluation is what makes the 543 autonomous hours trustworthy. See the Review Gates deep-dive for the full mechanism.
Infrastructure 1
A SQLite queue with status machine, dependencies, and blocking rules.
11,956 calls to tracking/planning tools means agents are constantly asking: "What is my goal? What is the current status? How do I report progress?"
Without structured state, the agent is flying blind. With it, the orchestrator knows what to do next.
A project database (SQLite via MCP) with tables for tasks, releases, and status. Agents query it constantly, update it as they work, and the orchestrator uses it to decide what to spawn next.
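As a rough illustration, a queue like this can be sketched in a few lines of SQLite; the table and column names below are illustrative, not the MCP's actual schema.

import sqlite3

# Illustrative schema: a task queue with a status field and dependency edges.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id          TEXT PRIMARY KEY,
    release_id  TEXT NOT NULL,
    title       TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'todo',  -- e.g. todo, design, ready, in_progress, done
    acceptance_criteria TEXT
);
CREATE TABLE IF NOT EXISTS task_dependencies (
    task_id    TEXT NOT NULL REFERENCES tasks(id),
    depends_on TEXT NOT NULL REFERENCES tasks(id),
    PRIMARY KEY (task_id, depends_on)
);
"""

def ready(db: sqlite3.Connection, release_id: str):
    """Tasks in 'ready' whose dependencies are all 'done'—the orchestrator's worklist."""
    return db.execute(
        """SELECT t.id, t.title FROM tasks t
           WHERE t.release_id = ? AND t.status = 'ready'
             AND NOT EXISTS (
                 SELECT 1 FROM task_dependencies d
                 JOIN tasks dep ON dep.id = d.depends_on
                 WHERE d.task_id = t.id AND dep.status != 'done')""",
        (release_id,),
    ).fetchall()

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)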
Infrastructure 2
Same access as human engineers: file system, shell, search, deploy.
Bash (execute) > Read (understand) > Edit (change) > Grep (search). This is the same ratio you'd see from a productive engineer.
"The bulk of work isn't abstract reasoning—it's concrete, small, verifiable actions."
Don't build new AI-specific tools. Wrap your existing linter, test runner, and deploy script. The AI can use the same commands you use.
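For example, a test runner can be wrapped so the agent gets machine-readable results instead of raw terminal scroll. A minimal sketch, assuming a Go project and `go test -json`; the returned JSON shape is a placeholder, not the project's actual tooling:

import json
import subprocess

def run_tests(path: str = ".") -> dict:
    """Run the same test command humans use and return structured results."""
    proc = subprocess.run(
        ["go", "test", "./...", "-json"],
        cwd=path, capture_output=True, text=True,
    )
    failures = []
    for line in proc.stdout.splitlines():
        if not line.startswith("{"):
            continue
        event = json.loads(line)
        if event.get("Action") == "fail":
            failures.append(event)          # package/test names the agent can act on
    return {
        "passed": proc.returncode == 0,
        "failures": failures,
        "stderr": proc.stderr[-2000:],      # truncate to keep agent context small
    }

if __name__ == "__main__":
    print(json.dumps(run_tests(), indent=2))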
Infrastructure 3
Tools report position: "step 5/12, do this next." Stateful tools, stateless agents.
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release]. Remind them to work with pal as a partner and the tools in the project MCP.
| Type | Duration | Use Case | Example |
|---|---|---|---|
| Quick arcs | < 15 min | Single tasks, fixes | Run tests, add endpoint |
| Release arcs | 2-13+ hours | Full features, overnight | v2.5: 47 items, 66K tool calls |
The workers are composable units. The orchestration period is the true autonomous window—an orchestrator can spawn dozens of workers over hours.
Note: The 40+ min workers include cases like agent-ad7836b which ran for 10h 17m during v2.5, handling complex backend implementation autonomously.
Here's the secret: agents don't memorize the route. Every tool response includes workflow guidance:
{
"step_number": 5,
"total_steps": 12,
"next_step_required": true,
"required_actions": ["Run tests", "Update documentation"],
"guidance": "Implementation complete. Verify tests pass before proceeding.",
"auto_fix_tasks": [...],
"human_decisions_needed": [...],
"escalations": [...]
}
Stateful Tools, Stateless Agents. The agent doesn't need to maintain context for 10 hours. It executes a task, gets a result, and the tool tells it what's next. The long-running arc is managed by the orchestrator and stateful tools, not a single, fragile agent context.
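In code terms, that pattern reduces the agent loop to something like the sketch below; `call_tool` and `execute` are hypothetical stand-ins, and the tool name and action are illustrative rather than the MCP's real interface. The field names follow the guidance envelope shown above.

def work_loop(call_tool, execute) -> None:
    """Follow tool-provided guidance until the workflow says it is finished."""
    while True:
        guidance = call_tool("work_tracking", action="next_step")  # hypothetical call
        for action in guidance.get("required_actions", []):
            execute(action)                      # e.g. "Run tests", "Update documentation"
        if guidance.get("escalations"):
            break                                # stop and surface to the human
        if not guidance.get("next_step_required", False):
            break                                # workflow complete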
| Component | Analogy | What It Does |
|---|---|---|
| Agent | Driver | Executes current step |
| Tool Response | GPS Voice | "In 500m, turn left" |
| Task Queue | Route Plan | All waypoints to destination |
| Workflow Guidance | GPS Display | "Step 5/12, next: run tests" |
This is why one simple prompt can trigger 13 hours of coherent work. The project-mcp is portable scaffolding—same tools, same workflow guidance, applied to any project.
Infrastructure 4
The model that does the work is not the model that reviews it.
| Role | Who | What They Do |
|---|---|---|
| Legislator | Human | Encodes judgment in criteria (CLAUDE.md, prompt templates) |
| Executive | Claude | Implements—writes code, creates designs, executes tasks |
| Judiciary | Gemini | Evaluates against criteria, enforces standards |
This separation prevents: AI reviewing its own work (conflict of interest), human reviewing everything (doesn't scale), or no review (dangerous).
Every review finding gets categorized:
| Category | What Happens | Example |
|---|---|---|
| AUTO_FIX | Reflex-level response—agent handles automatically | Formatting, unused imports, simple error handling |
| HUMAN_DECISION | Requires deliberate attention—escalate with options | Algorithm choice, API design trade-offs |
| ESCALATE | Stop work, require human review | Architecture changes, security concerns |
Result: Human reviews 0% of AUTO_FIX, some HUMAN_DECISION, all ESCALATE. The criteria scale. You don't have to.
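A minimal sketch of that routing rule, using the three categories above; the `Finding` shape and the handler functions are illustrative, not the MCP's actual types.

from dataclasses import dataclass

@dataclass
class Finding:
    category: str        # "AUTO_FIX" | "HUMAN_DECISION" | "ESCALATE"
    description: str
    confidence: float    # 0.0-1.0, as the review prompt requires
    rationale: str

def route(findings: list[Finding], agent_fix, ask_human, stop_work) -> None:
    """The human sees none of the AUTO_FIX items, some decisions, every escalation."""
    for f in findings:
        if f.category == "AUTO_FIX":
            agent_fix(f)                  # formatting, unused imports, simple error handling
        elif f.category == "HUMAN_DECISION":
            ask_human(f)                  # escalate with context and options
        else:
            stop_work(f)                  # architecture/security: require human review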
"I encoded my engineering judgment in 360 lines of shell script. Gemini applied it 2,974 times. The criteria scaled. I didn't have to."
The MCP tooling shapes what agents produce. Then the eval checks both structure and spirit:
| Artifact | Structured Fields | Objective Checks | Subjective Checks |
|---|---|---|---|
| Release | Tasks, phases, dependencies | All tasks have IDs? Dependencies valid? | Does this represent a vertical slice? |
| Task | Acceptance criteria, priority, estimated hours | Has acceptance criteria? Measurable? | Right size? Not over-engineered? |
| Plan | 7-section structure | All sections present? Success criteria objectively measurable? | Embodies walking skeleton? WAF pillars addressed? |
| Design | Problem analysis, proposed solution, risk assessment | Schema complete? Alternatives considered? | Scope matches task? Not gold-plating? |
| Code | Files, functions, test coverage | Tests exist? Builds pass? | Follows project patterns? WAF security pillar? |
Key insight: The MCP tooling ensures agents produce structured artifacts with required fields. The eval then verifies both the structure (objective) and the spirit (subjective).
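For the objective half, the structural checks need no model judgment at all. A minimal sketch for the Task row above, with illustrative field names; the subjective questions (right size? over-engineered?) still go to the reviewing model.

def objective_task_checks(task: dict) -> list[str]:
    """Return structural problems; an empty list means 'pass on to the subjective review'."""
    problems = []
    if not task.get("acceptance_criteria"):
        problems.append("missing acceptance criteria")
    if "priority" not in task:
        problems.append("missing priority")
    if "estimated_hours" not in task:
        problems.append("missing estimated hours")
    return problems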
When review_plan runs, it applies Google's Well-Architected Framework pillars in priority order:
This is how "embody the spirit of WAF" becomes a concrete, repeatable evaluation—not a vague guideline.
| Anti-Pattern | What Gemini Flags |
|---|---|
| TDD Micro-management | Tasks named "RED", "GREEN", "REFACTOR" as separate items |
| Premature Infrastructure | CI/CD before core functionality works |
| Horizontal Slicing | "Database Layer", "API Layer" instead of vertical features |
| Integration Assumptions | External API tasks without de-risking spikes |
| Over-engineering | Design scope exceeds the specific task |
The review tool requires 7 mandatory sections:
The Economics
[Chart: Initial Development vs. Lifecycle Cost]
2,974 quality gates isn't bureaucracy.
It's aggressive asset protection.
The cost to fix a defect rises exponentially the later you find it:
Scenario: AI generates 10,000 lines of code in 1 hour.
AI makes code generation nearly free. That's precisely why quality gates become more important, not less.
| Gate Type | Count | Value |
|---|---|---|
| review_design | 1,687 | High Leverage: Stopped bad ideas before code was written. |
| review_code | 867 | Debt Prevention: Enforced patterns, prevented "spaghetti". |
| review_plan | 420 | Alignment: Ensured we built the right thing. |
The review_code tool isn't a simple linter. It's a fully agentic process with access to:
It verifies that implementation matches design, code meets task requirements, and all artifacts align. This isn't validation—it's verification.
"The most valuable work the AI did wasn't the code it wrote.
It was the 2,000+ times it told me 'No'."
Smart Triage
Every time you make a call, the system records it. Next time: fewer questions.
"Simple stuff gets fixed automatically. Hard stuff gets escalated with context."
Every human decision gets captured to a per-project Knowledge Base. Next time a similar question arises, the system checks existing decisions before asking:
| Concern | How the System Answers It |
|---|---|
| "How do I trust it?" | Transparent triage with confidence scores + rationale |
| "Won't I repeat myself?" | KB remembers decisions per-project |
| "How does it scale?" | Each decision increases future autonomy |
From review_code prompt template:
**Triage Category Guidelines:**
- AUTO_FIX: Simple, mechanical fixes that don't change logic or architecture
- HUMAN_DECISION: Issues requiring design choices, trade-offs, or architectural decisions
- ESCALATE: Major issues requiring significant refactoring or cross-team coordination

**Confidence**: Rate your confidence (0.0-1.0)
**Rationale**: Why this triage category was chosen
As the Knowledge Base grows:
This is institutional knowledge that survives team changes, onboards new agents, and compounds over time.
Leverage
Human expertise flows into the system in two ways:
- Review criteria, CLAUDE.md, process docs—applied 2,974 times by evals.
- Occasional interventions—short messages that redirect entire arcs.
"One sentence about bounded contexts saved 40 hours of wrong implementation."
These are actual human interventions from the MCP development logs:
"Why does the review-service need any DB packages at all? It just uses tools to pull what it needs, it should not do direct DB access because that is not part of its bounded context."
Impact: Prevented tight coupling between services. The review-service became a stateless orchestrator that only uses tool interfaces.
"Please examine the knowledge graph functionality... work with pal to come up with the DDD plan and relevant bounded contexts and then create a design for this new, extracted service."
Impact: Kicked off proper architectural thinking. AI identified 4 bounded contexts in 6,200 LOC of code, designed hexagonal architecture with proper domain layer separation.
"Why are we putting this into Firestore instead of BQ?"
Impact: 8-word question that redirected the entire storage strategy. AI had defaulted to Firestore for graph data; corrected to BigQuery for analytical workloads.
"The deeper issue is that we are supposed to have a dataset per tenant. Right now, it seems like the schema is very wrong, with a dataset per microservice."
Impact: Human diagnosed the root cause of tenant isolation issues. Fundamentally shifted the infrastructure architecture from service-per-dataset to tenant-per-dataset.
| Intervention | Human Time | Implementation Time Saved/Redirected |
|---|---|---|
| Bounded context correction | ~30 seconds | ~4 hours of wrong coupling |
| DDD architecture mandate | ~2 minutes | ~20 hours of design + implementation |
| Storage architecture redirect | ~10 seconds | ~8 hours of wrong storage layer |
| Multi-tenancy diagnosis | ~1 minute | ~10 hours of debugging |
Total: ~4 minutes of human expertise steering ~42 hours of implementation.
Human expertise enters the system through two complementary mechanisms: criteria codified up front (review templates, CLAUDE.md, process docs) and live steering through occasional interventions. The first loop scales without limit. The second loop ensures the first loop stays calibrated.
| Pattern | Example |
|---|---|
| Socratic questions | "Why are we putting this in Firestore?" |
| Architectural principles | "not part of its bounded context" |
| Root cause identification | "The deeper issue is..." |
| Quality standards | "enterprise-grade domain driven design... elegant" |
Notice: None of these tell the AI how to implement. They tell it what principles to apply.
Real Example
It came back with news I could act on.
Create smoke test for undo feature. Test failed: unknown message type: <nil>
"The undo code is in the repo but not deployed here. I can't test a feature that isn't running. Deploy it first, then I'll rerun the test."
No invented solution. Just: "here's the wall, here's why I stopped."
The alternative failure modes are worse:
This agent did the right thing: investigated, found the real blocker (deployment gap), and reported back with actionable information.
This agent sent 413 messages over 6h 42m. Breakdown:
| Tool | Count | Purpose |
|---|---|---|
| Bash | 84 | Running tests, git ops, deployments |
| Read | 25 | Understanding existing code |
| Grep | 16 | Finding implementations |
| Edit | 13 | Creating/modifying test script |
| Glob | 6 | Finding files |
| TodoWrite | 5 | Tracking progress |
| MCP tools | 4 | Design docs, project status |
You can walk away because when it hits a wall, it tells you which wall and why. Both outcomes give you actionable information. What's unacceptable is silent garbage or invented workarounds.
The Philosophy Shift
"Plan to throw one away; you will anyway."
V1 is the question. V2 is the answer.
The first version's job is to be wrong in useful ways.
| Resource | Traditional | AI-Assisted |
|---|---|---|
| Code production | Expensive (human-months) | Cheap (tokens) |
| Human attention | Available | The bottleneck |
| Throwaway code | Waste | Investment in understanding |
"We wrote 18,866 lines specifically to learn why we shouldn't keep them."
If implementation costs tokens (cheap) instead of human-months (expensive), building V1 becomes the optimal requirements gathering technique.
You're not building to ship. You're building to discover what you should build.
A mindset for AI-assisted development:
| Phase | Purpose | Outcome |
|---|---|---|
| V1 (Probe) | Let AI build the feature | Discover where boundaries should be |
| Audit | Read the code for structure, not syntax | Identify coupling, duplication, gaps |
| V2 (Structure) | Regenerate from scratch with lessons | Clean implementation with proper interfaces |
Key insight: Don't refactor V1. Delete and regenerate. The cost of fixing AI's "first draft" assumptions often exceeds the cost of a clean V2.
| Preserve (Human Attention Artifacts) | Throw Away (Token Artifacts) |
|---|---|
| Interface contracts, type definitions | Implementation details |
| Design decisions and rationale | Code that no longer fits |
| Test suites (the "truth" of the system) | First-draft architectures |
| Migration learnings | Dead code (causes "context pollution") |
The KGS (Knowledge Graph Service) cutover:
"The cost of code is going to zero (in dollars, not time), so I have no ego around just throwing entire systems away once I know how to build things."
This isn't recklessness—it's disciplined iteration. The philosophy and architecture remain constant; only the implementation is fluid.
What Actually Happened
Scripts encoding every mistake I kept making. Review gates because I couldn't trust myself.
MCP let me stop repeating instructions. Zen became my rubber duck. Agents got leashes.
Design skill, screenshots, Codex, Chrome DevTools. Half of it stuck.
Clink pipes Claude to other CLIs. I watch the work happen. This is when it got weird.
Before any MCP infrastructure existed, the core philosophy was codified in two shell scripts:
| Script | LOC | What It Did |
|---|---|---|
| `review-plan.sh` | ~360 | Called Gemini to review plans against 7-section template, anti-patterns, walking skeleton methodology |
| `review-artifact.sh` | ~355 | Targeted artifact review with plan context, compliance checking |
Key insight: The philosophy (TDD, walking skeleton, review gates) predated all tooling. The shell scripts were the MVP implementation. When the MCP was built, it embedded this same philosophy into persistent, stateful tools.
These shell scripts were the first implementation of a key architectural pattern: the model that does the work is not the model that reviews it.
This separation of execution and evaluation is what enables trusted autonomy at scale.
| Era | Tooling | State | Models |
|---|---|---|---|
| Era 0 | 2 bash scripts | Stateless | Single Gemini |
| Era 1-3 | 60 MCP tools | SQLite persistence | Multi-model (Gemini, OpenAI, Claude) |
525 commits over 5 months transformed the scaffolding. What stayed constant: the philosophy.
| Date | Milestone | Impact |
|---|---|---|
| ~June | Shell scripts written | Review gates implemented as bash + Gemini |
| Aug 2 | MCP initial commit | Philosophy embedded in Go + SQLite |
| Oct 2 | Log window begins | Project MCP, zen, agent delegation |
| Oct 28 | Design skill introduced | Screenshot-based UI iteration |
| Nov 8 | Codex experiments begin | Multi-model debugging discovered |
| Nov 26 | Clink introduced | Agentic CLI delegation (key unlock) |
| Dec 13 | Foreground subagents | Real-time visibility, better coordination |
When Claude gets stuck debugging, the move is to hand the same problem to a different model—Codex or Gemini—and compare their reads.
Why it works: Model diversity breaks reasoning loops.
Where This Started
Write the steps as a checklist. Your first machine-readable process doc.
Wrap an existing command with structured output. Your linter, test runner, or deploy script.
Write a validator for the AI's output, not just a better prompt.
The 60-tool system that produced 543 autonomous hours began here:
- `review-plan.sh` — Gemini API call to validate plans against a checklist
- `review-artifact.sh` — Same pattern for code review

~700 lines of bash. No MCP. No SQLite. Just philosophy encoded in prompts. Everything else grew from there.
## Task: Add a dependency
1. Check if dependency already exists
2. Run `bun add <package>`
3. Verify lockfile updated
4. Run tests to catch breaking changes
5. Commit with message: "deps: add <package>"
#!/bin/bash
# review-lint.sh
eslint . --format json 2>/dev/null || echo '{"error": "lint failed"}'
#!/bin/bash
# validate-plan.sh
for section in "Rollback" "Verification" "Security"; do
grep -q "$section" "$1" || { echo "REJECTED: Missing $section"; exit 1; }
done
echo "APPROVED"
The Split
5% of interactions → 48% autonomous work. One sentence, hours of execution.
46% stays interactive. Reviews, decisions, redirects. That's where judgment lives.
"I decide what gets built. The system handles execution, iteration, testing.
The split happens at the right seam."
PRD → Design → Code → Test → Deploy → Document.
165 releases. Six projects. One person.
The surprising part isn't that agents run 13 hours unsupervised. It's that 46% stays deliberate. That's the design working.
The system handles reflex-level work automatically. Deliberate work rises to conscious attention. That's the point.
Short interventions that saved hours:
This is the 5% that makes the 48% possible. Human pattern recognition catching AI drift, expressed as architectural principles rather than implementation details.
| Enabler | Not This |
|---|---|
| Task from a queue | Vague goal |
| Process docs read first | Improvisation |
| Review gates to pass | Optional checks |
| State to update | Black box |
| Standards enforced | YOLO mode |
| Tools report position | Agent tracks state |
The last row is the hidden unlock: stateful tools, stateless agents. Every tool response includes "step 5/12, do this next." The agent doesn't need to remember where it is—the tools tell it.
| Risk | Constraint |
|---|---|
| Agent pursues tangent | Task-scoped work from queue |
| Agent makes breaking changes | review_code gate before completion |
| Agent can't find answer | Escalation to thinking partner |
| Agent doesn't report status | Mandatory work_tracking updates |
| Agent accesses wrong systems | Restricted permission level |
| Agent's approach is wrong | review_plan gate before implementation |
"Structure enables autonomy. The 46% interactive is the control surface. The 54% runs because the boundaries are clear."
Fred Brooks said "Plan to throw one away." With AI, this becomes optimal strategy:
The cost of code is going to zero (in dollars). What remains valuable: interfaces, decisions, tests, and architecture.
AI makes the cheap part (code generation) nearly free. But 80% of software cost is maintenance—debugging, rework, technical debt. That's where quality gates pay off:
The 2,974 quality gate invocations aren't overhead—they're the mechanism that prevents $100 problems (production bugs, rework, debt). Speed without quality creates liability. Speed with quality creates value.
Kindergarten teachers don't control every movement. They set boundaries and give freedom within them.
Leadership shifts from micromanaging tasks to defining outcomes and constraints.
Once you're running multiple projects in parallel, a new question emerges: how do you add more?
| Bottleneck | Solution |
|---|---|
| Token limits per project | Separate API accounts, budget allocation |
| Human attention (5% steering) | Portfolio-style time-boxing, async check-ins |
| MCP per project overhead | Shared tooling, templated scaffolding |
| Context switching cost | Consistent patterns across projects |
| Knowledge silos | Cross-project knowledge base, shared decisions |
The new constraint: You can't steer infinite projects simultaneously. The 5% expertise amplification that enables each project still requires human attention. Scaling means optimizing attention allocation, not eliminating it.
The unlock: Shared infrastructure. When the components are templated, adding a new project means copying the boundaries—not reinventing them.
Appendix
It ships to production. Here's what that looks like.
"Does AI produce usable code?" isn't the right question. "Does your system ensure usable code?" is. The infrastructure is the answer.
This isn't just code volume. It's the artifact diversity of a full engineering organization:
| Role | Artifact Type | File Modifications |
|---|---|---|
| Backend + Frontend Dev | Application Code (.go, .ts, .js) | 64,867 |
| QA Engineer | Test Files (*_test.go, *.test.ts) | 3,038 |
| DevOps/SRE | Terraform, CI/CD, Dockerfiles | 1,956 |
| Tech Writer | Documentation (.md) | 2,891 |
| Architect | API Schemas, Protos, SQL | 504 |
| UI Developer | HTML, CSS, SCSS | 569 |
Total: 90,356 file modifications across 6 active projects.
Quality isn't one metric. It's a fully automated verification pipeline:
| Layer | What It Checks | How |
|---|---|---|
| Unit Tests | Code paths work | 59% coverage across 41 Go packages |
| E2E Tests | Deployed services work | Tests run against real infrastructure |
| Playwright | Browser flows work | Automated UI interaction tests |
| Visual Verification | UI matches design | Gemini compares screenshots to mocks |
All automated. All run by agents. The 165 releases passed through this entire pipeline.
| Indicator | Slop | This System |
|---|---|---|
| Verification | None | 4-layer automated pipeline (unit → E2E → Playwright → visual) |
| Deployment | Doesn't ship | 165 releases passed CI/CD |
| Planning | No design docs | 356 PRDs/specs created |
| Infrastructure | No infra | 1,956 Terraform/CI changes |
| Documentation | Missing | 2,891 markdown files |
Real planning artifacts produced during the study period:
- `DESIGN-PKCE-ENDPOINTS.md` — OAuth PKCE implementation spec
- `PRD-PROJECT-MANAGEMENT-MCP.md` — Full product requirements doc
- `TASK-913-EXPORT-SERVICE.md` — Service design for export feature
- `agent-supervision-framework.md` — Agent orchestration architecture
- `release-management.md` — Release workflow documentation

This is the work of a product team, not a code generator.
Appendix
"Tokens = work getting done.
The more tokens, the more the AI is doing for me."
Appendix
Every statistic comes from actual Claude Code chat logs.
No surveys. No estimates. Parsed JSON from 97 days.
Claude Code stores full conversation transcripts in ~/.claude/projects/*/ as JSONL files. Each line is a timestamped message with type, content, and tool calls.
- Session files: `UUID.jsonl` (can exceed 1GB for long-running sessions)
- Agent files: `agent-*.jsonl` (1-50MB per agent)
- Line format: `{"type": "human"|"assistant"|"tool_use", "message": {...}, "timestamp": "..."}`

An arc is a period of autonomous work initiated by a single user prompt. We built a Python tool (arc_analyzer.py) that:
- `burn\s+down.*(?:release|R\d+|tasks)` — Release burn down
- `spawn.*(?:restricted|foreground).*agent` — Explicit delegation
- `please.*review.*docs\/prompts.*spawn` — Full delegation template
- Task tool calls and their completions

Arcs are then classified by a simple heuristic (duration measured in minutes):

if agents_spawned == 0:
    if duration < 15 and "review" in prompt:
        type = "review"
    else:
        type = "interactive"
elif duration < 15:
    type = "quick"
elif duration < 60:
    type = "build"
elif "debug" in prompt:
    type = "debug"
elif duration < 240:
    type = "feature"
else:
    type = "release"
| Claim | Verification |
|---|---|
| 650 arcs detected | arc_analyzer.py stats command output |
| 29 release arcs | arc_analyzer.py list --type release |
| 5% → 48% power law | 29/650 = 4.5%, 299 hrs / ~620 total hrs = 48% |
| v2.5 example | Manual inspection of session UUID 9081bc23-* |
| 543 autonomous hours | Sum of subagent session durations from timestamps |
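For the last row, a minimal sketch of the measurement under the stated assumptions: durations come from the first and last timestamps in each agent-*.jsonl file, and the path and parsing are simplified relative to arc_analyzer.py.

import json
from datetime import datetime
from pathlib import Path

def session_hours(path: Path) -> float:
    """Duration of one agent session: last timestamp minus first, in hours."""
    stamps = []
    with path.open() as f:
        for line in f:
            ts = json.loads(line).get("timestamp")
            if ts:
                stamps.append(datetime.fromisoformat(ts.replace("Z", "+00:00")))
    if len(stamps) < 2:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds() / 3600

total = sum(
    session_hours(p)
    for p in Path.home().glob(".claude/projects/*/agent-*.jsonl")
)
print(f"{total:.0f} autonomous hours")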
The same methodology can be applied to your own Claude Code logs at ~/.claude/projects/:
# Extract arcs from your sessions
python arc_analyzer.py extract

# Show arc statistics
python arc_analyzer.py stats

# List release arcs only
python arc_analyzer.py list --type release

# Generate full report
python arc_analyzer.py report
The tool and methodology are fully documented. The numbers hold up to scrutiny because they're measured from real logs, not modeled or estimated.