97 Days of Logs
543 hours of work I didn't do myself
One person. Six concurrent projects. 165 shipped releases. The infrastructure that made it possible—and the receipts to prove it.
This presentation has three viewing modes. Press D to cycle between them:
| Mode | Purpose | Best For |
|---|---|---|
| Presentation | Full-screen slides with key insights | Talks, quick overview |
| Detail | Slide + expanded analysis below | Deep reading, exploration |
| Document | Scrollable long-form with all content | Reference, printing |
The slides tell the story. The detail sections provide evidence, methodology, and nuance. You're currently in Detail or Document mode—that's why you can see this text.
This document answers one question: How does a user get AI coding assistants to work autonomously over long periods to produce complex deliverables at an acceptable level of quality?
The answer comes from a forensic examination of chat logs, annotated with notes on why the user made each decision—97 days of real logs from one developer running six concurrent projects with Claude Code.
| Section | What It Covers |
|---|---|
| The Hook | A single prompt that triggered 13 hours of autonomous work |
| The Numbers | 543 hours, roughly $500/month, and what the math actually looks like |
| The Pyramid | Where human time goes when AI handles execution |
| The Infrastructure | Four pillars that make long-range autonomy possible |
| The Economics | Why quality gates pay for themselves |
| Start Monday | Concrete steps to begin building your own scaffolding |
| Appendices | Methodology, cost breakdown, code quality evidence |
With the right infrastructure, one person can produce the output of multiple engineering teams in parallel—not by working harder, but by building scaffolding that lets AI magnify human strengths.
The rest of this document shows exactly how it works, with receipts.
The Data
Not a team. Not a survey. Parsed directly from 97 days of Claude Code sessions—six concurrent projects, 2,314 agent sessions, all verifiable.
14,926 prompts · 2,314 agent sessions · 543 autonomous hours
When you see aggregate statistics—thousands of prompts, hundreds of hours—the natural assumption is "a team did this." That assumption makes the data feel distant, organizational, not personally achievable.
The truth: this is what happens when you build the right infrastructure—and let it run.
Traditional scaling requires hiring. You need more people to do more work. The constraint is headcount, budget, coordination overhead.
With the right AI scaffolding, one person can:
This output isn't the result of working 16-hour days. It's the result of:
The rest of this presentation shows exactly how it works—and how you can replicate it.
The Hook
Click to reveal
"I typed one sentence. Went to bed. Woke up to a deployable release."
This isn't a replication of the legacy process, with agents spawned to play the roles we traditionally assign to humans. The orchestrator executed a well-structured plan—reading dependency chains from the database, sequencing work accordingly, monitoring progress, and adapting.
That 13-hour autonomous run was part of a larger workflow:
| Phase | Mode | What Happens |
|---|---|---|
| 1. Exploration | Interactive | Human + AI discover what to build. Back-and-forth discussion, research, prototyping ideas. |
| 2. Planning | Largely autonomous | AI creates tasks, captures dependency chains, structures the release. Gemini validates via review_plan. |
| 3. Implementation | Fully autonomous | Orchestrator reads the plan from the database and executes it. This is the "13 hours" part. |
| 4. Review | Autonomous loop | Gemini runs review_code on output. Agent fixes issues until it passes. 867 review calls total. |
Key insight: The orchestrator didn't "figure out" dependencies—it read dependency chains that were captured during planning. The leverage is in the setup, not just the execution.
After receiving the prompt, the orchestrator's first action was to analyze the work:
"Current v2.5 Status: 60 total tasks (13 cancelled). 40 tasks need design (must go through design_documents → review_design before implementation). 7 tasks in todo (can be claimed directly)."
It identified that most tasks were blocked by dependencies. Its strategic response:
After each wave completed, the orchestrator checked status and decided what to do next:
| Time | Progress | Orchestrator Decision | Agents |
|---|---|---|---|
| +0.0h | 0% | Analyze dependencies → spawn design + frontend agents | 2 |
| +0.6h | 18% | Design done → spawn implementation agents for Phases 1-3 | 3 |
| +11.0h | 58% | Core complete → spawn agents for remaining work | 3 |
| +12.3h | 68% | Almost done → spawn final frontend + test agents | 2 |
| +12.9h | 100% | "v2.5: Event Processing Pipeline - COMPLETE!" | — |
The orchestrator did the work of a project manager executing a plan:
The heavy lifting happened earlier: during exploration (figuring out what to build) and planning (capturing the structure and acceptance criteria). The autonomous execution during implementation is return on that up-front investment.
Analysis of the actual session logs shows what those 10 agents did:
| Metric | Count | What It Means |
|---|---|---|
| Shell commands | 21,464 | Tests run, builds triggered, git ops |
| File reads | 8,474 | Understanding existing code |
| Code edits | 6,670 | Modifications to source files |
| Files created | 983 | New source files written |
| Source files touched | 1,212 | .go, .ts, .tf, .proto, etc. |
| Total tool calls | 66,325 | Autonomous actions taken |
Note: These numbers are from parsing the actual JSONL chat logs, not estimates.
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release v2.5].
That's it. 47 words. But this prompt only works because of what it points to.
The key phrase is "review the docs/prompts exposed as resources." Those resources define the entire workflow:
| Resource | What It Teaches the AI |
|---|---|
| `development-workflow-sequence` | Complete methodology: walking skeleton approach, 4-phase workflow, quality gates |
| `determine-needed-agents` | Decision matrix: when to spawn agents vs. do work yourself, capacity guidelines |
| `escalation-decision-matrix` | 5-level escalation framework with specific triggers for each level |
| `coordinator-troubleshooting-guide` | Real-world patterns: "agent claims are 50% accurate—always verify with database" |
These aren't vague guidelines—they're operational playbooks with decision trees, copy-paste commands, and hard-won lessons from production use. The AI reads them, internalizes the process, then executes it.
The workflow lives in the resources, not in the prompt.
The resources teach the process. The database provides the data:
SELECT * FROM tasks WHERE release_id = 'v2.5' AND status = 'ready' AND all_dependencies_complete = true
Resources define how to work. The database defines what to work on. Together, they enable a 47-word prompt to trigger 13 hours of coherent execution.
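For illustration, the wave pattern can be sketched as a small loop over that query. This is a minimal sketch: `ready_tasks` and `spawn_agent` are hypothetical helpers over the task database, and in practice the orchestrator is a Claude session following the MCP resources, not a script.

def run_release(ready_tasks, spawn_agent, wave_size: int = 3) -> None:
    """Spawn agents in waves until no ready work remains (hypothetical helpers)."""
    while True:
        batch = ready_tasks()[:wave_size]   # only tasks whose dependencies are complete
        if not batch:
            break                           # release finished, or blocked on a human decision
        agents = [spawn_agent(task) for task in batch]
        for agent in agents:
            agent.wait()                    # status check between waves, as in the v2.5 timeline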
| Term | Definition |
|---|---|
| Arc | A period of autonomous work triggered by one user prompt. Can last minutes (quick arc) or hours (release arc). |
| Orchestrator | The main AI session that manages the workflow—analyzing tasks, spawning agents, monitoring progress. |
| Work item | A discrete unit of work tracked in the project database—a feature, bug fix, test, or refactor. |
| Wave | A batch of agents spawned to work in parallel, followed by a status check before the next wave. |
The Question
Most AI adoption stops at "type a prompt, get a response." That's using a power tool as a hammer. What happens when you build around it instead?
"The gains don't come from better prompts. They come from different infrastructure."
Most people use AI in one mode:
The second mode is different:
You can't just "tell the AI to do more." You need infrastructure:
The rest of this presentation shows what that infrastructure looks like—and how it produces 543 hours of work from one person.
Ivan Zhao's essay "Steam, Steel, and Infinite Minds" (December 2025) explores similar themes—how AI changes knowledge work at an organizational level.
The Pattern
Click to reveal the pattern
"The 13-hour run wasn't an outlier—it was just Tuesday."
If the v2.5 story impressed you, you should be skeptical. One impressive demo proves nothing.
But over 97 days, we ran the exact same pattern 29 times. That's not luck—it's a stable orbit.
Here's the most boring part of this story: the prompt never changed.
For all 29 release arcs—spanning 298 of the 543 total autonomous hours—the trigger was identical:
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release].
Stable input. Stable system. Stable output.
The 29 release arcs ranged from 4.1 hours to 17.3 hours:
| Duration | Count | Examples |
|---|---|---|
| 4-6 hours | 6 | Smaller releases, single-domain work |
| 6-10 hours | 5 | Medium complexity, multiple services |
| 10-14 hours | 14 | Full feature releases like v2.5 |
| 14+ hours | 4 | Large multi-system integrations |
The pattern works because the scaffolding is consistent:
The orchestrator isn't doing magic. It's following a well-documented process—the same way a good PM would.
This scaffolding didn't appear overnight. It evolved over 5 months:
What the 29 release arcs demonstrate is the payoff—a mature system in operation. The scaffolding was built incrementally, one tool at a time.
The v2.5 release (highlighted in teal above) was our 21st release arc. At 12.9 hours, it was actually slightly above average—not exceptional.
What made it a good example for this presentation:
But any of the 29 would tell a similar story.
The Constraint Changed
543 hours of agent work in 3 months—across 6 parallel projects.
Scaling = more projects. Bottleneck = your bandwidth to steer them.
| Era | Bottleneck | Scaling Strategy |
|---|---|---|
| Traditional | Human hours | Hire more people |
| Copilot | Human attention | AI assists, human still bottleneck |
| Autonomous | API tokens | Run more agents in parallel |
Human time is finite and non-purchasable. API tokens are purchasable. This changes the economics:
I hit Claude Code's token limits when trying to scale up. The constraint is no longer my calendar—it's my API budget. That's a good problem to have.
"The goal isn't to make AI faster. It's to remove you from the critical path."
This isn't just volume—it's multiple engineering teams running in parallel:
| Project | Sessions | Domain |
|---|---|---|
| project-mcp | 774 | The MCP powering this workflow |
| saas-platform | 756 | SaaS product |
| web-portal | 182 | Web platform |
| workflow-platform | 141 | Legacy polyglot event-driven microservices monorepo |
| api-v2 | ~200+ | Monorepo SaaS |
| cloud-finops | 305 | AI-powered FinOps intelligence platform |
Each project gets the equivalent of a 5-7 person team. One person + AI scaffolding = output of multiple engineering teams in parallel.
The Numbers
Power law: 29 release arcs (avg 10h each) delivered nearly half of all autonomous work.
| Item | Monthly Cost |
|---|---|
| Claude Max+ (2 accounts, rotating) | $400 |
| OpenAI/Google API (Codex, Gemini) | ~$100 |
| Total | ~$500/month |
"I used to try to conserve tokens. That's the wrong mental model. Tokens = work getting done. The more tokens I'm using, the more the AI is doing for me. The flat rate helped me get over that mental barrier."
543 hours of autonomous work ÷ ~3 months = ~6 hours/day of Claude working independently.
At $100-150/hour for an engineer, those 543 hours represent $54,000-81,000 of equivalent labor—roughly $18,000-27,000 per month—for ~$500/month in API costs.
Note: This is the payoff from 5 months of scaffolding development—525 commits building from shell scripts to 60 MCP tools.
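As a quick back-of-envelope check of that arithmetic (the engineer rates and the ~3-month, 30-day-month window are the assumptions stated above):

autonomous_hours = 543
months = 3
monthly_cost = 500                      # Claude Max+ accounts plus API usage
rate_low, rate_high = 100, 150          # assumed engineer rates, $/hour

hours_per_day = autonomous_hours / (months * 30)   # ~6 h/day
labor_low = rate_low * autonomous_hours            # $54,300 over the period
labor_high = rate_high * autonomous_hours          # $81,450 over the period
print(f"~{hours_per_day:.0f} h/day, "
      f"${labor_low // months:,}-${labor_high // months:,}/month equivalent "
      f"vs ${monthly_cost}/month spend")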
No. Here's why:
| Anti-Slop Indicator | Evidence |
|---|---|
| Verification exists | 4-layer automated pipeline (unit → E2E → Playwright → visual) |
| It deploys | 165 releases passed CI/CD pipelines |
| It was planned | 356 design docs (PRDs, specs, architecture) |
| Infra is real | 1,956 Terraform + CI/CD file modifications |
| It's documented | 2,891 markdown file modifications |
| It's been reviewed | 2,974 quality gate checks (WAF pillars, pattern enforcement) |
Slop is code-only, test-free, and doesn't ship. This is PRD → Design → Code → Test → Deploy → Document. Full vertical stack.
More importantly: quality is enforced at the $1 phase (design review), not discovered at the $100 phase (production). The 2,974 gate checks aren't overhead—they're the reason this work has value.
The Key Discovery
"The 46% is where decisions get made. That's the work."
| Tier | Arc Types | % of Arcs | % of Hours | Purpose |
|---|---|---|---|---|
| Steering & Alignment | interactive, review | 46% | 22% | Human judgment, decisions, quality gates |
| Momentum | quick, build | 37% | 10% | Routine tasks, keep progress moving |
| Value Delivery | feature, release, debug | 18% | 68% | Major work, overnight capability |
| Type | % | Avg Duration | Agents | Description |
|---|---|---|---|---|
| review | 24.9% | 23 min | 0 | Human-driven code/design review |
| quick | 22.0% | 5 min | 2.7 | Fast single tasks |
| interactive | 20.9% | 33 min | 0 | Direct conversation, Q&A |
| build | 14.5% | 33 min | 4.4 | Test/build cycles |
| feature | 11.8% | 112 min | 9.4 | Multi-task implementation |
| release | 4.5% | 618 min | 8.7 | Full release burn down |
| debug | 1.4% | 118 min | 1.2 | Investigation cycles |
The Problem
You're the glue code.
The system is the API.
Analysis reveals 42% of prompts are repetitive commands (structured), while 58% are adaptive collaboration (context-specific steering).
| Pattern | Count | Example |
|---|---|---|
| Context compaction | 703 | /compact |
| Delegation template | 403 | "Please review docs/prompts..." |
| Confirmations | 376 | "Yes please" |
| Review tools | 324 | "Run review_code on R70" |
The "noise" represents real-time steering: bug investigation, architecture decisions, UI feedback, cross-session coordination. This is the human-in-the-loop providing context templates can't capture.
The Setup
"You wouldn't hand someone a ticket and say 'fix the bug' then disappear. You'd point them to the repo, show them how to run tests, explain the standards, and ask for a PR."
That's the infrastructure this presentation is about.
AI needs the same scaffolding. The difference: you can codify it once and reuse it forever. The process docs become MCP resources. The review criteria become automated gates. The feedback becomes tool output.
This metaphor is useful—but it's still task-oriented. You're thinking about how to guide someone through work.
The real unlock is moving from task delegation to outcome delegation. From micromanaging steps to setting boundaries.
We'll return to this after examining the collaboration spectrum.
The Spectrum
- L1: Text-in, text-out. Human is the API. Most teams are here.
- L2: AI has tools (file read, shell, search). Still reactive—can't plan or chain.
- L3: AI operates within a system. Reads queue, spawns workers, enforces standards.
| Responsibility | L1 | L2 | L3 |
|---|---|---|---|
| Code generation | AI | AI | AI |
| File operations | Human | AI | AI |
| Running tests | Human | AI | AI |
| Task planning | Human | Human | AI (validated) |
| Quality review | Human | Human | System (gates) |
| Progress tracking | Human | Human | System (state) |
| Error recovery | Human | Human | AI (respawn) |
The leap from L2 to L3: The system handles planning, review, and state—not just execution.
The Shift
"Write function X, then call it from Y, then update the tests."
"Complete Release 5.9."
"The infrastructure isn't instructions. It's boundaries.
Inside those boundaries: freedom."
In US military leadership training, commanders are sent to observe an elementary school playground. They arrive thinking:
"I am going to control every movement of every kid on that playground."
Of course, this doesn't work. No commander—no matter how skilled—can micromanage 30 children at recess.
But the kindergarten teacher succeeds effortlessly. How?
"Set boundaries. Give freedom within them."
This is exactly how L3 autonomy works.
Here's the actual prompt that launched 29 release arcs totaling 298 hours:
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release].
Notice what it doesn't specify:
It only specifies:
| Aspect | Task Delegation (L2) | Outcome Delegation (L3) |
|---|---|---|
| You specify | HOW to do each step | WHAT success looks like |
| Agent decides | Nothing (follows script) | Implementation details |
| Control via | Instructions (prescriptive) | Boundaries (prohibitive) |
| Scales to | Minutes of work | Hours of work |
| Human role | Operator | Executive |
Each component constrains a failure mode:
| Component | Not This (Instructions) | But This (Boundaries) |
|---|---|---|
| Task Database | "Work on Task 42" | "Here's the queue; claim what's ready" |
| Real Tools | "Use grep, then sed" | "Here are your tools; choose wisely" |
| Process Docs | "Follow steps 1-10" | "Here's the process; adapt to context" |
| Review Gates | "Format code this way" | "Pass review or revise" |
Task delegation doesn't scale. If you have to specify every action, you become the bottleneck. You've just built a voice-controlled IDE.
Outcome delegation scales. You define success once, encode the boundaries, and let agents find the path. The boundaries scale. You don't have to.
This is why the 13-hour release arc was possible. Not because the AI is smart—but because the boundaries were clear.
The Infrastructure
| Pillar | What It Provides | Evidence |
|---|---|---|
| Task Database | Machine-readable queue with status, dependencies, blocking rules | 11,956 tracking calls |
| Real Tools | Same access as human engineers: file system, shell, search, deploy | 56,315 Bash calls |
| Workflow Guidance | Tools report position: "step 5/12, next: run tests." Stateful tools, stateless agents | 13h release arcs |
| Review Gates | Automated checks that catch errors. A different model reviews the work | 2,974 review calls |

These define the boundaries. Inside them: freedom.
This infrastructure wasn't built in a week. It emerged over 5 months of iterative development:
| Component | Started As | Evolved Into |
|---|---|---|
| Task Database | Manual task lists | SQLite task queue with status machine, dependencies, auto-transitions |
| Real Tools | Basic Bash + Read/Write | 60 MCP tools including knowledge graph, analytics, LLM guidance |
| Process Docs | Informal patterns | Process docs exposed as MCP resources, codified delegation prompts |
| Review Gates | 2 shell scripts (~700 LOC) | review_code, review_plan, review_design with multi-model validation |
The guiding principles—walking skeleton, TDD, objective verification—were codified in CLAUDE.md before any tooling existed. The shell scripts implemented review gates. The MCP embedded them into persistent tools. The multi-model orchestration refined them further.
What never changed: The philosophy. What evolved: the implementation.
Review Gates have a specific mechanism: the model that does the work is not the model that reviews it.
This separation of execution and evaluation is what makes the 543 autonomous hours trustworthy. See the Review Gates deep-dive for the full mechanism.
Infrastructure 1
A SQLite queue with status machine, dependencies, and blocking rules.
11,956 calls to tracking/planning tools means agents are constantly asking: "What is my goal? What is the current status? How do I report progress?"
Without structured state, the agent is flying blind. With it, the orchestrator knows what to do next.
A project database (SQLite via MCP) with tables for tasks, releases, and status. Agents query it constantly, update it as they work, and the orchestrator uses it to decide what to spawn next.
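As a rough illustration, a queue like this can be sketched in a few lines of SQLite; the table and column names below are illustrative, not the MCP's actual schema.

import sqlite3

# Illustrative schema: a task queue with a status field and dependency edges.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id          TEXT PRIMARY KEY,
    release_id  TEXT NOT NULL,
    title       TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'todo',  -- e.g. todo, design, ready, in_progress, done
    acceptance_criteria TEXT
);
CREATE TABLE IF NOT EXISTS task_dependencies (
    task_id    TEXT NOT NULL REFERENCES tasks(id),
    depends_on TEXT NOT NULL REFERENCES tasks(id),
    PRIMARY KEY (task_id, depends_on)
);
"""

def ready(db: sqlite3.Connection, release_id: str):
    """Tasks in 'ready' whose dependencies are all 'done'—the orchestrator's worklist."""
    return db.execute(
        """SELECT t.id, t.title FROM tasks t
           WHERE t.release_id = ? AND t.status = 'ready'
             AND NOT EXISTS (
                 SELECT 1 FROM task_dependencies d
                 JOIN tasks dep ON dep.id = d.depends_on
                 WHERE d.task_id = t.id AND dep.status != 'done')""",
        (release_id,),
    ).fetchall()

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)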
Infrastructure 2
Same access as human engineers: file system, shell, search, deploy.
Bash (execute) > Read (understand) > Edit (change) > Grep (search). This is the same ratio you'd see from a productive engineer.
"The bulk of work isn't abstract reasoning—it's concrete, small, verifiable actions."
Don't build new AI-specific tools. Wrap your existing linter, test runner, and deploy script. The AI can use the same commands you use.
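For example, a test runner can be wrapped so the agent gets machine-readable results instead of raw terminal scroll. A minimal sketch, assuming a Go project and `go test -json`; the returned JSON shape is a placeholder, not the project's actual tooling:

import json
import subprocess

def run_tests(path: str = ".") -> dict:
    """Run the same test command humans use and return structured results."""
    proc = subprocess.run(
        ["go", "test", "./...", "-json"],
        cwd=path, capture_output=True, text=True,
    )
    failures = []
    for line in proc.stdout.splitlines():
        if not line.startswith("{"):
            continue
        event = json.loads(line)
        if event.get("Action") == "fail":
            failures.append(event)          # package/test names the agent can act on
    return {
        "passed": proc.returncode == 0,
        "failures": failures,
        "stderr": proc.stderr[-2000:],      # truncate to keep agent context small
    }

if __name__ == "__main__":
    print(json.dumps(run_tests(), indent=2))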
Infrastructure 3
Tools report position: "step 5/12, do this next." Stateful tools, stateless agents.
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release]. Remind them to work with pal as a partner and the tools in the project MCP.
| Type | Duration | Use Case | Example |
|---|---|---|---|
| Quick arcs | < 15 min | Single tasks, fixes | Run tests, add endpoint |
| Release arcs | 2-13+ hours | Full features, overnight | v2.5: 47 items, 66K tool calls |
The workers are composable units. The orchestration period is the true autonomous window—an orchestrator can spawn dozens of workers over hours.
Note: The 40+ min workers include cases like agent-ad7836b which ran for 10h 17m during v2.5, handling complex backend implementation autonomously.
Here's the secret: agents don't memorize the route. Every tool response includes workflow guidance:
{
"step_number": 5,
"total_steps": 12,
"next_step_required": true,
"required_actions": ["Run tests", "Update documentation"],
"guidance": "Implementation complete. Verify tests pass before proceeding.",
"auto_fix_tasks": [...],
"human_decisions_needed": [...],
"escalations": [...]
}
Stateful Tools, Stateless Agents. The agent doesn't need to maintain context for 10 hours. It executes a task, gets a result, and the tool tells it what's next. The long-running arc is managed by the orchestrator and stateful tools, not a single, fragile agent context.
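In code terms, that pattern reduces the agent loop to something like the sketch below; `call_tool` and `execute` are hypothetical stand-ins, and the tool name and action are illustrative rather than the MCP's real interface. The field names follow the guidance envelope shown above.

def work_loop(call_tool, execute) -> None:
    """Follow tool-provided guidance until the workflow says it is finished."""
    while True:
        guidance = call_tool("work_tracking", action="next_step")  # hypothetical call
        for action in guidance.get("required_actions", []):
            execute(action)                      # e.g. "Run tests", "Update documentation"
        if guidance.get("escalations"):
            break                                # stop and surface to the human
        if not guidance.get("next_step_required", False):
            break                                # workflow complete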
| Component | Analogy | What It Does |
|---|---|---|
| Agent | Driver | Executes current step |
| Tool Response | GPS Voice | "In 500m, turn left" |
| Task Queue | Route Plan | All waypoints to destination |
| Workflow Guidance | GPS Display | "Step 5/12, next: run tests" |
This is why one simple prompt can trigger 13 hours of coherent work. The project-mcp is portable scaffolding—same tools, same workflow guidance, applied to any project.
Infrastructure 4
The model that does the work is not the model that reviews it.
| Role | Who | What They Do |
|---|---|---|
| Legislator | Human | Encodes judgment in criteria (CLAUDE.md, prompt templates) |
| Executive | Claude | Implements—writes code, creates designs, executes tasks |
| Judiciary | Gemini | Evaluates against criteria, enforces standards |
This separation prevents: AI reviewing its own work (conflict of interest), human reviewing everything (doesn't scale), or no review (dangerous).
Every review finding gets categorized:
| Category | What Happens | Example |
|---|---|---|
| AUTO_FIX | Reflex-level response—agent handles automatically | Formatting, unused imports, simple error handling |
| HUMAN_DECISION | Requires deliberate attention—escalate with options | Algorithm choice, API design trade-offs |
| ESCALATE | Stop work, require human review | Architecture changes, security concerns |
Result: Human reviews 0% of AUTO_FIX, some HUMAN_DECISION, all ESCALATE. The criteria scale. You don't have to.
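A minimal sketch of that routing rule, using the three categories above; the `Finding` shape and the handler functions are illustrative, not the MCP's actual types.

from dataclasses import dataclass

@dataclass
class Finding:
    category: str        # "AUTO_FIX" | "HUMAN_DECISION" | "ESCALATE"
    description: str
    confidence: float    # 0.0-1.0, as the review prompt requires
    rationale: str

def route(findings: list[Finding], agent_fix, ask_human, stop_work) -> None:
    """The human sees none of the AUTO_FIX items, some decisions, every escalation."""
    for f in findings:
        if f.category == "AUTO_FIX":
            agent_fix(f)                  # formatting, unused imports, simple error handling
        elif f.category == "HUMAN_DECISION":
            ask_human(f)                  # escalate with context and options
        else:
            stop_work(f)                  # architecture/security: require human review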
"I encoded my engineering judgment in 360 lines of shell script. Gemini applied it 2,974 times. The criteria scaled. I didn't have to."
The MCP tooling shapes what agents produce. Then the eval checks both structure and spirit:
| Artifact | Structured Fields | Objective Checks | Subjective Checks |
|---|---|---|---|
| Release | Tasks, phases, dependencies | All tasks have IDs? Dependencies valid? | Does this represent a vertical slice? |
| Task | Acceptance criteria, priority, estimated hours | Has acceptance criteria? Measurable? | Right size? Not over-engineered? |
| Plan | 7-section structure | All sections present? Success criteria objectively measurable? | Embodies walking skeleton? WAF pillars addressed? |
| Design | Problem analysis, proposed solution, risk assessment | Schema complete? Alternatives considered? | Scope matches task? Not gold-plating? |
| Code | Files, functions, test coverage | Tests exist? Builds pass? | Follows project patterns? WAF security pillar? |
Key insight: The MCP tooling ensures agents produce structured artifacts with required fields. The eval then verifies both the structure (objective) and the spirit (subjective).
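For the objective half, the structural checks need no model judgment at all. A minimal sketch for the Task row above, with illustrative field names; the subjective questions (right size? over-engineered?) still go to the reviewing model.

def objective_task_checks(task: dict) -> list[str]:
    """Return structural problems; an empty list means 'pass on to the subjective review'."""
    problems = []
    if not task.get("acceptance_criteria"):
        problems.append("missing acceptance criteria")
    if "priority" not in task:
        problems.append("missing priority")
    if "estimated_hours" not in task:
        problems.append("missing estimated hours")
    return problems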
When review_plan runs, it applies Google's Well-Architected Framework pillars in priority order:
This is how "embody the spirit of WAF" becomes a concrete, repeatable evaluation—not a vague guideline.
| Anti-Pattern | What Gemini Flags |
|---|---|
| TDD Micro-management | Tasks named "RED", "GREEN", "REFACTOR" as separate items |
| Premature Infrastructure | CI/CD before core functionality works |
| Horizontal Slicing | "Database Layer", "API Layer" instead of vertical features |
| Integration Assumptions | External API tasks without de-risking spikes |
| Over-engineering | Design scope exceeds the specific task |
The review tool requires 7 mandatory sections:
The Economics
[Chart: Initial Development vs. Lifecycle Cost]
2,974 quality gates isn't bureaucracy.
It's aggressive asset protection.
The cost to fix a defect rises exponentially the later you find it:
Scenario: AI generates 10,000 lines of code in 1 hour.
AI makes code generation nearly free. That's precisely why quality gates become more important, not less.
| Gate Type | Count | Value |
|---|---|---|
| review_design | 1,687 | High Leverage: Stopped bad ideas before code was written. |
| review_code | 867 | Debt Prevention: Enforced patterns, prevented "spaghetti". |
| review_plan | 420 | Alignment: Ensured we built the right thing. |
The review_code tool isn't a simple linter. It's a fully agentic process with access to:
It verifies that implementation matches design, code meets task requirements, and all artifacts align. This isn't validation—it's verification.
"The most valuable work the AI did wasn't the code it wrote.
It was the 2,000+ times it told me 'No'."
Smart Triage
Every time you make a call, the system records it. Next time: fewer questions.
"Simple stuff gets fixed automatically. Hard stuff gets escalated with context."
Every human decision gets captured to a per-project Knowledge Base. Next time a similar question arises, the system checks existing decisions before asking:
| Concern | How the System Answers It |
|---|---|
| "How do I trust it?" | Transparent triage with confidence scores + rationale |
| "Won't I repeat myself?" | KB remembers decisions per-project |
| "How does it scale?" | Each decision increases future autonomy |
From review_code prompt template:
**Triage Category Guidelines:**
- AUTO_FIX: Simple, mechanical fixes that don't change logic or architecture
- HUMAN_DECISION: Issues requiring design choices, trade-offs, or architectural decisions
- ESCALATE: Major issues requiring significant refactoring or cross-team coordination

**Confidence**: Rate your confidence (0.0-1.0)
**Rationale**: Why this triage category was chosen
As the Knowledge Base grows:
This is institutional knowledge that survives team changes, onboards new agents, and compounds over time.
Leverage
Human expertise flows into the system in two ways:
- Review criteria, CLAUDE.md, process docs—applied 2,974 times by evals.
- Occasional interventions—short messages that redirect entire arcs.
"One sentence about bounded contexts saved 40 hours of wrong implementation."
These are actual human interventions from the MCP development logs:
"Why does the review-service need any DB packages at all? It just uses tools to pull what it needs, it should not do direct DB access because that is not part of its bounded context."
Impact: Prevented tight coupling between services. The review-service became a stateless orchestrator that only uses tool interfaces.
"Please examine the knowledge graph functionality... work with pal to come up with the DDD plan and relevant bounded contexts and then create a design for this new, extracted service."
Impact: Kicked off proper architectural thinking. AI identified 4 bounded contexts in 6,200 LOC of code, designed hexagonal architecture with proper domain layer separation.
"Why are we putting this into Firestore instead of BQ?"
Impact: 8-word question that redirected the entire storage strategy. AI had defaulted to Firestore for graph data; corrected to BigQuery for analytical workloads.
"The deeper issue is that we are supposed to have a dataset per tenant. Right now, it seems like the schema is very wrong, with a dataset per microservice."
Impact: Human diagnosed the root cause of tenant isolation issues. Fundamentally shifted the infrastructure architecture from service-per-dataset to tenant-per-dataset.
| Intervention | Human Time | Implementation Time Saved/Redirected |
|---|---|---|
| Bounded context correction | ~30 seconds | ~4 hours of wrong coupling |
| DDD architecture mandate | ~2 minutes | ~20 hours of design + implementation |
| Storage architecture redirect | ~10 seconds | ~8 hours of wrong storage layer |
| Multi-tenancy diagnosis | ~1 minute | ~10 hours of debugging |
Total: ~4 minutes of human expertise steering ~42 hours of implementation.
Human expertise enters the system through two complementary mechanisms: criteria codified up front (review templates, CLAUDE.md, process docs) and live steering through occasional interventions. The first loop scales without limit. The second loop ensures the first loop stays calibrated.
| Pattern | Example |
|---|---|
| Socratic questions | "Why are we putting this in Firestore?" |
| Architectural principles | "not part of its bounded context" |
| Root cause identification | "The deeper issue is..." |
| Quality standards | "enterprise-grade domain driven design... elegant" |
Notice: None of these tell the AI how to implement. They tell it what principles to apply.
Real Example
It came back with news I could act on.
Create smoke test for undo feature. Test failed: unknown message type: <nil>
"The undo code is in the repo but not deployed here. I can't test a feature that isn't running. Deploy it first, then I'll rerun the test."
No invented solution. Just: "here's the wall, here's why I stopped."
The alternative failure modes are worse:
This agent did the right thing: investigated, found the real blocker (deployment gap), and reported back with actionable information.
This agent sent 413 messages over 6h 42m. Breakdown:
| Tool | Count | Purpose |
|---|---|---|
| Bash | 84 | Running tests, git ops, deployments |
| Read | 25 | Understanding existing code |
| Grep | 16 | Finding implementations |
| Edit | 13 | Creating/modifying test script |
| Glob | 6 | Finding files |
| TodoWrite | 5 | Tracking progress |
| MCP tools | 4 | Design docs, project status |
You can walk away because when it hits a wall, it tells you which wall and why. Both outcomes give you actionable information. What's unacceptable is silent garbage or invented workarounds.
The Philosophy Shift
"Plan to throw one away; you will anyway."
V1 is the question. V2 is the answer.
The first version's job is to be wrong in useful ways.
| Resource | Traditional | AI-Assisted |
|---|---|---|
| Code production | Expensive (human-months) | Cheap (tokens) |
| Human attention | Available | The bottleneck |
| Throwaway code | Waste | Investment in understanding |
"We wrote 18,866 lines specifically to learn why we shouldn't keep them."
If implementation costs tokens (cheap) instead of human-months (expensive), building V1 becomes the optimal requirements gathering technique.
You're not building to ship. You're building to discover what you should build.
A mindset for AI-assisted development:
| Phase | Purpose | Outcome |
|---|---|---|
| V1 (Probe) | Let AI build the feature | Discover where boundaries should be |
| Audit | Read the code for structure, not syntax | Identify coupling, duplication, gaps |
| V2 (Structure) | Regenerate from scratch with lessons | Clean implementation with proper interfaces |
Key insight: Don't refactor V1. Delete and regenerate. The cost of fixing AI's "first draft" assumptions often exceeds the cost of a clean V2.
| Preserve (Human Attention Artifacts) | Throw Away (Token Artifacts) |
|---|---|
| Interface contracts, type definitions | Implementation details |
| Design decisions and rationale | Code that no longer fits |
| Test suites (the "truth" of the system) | First-draft architectures |
| Migration learnings | Dead code (causes "context pollution") |
The KGS (Knowledge Graph Service) cutover:
"The cost of code is going to zero (in dollars, not time), so I have no ego around just throwing entire systems away once I know how to build things."
This isn't recklessness—it's disciplined iteration. The philosophy and architecture remain constant; only the implementation is fluid.
What Actually Happened
Scripts encoding every mistake I kept making. Review gates because I couldn't trust myself.
MCP let me stop repeating instructions. Zen became my rubber duck. Agents got leashes.
Design skill, screenshots, Codex, Chrome DevTools. Half of it stuck.
Clink pipes Claude to other CLIs. I watch the work happen. This is when it got weird.
Before any MCP infrastructure existed, the core philosophy was codified in two shell scripts:
| Script | LOC | What It Did |
|---|---|---|
| `review-plan.sh` | ~360 | Called Gemini to review plans against 7-section template, anti-patterns, walking skeleton methodology |
| `review-artifact.sh` | ~355 | Targeted artifact review with plan context, compliance checking |
Key insight: The philosophy (TDD, walking skeleton, review gates) predated all tooling. The shell scripts were the MVP implementation. When the MCP was built, it embedded this same philosophy into persistent, stateful tools.
These shell scripts were the first implementation of a key architectural pattern: the model that does the work is not the model that reviews it.
This separation of execution and evaluation is what enables trusted autonomy at scale.
| Era | Tooling | State | Models |
|---|---|---|---|
| Era 0 | 2 bash scripts | Stateless | Single Gemini |
| Era 1-3 | 60 MCP tools | SQLite persistence | Multi-model (Gemini, OpenAI, Claude) |
525 commits over 5 months transformed the scaffolding. What stayed constant: the philosophy.
| Date | Milestone | Impact |
|---|---|---|
| ~June | Shell scripts written | Review gates implemented as bash + Gemini |
| Aug 2 | MCP initial commit | Philosophy embedded in Go + SQLite |
| Oct 2 | Log window begins | Project MCP, zen, agent delegation |
| Oct 28 | Design skill introduced | Screenshot-based UI iteration |
| Nov 8 | Codex experiments begin | Multi-model debugging discovered |
| Nov 26 | Clink introduced | Agentic CLI delegation (key unlock) |
| Dec 13 | Foreground subagents | Real-time visibility, better coordination |
When Claude gets stuck debugging, the move is to hand the same problem to a different model—Codex or Gemini—and compare their reads.
Why it works: Model diversity breaks reasoning loops.
Where This Started
Write the steps as a checklist. Your first machine-readable process doc.
Wrap an existing command with structured output. Your linter, test runner, or deploy script.
Write a validator for the AI's output, not just a better prompt.
The 60-tool system that produced 543 autonomous hours began here:
- `review-plan.sh` — Gemini API call to validate plans against a checklist
- `review-artifact.sh` — Same pattern for code review

~700 lines of bash. No MCP. No SQLite. Just philosophy encoded in prompts. Everything else grew from there.
## Task: Add a dependency
1. Check if dependency already exists
2. Run `bun add <package>`
3. Verify lockfile updated
4. Run tests to catch breaking changes
5. Commit with message: "deps: add <package>"
#!/bin/bash
# review-lint.sh
eslint . --format json 2>/dev/null || echo '{"error": "lint failed"}'
#!/bin/bash
# validate-plan.sh
for section in "Rollback" "Verification" "Security"; do
grep -q "$section" "$1" || { echo "REJECTED: Missing $section"; exit 1; }
done
echo "APPROVED"
The Split
5% of interactions → 48% autonomous work. One sentence, hours of execution.
46% stays interactive. Reviews, decisions, redirects. That's where judgment lives.
"I decide what gets built. The system handles execution, iteration, testing.
The split happens at the right seam."
PRD → Design → Code → Test → Deploy → Document.
165 releases. Six projects. One person.
The surprising part isn't that agents run 13 hours unsupervised. It's that 46% stays deliberate. That's the design working.
The system handles reflex-level work automatically. Deliberate work rises to conscious attention. That's the point.
Short interventions that saved hours:
This is the 5% that makes the 48% possible. Human pattern recognition catching AI drift, expressed as architectural principles rather than implementation details.
| Enabler | Not This |
|---|---|
| Task from a queue | Vague goal |
| Process docs read first | Improvisation |
| Review gates to pass | Optional checks |
| State to update | Black box |
| Standards enforced | YOLO mode |
| Tools report position | Agent tracks state |
The last row is the hidden unlock: stateful tools, stateless agents. Every tool response includes "step 5/12, do this next." The agent doesn't need to remember where it is—the tools tell it.
| Risk | Constraint |
|---|---|
| Agent pursues tangent | Task-scoped work from queue |
| Agent makes breaking changes | review_code gate before completion |
| Agent can't find answer | Escalation to thinking partner |
| Agent doesn't report status | Mandatory work_tracking updates |
| Agent accesses wrong systems | Restricted permission level |
| Agent's approach is wrong | review_plan gate before implementation |
"Structure enables autonomy. The 46% interactive is the control surface. The 54% runs because the boundaries are clear."
Fred Brooks said "Plan to throw one away." With AI, this becomes optimal strategy:
The cost of code is going to zero (in dollars). What remains valuable: interfaces, decisions, tests, and architecture.
AI makes the cheap part (code generation) nearly free. But 80% of software cost is maintenance—debugging, rework, technical debt. That's where quality gates pay off:
The 2,974 quality gate invocations aren't overhead—they're the mechanism that prevents $100 problems (production bugs, rework, debt). Speed without quality creates liability. Speed with quality creates value.
Kindergarten teachers don't control every movement. They set boundaries and give freedom within them.
Leadership shifts from micromanaging tasks to defining outcomes and constraints.
Once you're running multiple projects in parallel, a new question emerges: how do you add more?
| Bottleneck | Solution |
|---|---|
| Token limits per project | Separate API accounts, budget allocation |
| Human attention (5% steering) | Portfolio-style time-boxing, async check-ins |
| MCP per project overhead | Shared tooling, templated scaffolding |
| Context switching cost | Consistent patterns across projects |
| Knowledge silos | Cross-project knowledge base, shared decisions |
The new constraint: You can't steer infinite projects simultaneously. The 5% expertise amplification that enables each project still requires human attention. Scaling means optimizing attention allocation, not eliminating it.
The unlock: Shared infrastructure. When the components are templated, adding a new project means copying the boundaries—not reinventing them.
Appendix
It ships to production. Here's what that looks like.
"Does AI produce usable code?" isn't the right question. "Does your system ensure usable code?" is. The infrastructure is the answer.
This isn't just code volume. It's the artifact diversity of a full engineering organization:
| Role | Artifact Type | File Modifications |
|---|---|---|
| Backend + Frontend Dev | Application Code (.go, .ts, .js) | 64,867 |
| QA Engineer | Test Files (*_test.go, *.test.ts) | 3,038 |
| DevOps/SRE | Terraform, CI/CD, Dockerfiles | 1,956 |
| Tech Writer | Documentation (.md) | 2,891 |
| Architect | API Schemas, Protos, SQL | 504 |
| UI Developer | HTML, CSS, SCSS | 569 |
Total: 90,356 file modifications across 6 active projects.
Quality isn't one metric. It's a fully automated verification pipeline:
| Layer | What It Checks | How |
|---|---|---|
| Unit Tests | Code paths work | 59% coverage across 41 Go packages |
| E2E Tests | Deployed services work | Tests run against real infrastructure |
| Playwright | Browser flows work | Automated UI interaction tests |
| Visual Verification | UI matches design | Gemini compares screenshots to mocks |
All automated. All run by agents. The 165 releases passed through this entire pipeline.
| Indicator | Slop | This System |
|---|---|---|
| Verification | None | 4-layer automated pipeline (unit → E2E → Playwright → visual) |
| Deployment | Doesn't ship | 165 releases passed CI/CD |
| Planning | No design docs | 356 PRDs/specs created |
| Infrastructure | No infra | 1,956 Terraform/CI changes |
| Documentation | Missing | 2,891 markdown files |
Real planning artifacts produced during the study period:
- `DESIGN-PKCE-ENDPOINTS.md` — OAuth PKCE implementation spec
- `PRD-PROJECT-MANAGEMENT-MCP.md` — Full product requirements doc
- `TASK-913-EXPORT-SERVICE.md` — Service design for export feature
- `agent-supervision-framework.md` — Agent orchestration architecture
- `release-management.md` — Release workflow documentation

This is the work of a product team, not a code generator.
Appendix
"Tokens = work getting done.
The more tokens, the more the AI is doing for me."
Appendix
Every statistic comes from actual Claude Code chat logs.
No surveys. No estimates. Parsed JSON from 97 days.
Claude Code stores full conversation transcripts in ~/.claude/projects/*/ as JSONL files. Each line is a timestamped message with type, content, and tool calls.
- Session files: `UUID.jsonl` (can exceed 1GB for long-running sessions)
- Agent files: `agent-*.jsonl` (1-50MB per agent)
- Line format: `{"type": "human"|"assistant"|"tool_use", "message": {...}, "timestamp": "..."}`

An arc is a period of autonomous work initiated by a single user prompt. We built a Python tool (arc_analyzer.py) that:
- `burn\s+down.*(?:release|R\d+|tasks)` — Release burn down
- `spawn.*(?:restricted|foreground).*agent` — Explicit delegation
- `please.*review.*docs\/prompts.*spawn` — Full delegation template
- Task tool calls and their completions

Arcs are then classified by a simple heuristic (duration measured in minutes):

if agents_spawned == 0:
    if duration < 15 and "review" in prompt:
        type = "review"
    else:
        type = "interactive"
elif duration < 15:
    type = "quick"
elif duration < 60:
    type = "build"
elif "debug" in prompt:
    type = "debug"
elif duration < 240:
    type = "feature"
else:
    type = "release"
| Claim | Verification |
|---|---|
| 650 arcs detected | arc_analyzer.py stats command output |
| 29 release arcs | arc_analyzer.py list --type release |
| 5% → 48% power law | 29/650 = 4.5%, 299 hrs / ~620 total hrs = 48% |
| v2.5 example | Manual inspection of session UUID 9081bc23-* |
| 543 autonomous hours | Sum of subagent session durations from timestamps |
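For the last row, a minimal sketch of the measurement under the stated assumptions: durations come from the first and last timestamps in each agent-*.jsonl file, and the path and parsing are simplified relative to arc_analyzer.py.

import json
from datetime import datetime
from pathlib import Path

def session_hours(path: Path) -> float:
    """Duration of one agent session: last timestamp minus first, in hours."""
    stamps = []
    with path.open() as f:
        for line in f:
            ts = json.loads(line).get("timestamp")
            if ts:
                stamps.append(datetime.fromisoformat(ts.replace("Z", "+00:00")))
    if len(stamps) < 2:
        return 0.0
    return (max(stamps) - min(stamps)).total_seconds() / 3600

total = sum(
    session_hours(p)
    for p in Path.home().glob(".claude/projects/*/agent-*.jsonl")
)
print(f"{total:.0f} autonomous hours")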
The same methodology can be applied to your own Claude Code logs at ~/.claude/projects/:
# Extract arcs from your sessions
python arc_analyzer.py extract

# Show arc statistics
python arc_analyzer.py stats

# List release arcs only
python arc_analyzer.py list --type release

# Generate full report
python arc_analyzer.py report
The tool and methodology are fully documented. The numbers hold up to scrutiny because they're measured from real logs, not modeled or estimated.