The Divide

"AI makes me 10x productive"
"AI produces garbage"

Both are true. The difference is technique.

I'm a top performer who gets maximum productivity from autonomous coding agents. I worked with Claude Code to examine my actual logs and see how.

97 days. 543 autonomous hours. Here's the data.

14,926 prompts  ·  2,314 agent sessions  ·  165 shipped releases

The Online Discourse

Online forums have always been rife with arguments springing from strong opinions, and AI coding productivity is no exception. The conversations follow the same basic pattern:

  • Claim: "I shipped a feature in 2 hours that would have taken 2 days"
  • Response: "You're lying, or you don't know what good code looks like"
  • Outcome: No resolution, because neither side shows receipts

The conversation reaches a stalemate. Each side's opinion comes from a different lived experience, and each side is convinced by its own observations.

The Gap Is Real

Both perceptions are grounded in real experience. High performers aren't lying. Skeptics aren't wrong. The gap exists because:

  • High performers have built infrastructure that makes AI effective
  • Skeptics are using AI like autocomplete—raw prompting without scaffolding
  • The techniques that bridge this gap aren't documented

The different outcomes come down to leverage. LLMs deliver maximum value when they run in a harness that makes autonomous work consistent: a human sets up the tooling and guardrails, and the models magnify that person's productivity.

What This Presentation Offers

Those are big claims, but the real proof is in the data. This is a data-driven, high-level view of how the work gets done. Examining actual session logs reveals the patterns behind typical work cycles: the setup, the tools called, and the guardrails that allow the work to run without constant supervision.

Finally, it gives you the concrete steps to replicate this yourself.

About the Data

Everything here comes from one developer's Claude Code logs:

Metric                       Value
Date range                   Oct 2, 2025 – Jan 2026 (97 days)
Total prompts                14,926
Autonomous agent sessions    2,314
Autonomous hours             543
Concurrent workstreams       6
Shipped releases             165
Monthly cost                 ~$500

The practitioner has 35 years of professional SaaS and software engineering experience. The processes encode engineering management best practices.

What the Data Shows

Seven patterns.
One power law.

650 work arcs clustered into distinct types. 5% of arcs produce 48% of autonomous hours.

Pattern       % Arcs   % Hours   Avg Duration
Release       4.5%     48%       10.3 hours
Feature       11.8%    23%       112 min
Build         14.5%    8%        33 min
Review        24.9%    10%       23 min
Interactive   20.9%    12%       33 min
Quick         22%      2%        5 min
Debug         1.4%     3%        118 min

The leverage is in the long arcs. The short ones enable them.

The Three Tiers

The logs show "arcs" of productivity: groups of related prompts with a natural beginning, middle, and end. The arcs cluster into three tiers by autonomy level:

Tier 1: Steering (46% of arcs, 22% of hours)

Human-in-the-loop collaboration. No agents spawned.

  • Review: Human-driven code/design review sessions
  • Interactive: Direct conversation, Q&A, exploration

This is where decisions happen and human input is most valuable. The human uses the LLM as a thought partner to explore architectural choices, examine debugging hypotheses, and define the scope of work.

Tier 2: Momentum (37% of arcs, 10% of hours)

Short autonomous bursts for routine tasks.

  • Quick: Fast single tasks (avg 5 min, 2.7 agents)
  • Build: Test/build/deploy cycles (avg 33 min, 4.4 agents)

This tier is about clearing small obstacles to keep work moving. The LLM acts as an assistant that runs tests, fixes linting, and deploys changes.

Tier 3: Value Delivery (18% of arcs, 68% of hours)

Extended autonomous execution. This is where output happens.

  • Feature: Multi-task implementation (avg 112 min, 9.4 agents)
  • Release: Full release burn-down (avg 10.3 hours, 8.7 agents)
  • Debug: Deep investigation cycles (avg 118 min)

The Power Law

The distribution is not uniform:

5% of arcs (release)
48% of autonomous hours
29 release arcs total
299 hours from releases

Pattern Definitions

Pattern       Trigger                            What Happens
Release       "burn down tasks in Release X"     Orchestrator reads task graph, spawns waves of agents, monitors to completion
Feature       "implement X" (multi-task)         2-5 agents work through related tasks over 1-4 hours
Build         "run tests" / "fix the build"      Iterative fix-test-fix cycles until green
Review        "review this code"                 Human-guided review, AI executes checks
Interactive   Discussion, questions              Back-and-forth exploration, no agents
Quick         Single small task                  Fast execution, minimal coordination
Debug         "investigate X" / "fix this bug"   Hypothesis testing, trace analysis, systematic investigation

Key Insight

The short arcs (steering, momentum) create the conditions for long arcs (value delivery). You can't skip to release arcs—you need the planning, review, and debugging cycles to set them up.

The Prompt Split

42% templates.
58% steering.

Templates enable autonomy. But human judgment still drives the majority of interaction.

42% Structured
58% Adaptive
"The templates handle the routine. You handle the decisions."

What Gets Templated (42%)

These are the repeatable commands that trigger autonomous work:

Pattern            Count   Example
Release planning   165     "Read the process docs, create Release X with tasks that have verifiable acceptance criteria"
Task delegation    403     "burn down tasks in Release X"
Confirmations      376     "Yes please"
Review requests    324     "run review_code on Release X"
Build/test checks  358     "Does it build? Do tests pass?"
Deploy commands    215     "commit and push"

Note the ratio: 403 delegation prompts ÷ 165 releases = ~2.4x. Each release is planned once, but execution spans multiple sessions. Context fills up, you step away, you come back—each restart requires re-delegating. This is the natural rhythm: stable planning units, chunked execution.

Why Templates Work

Templates aren't just shortcuts—they work because enforced structure gives them consistent fields to reference:

  • "burn down tasks in Release X" → Release has tasks with status, dependencies
  • "run review_code on Release X" → Tasks have acceptance_criteria to verify against
  • "Does it build?" → Process doc defines what "build" means

The same prompt triggers the same workflow because the underlying data has the same structure.
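
As a concrete illustration: the study's task store is an MCP server backed by SQLite, but its schema isn't shown here, so the table and column names below are assumptions based on the fields named above. The point is that "burn down tasks in Release X" can be mechanical only because it resolves to a query over fields every task is guaranteed to have:

#!/usr/bin/env bash
# Hypothetical sketch: what "burn down tasks in Release R15" resolves to when
# every task row has id, status, dependencies, and acceptance_criteria.
# Schema names are illustrative, not the study's actual database.
RELEASE="R15"

sqlite3 tasks.db "
  SELECT t.id, t.acceptance_criteria
  FROM tasks t
  WHERE t.release = '$RELEASE'
    AND t.status = 'todo'
    AND NOT EXISTS (
      SELECT 1
      FROM dependencies d
      JOIN tasks dep ON dep.id = d.depends_on
      WHERE d.task_id = t.id
        AND dep.status != 'done'
    );"

If tasks lacked those fields, the short prompt would have nothing stable to resolve against.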

Document Loading Enables Consistent Decomposition

The "release planning" template is the most powerful example. The process docs exposed as MCP resources define:

  • Iterative approach: build a minimal feature set first, then iterate
  • Task progression: how work flows through design → implementation → review
  • Acceptance criteria format: objective verifiers that LLM agents can check

When you say "read the process docs then create the release," the LLM loads all this into the context window. The result: 165 releases decomposed exactly the same way—iterative structure, proper task dependencies, verifiable acceptance criteria.

The docs aren't just documentation. They're the playbook that context priming injects into every planning session.
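
As a small, hypothetical illustration of "objective verifiers" (this criterion is not taken from the study's docs), the difference is whether an agent can check the criterion mechanically rather than judge it:

# Vague criterion an agent can only guess at:
#   "The timeout classification should be reasonably fast."
# Verifiable criterion an agent can run and read the exit code of
# (package path is illustrative):
go test ./internal/timeout/... -run TestSizeAwareClassification -count=1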

What Stays Adaptive (58%)

The "noise" in clustering analysis—prompts that don't match templates—represents human steering:

Category                 Examples
Specific instructions    "fix the test", "fix the high priority ones"
Context questions        "What is in R6?", "Is it in the openapi spec?"
Bug investigation        "I see this error, where is the host defined"
Architecture decisions   "For the MCP, everything goes via the BFF"
Progress tracking        "Please mark 482 as completed"
Clarifications           "It's not claude desktop its claude code"

The Key Insight

The 58% "adaptive" prompts aren't failure to templatize—they're the human-in-the-loop providing:

  • Real-time steering of autonomous work
  • Context that templates can't capture
  • Business logic decisions
  • Error investigation guidance
  • Cross-session coordination

Templates create a foundation for consistent execution. Steering ensures the execution produces value.

The Release Cycle

One prompt.
13 hours later: shipped.

Analyze (40 min) → Design (40 min) → Implement (10+ hours) → Ship (2 hours)

10 agents spawned  ·  4 waves  ·  47 tasks completed  ·  66K tool calls
"Kicked it off. Went to bed. Woke up to a deployable release."

The Prompt

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release v2.5].

47 words. But this prompt only works because it is built on a foundation that ensures consistency. The "docs/prompts" define how to examine the release, how to determine the needed agents, and how to spawn "restricted" agents.

The docs also provide the guidelines for the agent prompts themselves: each agent is given a specific workflow to follow that constrains its work to the task requirements.

Context Priming: The Critical First Step

The orchestrator's first action was to read the exposed resources: the process docs, workflow definitions, and project standards. This primes the context window with exactly the same knowledge every time.

Why this matters: The AI doesn't "remember" from last time. Every session starts fresh. Consistent results come from consistent context, not from learning. By loading the same resources first, every run operates from the same baseline.

Then: Analyze and Execute

With context primed, the orchestrator analyzed the work:

"Current v2.5 Status: 60 total tasks. 40 tasks need design (must go through design → review before implementation). 7 tasks in todo (can be claimed directly)."

Then executed in waves:

Time     Progress   Decision                                             Agents
+0h      0%         Spawn design + frontend agents (unblock the chain)   2
+0.6h    18%        Design done → spawn implementation agents            3
+11h     58%        Core complete → spawn remaining work                 3
+12.3h   68%        Almost done → final frontend + tests                 2
+12.9h   100%       "v2.5 - COMPLETE!"

What Actually Shipped

Backend (Go)
  • Domain state machine
  • New data model + repository
  • Core service with business logic
  • Pub/Sub event pipeline

Frontend + Infra
  • TypeScript interfaces + components
  • Terraform configs
  • Cloud Build pipeline
  • Integration tests

Not a Fluke: 29 Release Arcs

This wasn't a lucky one-off. Over 97 days, the same pattern ran 29 times:

  • Average duration: 10.3 hours
  • Range: 4.1 hours to 17.3 hours
  • Total: 299 of 543 autonomous hours (55%)
  • Same prompt template every time
  • Same resources loaded first every time

Stable context. Stable input. Stable output.

The Discovery Cycle

Before you can automate,
you have to explore.

46% of arcs spawn no agents: review and direct interaction. This is where decisions happen.

298 review arcs
136 interactive arcs
137 hours total
0 agents spawned
"The release cycle only works if you've done discovery first."

The Core Human-Provided Value

LLMs have evolved into enormously powerful engines that can produce vast quantities of artifacts. But they need guidance to produce output that is actually valuable; without it, they can produce a mountain of garbage just as easily as gold.

Interactive arcs are back-and-forth collaboration where the human uses the LLM as a thought partner to define both the final goal and the best path to get there. This guidance is arguably the most important contribution, because it is the fuel the autonomous work runs on.

This conversation snippet shows how a typical session evolves. After discussing the goals and implementation approach, we check whether Claude is ready to implement:

Human: Do you have enough context in the conversation to understand the final destination of this implementation journey? Please describe it to me.

Claude: Based on what I have in context, I can describe the architectural foundation but I'm missing the complete product vision.

What I understand: [Claude correctly describes the architecture]

This shows how the person checks what understanding exists in the context. Claude is clear that it doesn't know enough to complete the task. Claude wants to be helpful, so if we launched it on an implementation arc at this point it would try its best, but we would have no guarantee the final product would be useful.

However, if we ensure it has what it needs before launching an autonomous arc, the product will be what we expect. This is manual context priming.

Other examples of work done in this phase include:

  • Architectural exploration: "How should we structure this service?"
  • Research: "What's the best way to handle X in golang?"
  • Debugging hypotheses: "I think the bug is in the auth flow"
  • Scope definition: "What should v2.5 include?"
  • Trade-off analysis: "Should we use asynchronous events or direct calls?"

The Four Phases

Every major deliverable follows this sequence:

Phase              Mode              What Happens
1. Exploration     Interactive       Discover what to build. Back-and-forth discussion, research, prototyping.
2. Planning        Guided            Create tasks, capture dependencies, structure the release. AI drafts according to a defined approach. Run review_plan, then work collaboratively to refine the plan.
3. Implementation  Autonomous        Orchestrator reads the plan and executes. This is the "13 hours" part.
4. Review          Autonomous loop   Run review_code, fix issues, repeat until clean.

Exploration Prompts

These are the 58% "adaptive" prompts that can't be templated:

"What's the simplest way to add event sourcing here?"

"I'm seeing timeouts on the cloud function.
Where should I start investigating?"

"For v2.5, we need trend analysis. What data
do we already have that we could use?"

These conversations help the user refine their ideas at the outset, bringing clarity to both what they want and how to get it. They also start laying the foundation in the context: the LLM sees both the final request and the reasoning the user followed, which lets it continue along that line and fill in the details as it works.

review_plan in Action

The planning phase isn't just "write tasks and go." The review_plan tool validates release plans to ensure they follow the established process and checks them for correctness before implementation begins. Here's an example of what it found on Release R15 (Size-Aware Timeout API):

Legitimate Issues Found

Verdict: FAIL

"The core data aggregation logic in BigQuery (Task 433) is fundamentally flawed as it requires [specific data redacted] but lacks any mechanism to access that data within the data warehouse. Without a [redacted data source] in BigQuery, the 'Size-Aware' classification is impossible to implement."

This was a critical architectural blocker—the plan looked complete but couldn't actually work. Claude revised the plan to add data ingestion before re-running review_plan.

Severity: Critical

"A critical contradiction exists in Task 461 regarding the data ingestion strategy (inline constants vs. BigQuery table) which must be resolved to align with dependent tasks."

A logical inconsistency—two tasks made incompatible assumptions about how data would be stored.

Escalated to Human Judgment

"The primary operational risks involve the manual maintenance of the BigQuery reference table (Task 461) and the potential for deployment timeouts if BigQuery is unavailable during cache hydration."

This couldn't be "fixed" by code—it required an operational policy decision about who maintains the reference table and how often. Human input was needed. In this case, the underlying issue was significant enough that it had to be addressed at the architecture level, resulting in a dramatically revised plan.

These examples show the importance of collaboration: some issues could be automatically fixed, some required guidance on the correct approach, and some required full systems thinking.

review_plan ran 7 times on this release. Each iteration found issues, Claude revised, and re-ran until the plan passed. This iterative refinement happens before any implementation begins.

Why Discovery Can't Be Skipped

The autonomous implementation cycle reads from a database with:

  • Task definitions with acceptance criteria
  • Dependency chains (what blocks what)
  • Scope boundaries (what's in this release vs. future)

All of that gets created during discovery and planning. The autonomous execution is the return on that up-front investment.

Safe Autonomy

Guardrails enable speed.
Not the other way around.

The counterintuitive truth: more structure creates more autonomy. Review gates catch errors at $1, not $100.

2,974 quality gate checks
1,687 design reviews
867 code reviews
420 plan reviews

The pattern: Claude proposes → Gemini validates → Fix or proceed

Why Guardrails Enable Autonomy

We've learned to watch the LLM work because we've seen it go off track and create thousands of lines of code that are entirely the wrong thing. But when we add guardrails to the harness, the LLM becomes self-correcting: as it drifts off course, it gets nudged back.

While this comes with an up-front cost, it saves time and money in the end.

  • Bad idea at design phase: ~$1 (a few API calls to reject it)
  • Bad idea in production: ~$100+ (debugging, rollback, customer impact)

The 2,974 quality checks ensure the LLM is not only productive but also correct.

Agentic Reviews, Not Static Prompts

Each review tool is an agent with database access. It queries the task DB for acceptance criteria, pulls release details, and examines the actual files. It doesn't just check "is this good code"—it verifies "does this code do what the task said it should do."

This only works because enforced structure guarantees every task HAS acceptance_criteria. No structure → nothing to verify against.
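
A rough sketch of that verification step, using an illustrative SQLite schema (the study's actual review tools are MCP agents with direct database access, not this script; field, file, and task names are assumptions): pull the task's acceptance criteria and the changed files, and hand both to the reviewer.

# Hypothetical sketch: assemble "what the task said" plus "what the code does"
# for an independent reviewer to compare.
TASK_ID=482

CRITERIA=$(sqlite3 tasks.db \
  "SELECT acceptance_criteria FROM tasks WHERE id = $TASK_ID;")
DIFF=$(git diff main...HEAD)

printf 'Acceptance criteria for task %s:\n%s\n\nImplementation diff:\n%s\n' \
  "$TASK_ID" "$CRITERIA" "$DIFF" > review_input.txt
# review_input.txt then goes to the Gemini-backed reviewer (see Build Your Own).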

The Three Review Gates

Gate           When                        What It Verifies                                                        Calls
review_plan    Before implementation       Release scope aligns with project goals, no gaps                        420
review_design  After design, before code   Design satisfies task acceptance criteria and is not over-engineered    1,687
review_code    After implementation        Code implements what the task says + quality checks                     867

Multi-Model Verification

Claude typically acts as the orchestrator and implementer. However, these agentic review tools do not use Claude; they use Gemini. The key insight: Claude proposes, Gemini validates.

  • Different model, different biases
  • Gemini reviews against documented standards (not just "does it look right")
  • Creates an adversarial dynamic that catches more errors
  • Both models get the same standards loaded first—consistent context enables consistent judgment

The Review Loop (With Triage)

Please run the review_code tool on the artifacts from Release R17
and fix any issues it reports. Repeat until nothing valid is
reported, even suggestions. If anything needs input from me,
first check the knowledge base to see if I already provided
guidance, but if not stop and tell me what I need to decide
with three options, then record my decisions in the knowledge base.

The key: intelligent triage. The reviewer doesn't dump everything on you:

Issue Type            Action                           Human Needed?
Obviously fixable     Auto-fix immediately             No
Needs judgment        Semantic search knowledge base   Maybe
New decision needed   Escalate with 3 options          Yes (once)

Example query: "file storage links widget UUID lookup pattern, structured logging conventions" → finds relevant prior decisions or escalates with structured options.

The benefit: the human doesn't spend time researching each escalated issue. The options the AI presents usually provide enough context, and one of them is often correct, so the user can simply choose from a menu and proceed. The rare cases that need full attention are exactly the ones where that attention makes a difference.

Human time goes to decisions that matter. Decisions get recorded—so the same question auto-resolves next time.

What Makes It Safe

  • Structured task database: Work is defined with acceptance criteria before execution
  • Dependency chains: Tasks can't start until prerequisites are complete
  • Review gates: Each phase validates before the next begins
  • Restricted agents: Sandboxed execution with clear scope
  • Progress tracking: Orchestrator monitors and adapts

The Economics

Those 2,974 review calls cost roughly $50-100 total over 97 days. The bugs they caught would have cost far more to fix in production.

Quality gates are cheap. Production bugs are expensive. The math is obvious once you run it.

The Human Role

You provide judgment.
AI provides execution.

The autonomy pyramid: small fraction of arcs, majority of hours.

Value Delivery   18% arcs → 68% hours
Momentum         37% arcs → 10% hours
Steering         46% arcs → 22% hours
"Most of your time is steering. Most of the output comes from value delivery."

The Three Tiers

Tier            % of Arcs   % of Hours   Your Role
Steering        46%         22%          Make decisions, review, course-correct
Momentum        37%         10%          Kick off routine tasks, verify completion
Value Delivery  18%         68%          Set up the work, walk away

What You Actually Do

Your time goes to high-judgment activities:

  • Architecture decisions: What to build, how to structure it
  • Scope definition: What's in, what's out, what's deferred
  • Quality judgment: Is this good enough? What's missing?
  • Priority calls: What matters most right now?
  • Error interpretation: What does this failure mean?

What AI Does

AI handles the execution-heavy work:

  • Implementation: Writing the actual code
  • Testing: Creating and running tests
  • Iteration: Fix-test-fix cycles until green
  • Documentation: Writing docs that match the code
  • Coordination: Managing task dependencies

The Leverage

The pyramid shows why this scales:

  • You spend 46% of arcs on steering (decisions)
  • But those decisions shape 68% of the autonomous hours
  • Your judgment gets multiplied, not replaced

Not a Junior Engineer

The fluency of the AI makes it easy to think you should interact with it like you would a junior engineer. The best output comes from realizing that it is a machine that produces code and executes tasks, and you should treat it as such.

It is:

  • An execution layer that follows structured plans
  • A verification system that catches its own errors
  • A force multiplier for your expertise

You don't teach it to code. Instead you create the scaffolding to load the machine, then let it run.

The Economics

$500/month.
543 autonomous hours.

Across the whole period, that's roughly $2.95 per hour of AI execution. The leverage is in the volume.

~$500 monthly cost
543 autonomous hours
165 shipped releases
~$2.95 per hour
"Tokens aren't a cost to minimize. They're work getting done."

The Math

Metric             Value      Notes
Study period       97 days    Oct 2025 – Jan 2026
Total cost         ~$1,600    Two Claude Max+ subscriptions, Gemini API calls
Monthly average    ~$500      Varies by activity level
Autonomous hours   543        Agent execution time
Cost per hour      ~$2.95     $1,600 / 543 hours

The Mindset Shift

Many practitioners try to minimize token usage. This is backwards.

  • Old thinking: "Tokens cost money, use fewer"
  • New thinking: "Tokens are work happening, use more"

Every token spent on review_code is a bug caught early. Every token spent on test generation is coverage you didn't write manually.

Where the Money Goes

Activity           % of Tokens   Value
Implementation     ~60%          Code that ships
Review/validation  ~20%          Bugs caught early
Exploration        ~15%          Decisions made
Overhead           ~5%           Coordination, retries

The Real ROI

The 165 shipped releases included:

  • Backend services, APIs, data pipelines
  • Frontend components, UI features
  • Infrastructure (Terraform, CI/CD)
  • Tests, documentation

One person. 165 releases. $500/month.

Failure Modes

What goes wrong.
How to recover.

1,481 issues caught across 97 days. All contained by guardrails.

Failure                Count   Recovery
Code quality issues    867     review_code fix loop
Blocked dependencies   237     Strategic sequencing
Wrong direction        211     Human correction
Design rejections      70      Revision → re-review
Stuck debugging        61      Escalate or pivot
Scope creep            35      Agent self-corrects
"Guardrails don't prevent failures. They contain them."

Code Quality Issues (867 review_code cycles)

What happened: The agents implemented the code and marked the tasks complete. The reviewer found quality issues—bugs, missing error handling, style violations, test gaps.

Recovery: review_code runs, agent fixes issues, re-runs until clean. The loop is automatic:

  1. review_code flags issues
  2. Agent auto-fixes obvious ones
  3. Escalates judgment calls to knowledge base
  4. Re-runs until nothing reported

867 cycles × multiple issues per cycle = thousands of quality fixes that never reached production.

Blocked Dependencies (237 instances)

What happened: Tasks couldn't proceed because prerequisites weren't complete. In R5.9, 40 of 60 tasks were blocked waiting for design approval.

Recovery: Orchestrator recognized the pattern and changed strategy—spawned design agents first to unblock the chain. Progress moved in waves: 0% → 18% → 58% → 68% → 100%.

Wrong Direction (211 instances)

What happened: Agent focused on wrong signal, wrong file, or wrong approach. Example from logs: investigating local code when the real issue was deployment state.

Recovery: Human provides correction with context. Agent acknowledges and refocuses immediately. Average recovery: one prompt.

"You're right - I was looking at the wrong signal entirely."

Design Rejections (70 instances)

What happened: review_design returned NEEDS_REVISION—incomplete edge cases, architecture misalignment, missing acceptance criteria.

Recovery: Automatic status transition: design_ready → needs_design. Agent revises, re-stores, re-reviews. 100% eventually passed.

Notable Insight: The workflow specifically prevents the implementing LLM from transitioning the task to "in progress" until the reviewer approves the design.

Stuck Debugging (61 instances)

What happened: Agent encountered persistent errors that wouldn't resolve. Same fix attempted repeatedly.

Recovery: Escalation to a different model, or the human provides missing context. The 6h 42m agent hit this—it discovered the feature wasn't deployed yet and adapted the script accordingly.

Scope Creep (35 instances)

What happened: Agent recognized work was outside task boundaries as defined in the release description.

Recovery: Agent self-corrected without human intervention:

"This is out of scope for R1 - I'll focus on the AUTO_FIX item."

The Three Recovery Tiers

Tier     Mechanism                                Human Needed?
Tier 1   Build/test/fix loops                     No
Tier 2   Agent self-detection (scope, blockers)   No
Tier 3   Human correction                         Yes (~15%)

~85% of failures resolved without human intervention. The infrastructure handles it.

The Key Insight

Setting aside the 867 routine code-quality cycles, that leaves 614 failures over 97 days. Zero stopped progress entirely. The system is designed so failures are:

  • Detected early (review gates at each phase)
  • Contained (restricted agents, scoped tasks)
  • Recoverable (automatic rollback, human escalation path)

Build Your Own

The patterns are replicable.
Here's how to start.

The infrastructure that enables these patterns isn't magic. It's local tooling that encodes your workflow knowledge.

Component       Purpose                 Start Simple
Task List       Structured work queue   Markdown checklist or PLAN.md
Process Docs    Context priming         Markdown files in repo
Review Gates    Quality checkpoints     Shell script + curl + Gemini API
Knowledge Base  Capture decisions       Markdown file of rulings
"The workflow lives in the resources, not in the prompt."

The Core Insight: Structure Enables Templates

A 47-word prompt triggered 13 hours of coherent work because enforced structure enables template prompts.

The review script requires every task to have specific fields:

  • id, status, dependencies, acceptance_criteria

Because every task has those fields, you can write prompts that reference them:

  • "Implement task #X according to its acceptance_criteria"
  • "Check if dependencies are complete before starting"
  • "Verify the code does what the task says"

No enforced structure → no consistent fields → no template prompts → no automation.

The knowledge base adds learning: when the agent hits a decision point, it checks for prior guidance first. Your rulings accumulate—the system learns without the AI remembering.

Start Monday: Week 1

  1. Create a PLAN.md: Task checklist with acceptance criteria in markdown (example below)
  2. Write one process doc: "How we implement features" in markdown
  3. Reference both in prompts: "Read PLAN.md and process.md, then implement task #1"
  4. That's it. No database, no MCP—just markdown files.
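
For step 1, a minimal PLAN.md entry might look like the following (the format is a suggestion, not the study's actual file):

## Task 1: Add /health endpoint
- status: todo
- depends on: none
- acceptance criteria:
  - GET /health returns 200 with the build version in the body
  - go test ./internal/health/... passes

The acceptance criteria are things an agent can check, not judgments it has to make.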

Week 2: Add Review Gates

  1. Write a shell script: ~100 lines of bash + curl + jq (sketch below)
  2. Embed your standards: Put review criteria directly in the prompt
  3. Call Gemini API: Send Claude's output for independent review
  4. Run it manually: ./review-plan.sh PLAN.md
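
A minimal sketch of that script, assuming a Gemini API key in GEMINI_API_KEY; the model name, prompt wording, and overall shape are illustrative rather than the study's actual gate:

#!/usr/bin/env bash
# review-plan.sh - send a plan to Gemini for an independent review (sketch).
set -euo pipefail

PLAN_FILE="${1:-PLAN.md}"
MODEL="gemini-2.0-flash"   # swap in whichever Gemini model you use

PROMPT="You are a skeptical release reviewer. Check this plan for tasks without
verifiable acceptance criteria, missing dependencies, and scope gaps.
Answer PASS or FAIL, then list the issues.

$(cat "$PLAN_FILE")"

# jq builds the request body so the plan text is safely JSON-escaped.
jq -n --arg text "$PROMPT" '{contents: [{parts: [{text: $text}]}]}' |
  curl -s "https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent" \
    -H "Content-Type: application/json" \
    -H "x-goog-api-key: ${GEMINI_API_KEY}" \
    -d @- |
  jq -r '.candidates[0].content.parts[0].text'

A FAIL verdict with specific issues is the cheap, manual version of the review_plan output shown earlier.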

Week 3: Build the Loop

  1. Add status tracking: todo → in_progress → review → done
  2. Add dependency checks: Can't start until dependencies are done (see the sketch below)
  3. Create orchestration prompt: "Check task database, spawn agents for ready tasks"
  4. Test a multi-task run: Queue 3-5 tasks, let it execute
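
For the dependency check in step 2, a small jq filter is enough if tasks live in a flat tasks.json (the file layout and field names here are assumptions for the sketch):

# List tasks that are ready to work on: status "todo" and every dependency "done".
# Assumes tasks.json like: [{"id": 2, "status": "todo", "deps": [1]}, ...]
jq -r '
  . as $all
  | .[]
  | select(.status == "todo")
  | select( (([.deps[]]) - [$all[] | select(.status == "done") | .id]) == [] )
  | .id
' tasks.json

The orchestration prompt in step 3 can then say "work only on the ready tasks this check reports."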

Week 4: Add the Knowledge Base

  1. Create a rulings file: decisions.md with categories (code style, architecture, etc.)
  2. Update review prompt: "If you need a decision, check knowledge base first" (sketch below)
  3. Record new decisions: When you answer a question, add it to the file
  4. Watch it compound: Fewer questions over time as guidance accumulates
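
The knowledge-base check in step 2 can start as a plain keyword search over decisions.md; this grep sketch is a stand-in for the semantic search described earlier, and the file layout is an assumption:

# Before escalating a question, look for a prior ruling on the topic.
TOPIC="structured logging conventions"

if grep -i -A 3 "$TOPIC" decisions.md; then
  echo "Prior ruling found above - apply it instead of asking again."
else
  echo "No ruling found - escalate to the human with three options."
fi

When you do answer a new question, append the ruling to decisions.md so the same search succeeds next time.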

The Evolution

The infrastructure in this study evolved over 5 months:

  • June 2025: Shell scripts + markdown files (~700 LOC total)
  • August 2025: MCP server in Go + SQLite (same philosophy, better tooling)
  • August–January: 525 commits, 2 → 60 tools, single → multi-model

You don't need 60 tools to start. You need PLAN.md and a shell script that calls Gemini.

What You're Actually Building

This isn't about AI replacing you. It's about building scaffolding that lets AI magnify your judgment:

  • Your decisions get encoded in process docs
  • Your standards get enforced by review gates
  • Your workflow gets executed at scale

The AI does the typing. You provide the expertise.

This Scales to Teams

This study shows one person's results, but the patterns are team infrastructure. Process docs, review gates, and knowledge bases don't belong to an individual—they belong to a codebase. Once they exist, every engineer on the team benefits:

  • Shared process docs mean any team member can delegate the same way
  • Shared review gates enforce team standards automatically
  • Shared knowledge base captures decisions once, applies them everywhere

The force multiplication compounds. One person with these patterns produced 543 autonomous hours. A team of five, with shared infrastructure, doesn't get 5× — they get the compounding effect of shared context, shared standards, and shared learning.

The Payoff

543 autonomous hours
165 shipped releases
~$500 monthly cost
1 person

Analyze Your Own Logs

The analysis tools used in this research are open source: github.com/mrothroc/claude-code-log-analyzer

Measure autonomous work hours, detect work arcs, and cluster prompts from your Claude Code logs at ~/.claude/projects/.