The Divide

"AI makes me 10x productive"
"AI produces garbage"

Both are true. The difference is technique.

I'm a top performer who gets maximum productivity from autonomous coding agents. I worked with Claude Code to examine my actual logs and see how.

97 days. 543 autonomous hours. Here's the data.

14,926 prompts  ·  2,314 agent sessions  ·  165 shipped releases

The Online Discourse

Online forums have always been rife with arguments springing from strong opinions, and AI coding productivity is no exception. The conversations follow the same basic pattern:

  • Claim: "I shipped a feature in 2 hours that would have taken 2 days"
  • Response: "You're lying, or you don't know what good code looks like"
  • Outcome: No resolution, because neither side shows receipts

The conversation reaches a stalemate. Each side's opinion comes from a different lived experience, and each side is convinced by its own observations.

The Gap Is Real

Both perceptions are grounded in real experience. High performers aren't lying. Skeptics aren't wrong. The gap exists because:

  • High performers have built infrastructure that makes AI effective
  • Skeptics are using AI like autocomplete—raw prompting without scaffolding
  • The techniques that bridge this gap aren't documented

The different outcomes come down to leverage. LLMs deliver maximum value when they run in a harness that makes autonomous work consistent: a human sets up the tooling and guardrails, and the models magnify that person's productivity.

What This Presentation Offers

Those are big claims, but the real proof is in the data. This is a data-driven, high-level view of how the work gets done. Examining actual session logs reveals the patterns behind typical work cycles: the setup, the tools called, and the guardrails that allow the work to run without constant supervision.

Finally, it gives you the concrete steps to replicate this yourself.

About the Data

Everything here comes from one developer's Claude Code logs:

Metric                       Value
Date range                   Oct 2, 2025 – Jan 2026 (97 days)
Total prompts                14,926
Autonomous agent sessions    2,314
Autonomous hours             543
Concurrent workstreams       6
Shipped releases             165
Monthly cost                 ~$500

The practitioner has 35 years of professional SaaS and software engineering experience. The processes encode engineering management best practices.

What the Data Shows

Seven patterns.
One power law.

650 work arcs clustered into distinct types. 5% of arcs produce 48% of autonomous hours.

Pattern       % Arcs   % Hours   Avg Duration
Release       4.5%     48%       10.3 hours
Feature       11.8%    23%       112 min
Build         14.5%    8%        33 min
Review        24.9%    10%       23 min
Interactive   20.9%    12%       33 min
Quick         22%      2%        5 min
Debug         1.4%     3%        118 min

The leverage is in the long arcs. The short ones enable them.

The Three Tiers

The logs show "arcs" of productivity: groups of related prompts with a natural beginning, middle, and end. The arcs cluster into three tiers by autonomy level:

Tier 1: Steering (46% of arcs, 22% of hours)

Human-in-the-loop collaboration. No agents spawned.

  • Review: Human-driven code/design review sessions
  • Interactive: Direct conversation, Q&A, exploration

This is where decisions happen and human input is most valuable. The human uses the LLM as a thought partner to explore architectural choices, examine debugging hypotheses, and define the scope of work.

Tier 2: Momentum (37% of arcs, 10% of hours)

Short autonomous bursts for routine tasks.

  • Quick: Fast single tasks (avg 5 min, 2.7 agents)
  • Build: Test/build/deploy cycles (avg 33 min, 4.4 agents)

This tier is about clearing small obstacles to keep work moving. The LLM acts as an assistant that runs tests, fixes linting, and deploys changes.

Tier 3: Value Delivery (18% of arcs, 68% of hours)

Extended autonomous execution. This is where output happens.

  • Feature: Multi-task implementation (avg 112 min, 9.4 agents)
  • Release: Full release burn-down (avg 10.3 hours, 8.7 agents)
  • Debug: Deep investigation cycles (avg 118 min)

The Power Law

The distribution is not uniform:

5% of arcs (release)
48% of autonomous hours
29 release arcs total
299 hours from releases

Pattern Definitions

Pattern       Trigger                            What Happens
Release       "burn down tasks in Release X"     Orchestrator reads task graph, spawns waves of agents, monitors to completion
Feature       "implement X" (multi-task)         2-5 agents work through related tasks over 1-4 hours
Build         "run tests" / "fix the build"      Iterative fix-test-fix cycles until green
Review        "review this code"                 Human-guided review, AI executes checks
Interactive   Discussion, questions              Back-and-forth exploration, no agents
Quick         Single small task                  Fast execution, minimal coordination
Debug         "investigate X" / "fix this bug"   Hypothesis testing, trace analysis, systematic investigation

Key Insight

The short arcs (steering, momentum) create the conditions for long arcs (value delivery). You can't skip to release arcs—you need the planning, review, and debugging cycles to set them up.

The Prompt Split

42% templates.
58% steering.

Templates enable autonomy. But human judgment still drives the majority of interaction.

42% Structured
58% Adaptive
"The templates handle the routine. You handle the decisions."

What Gets Templated (42%)

These are the repeatable commands that trigger autonomous work:

Pattern            Count   Example
Release planning   165     "Read the process docs, create Release X with tasks that have verifiable acceptance criteria"
Task delegation    403     "burn down tasks in Release X"
Confirmations      376     "Yes please"
Review requests    324     "run review_code on Release X"
Build/test checks  358     "Does it build? Do tests pass?"
Deploy commands    215     "commit and push"

Note the ratio: 403 delegation prompts ÷ 165 releases = ~2.4x. Each release is planned once, but execution spans multiple sessions. Context fills up, you step away, you come back—each restart requires re-delegating. This is the natural rhythm: stable planning units, chunked execution.

Why Templates Work

Templates aren't just shortcuts—they work because enforced structure gives them consistent fields to reference:

  • "burn down tasks in Release X" → Release has tasks with status, dependencies
  • "run review_code on Release X" → Tasks have acceptance_criteria to verify against
  • "Does it build?" → Process doc defines what "build" means

The same prompt triggers the same workflow because the underlying data has the same structure.
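
As a concrete illustration: the study's task store is an MCP server backed by SQLite, but its schema isn't shown here, so the table and column names below are assumptions based on the fields named above. The point is that "burn down tasks in Release X" can be mechanical only because it resolves to a query over fields every task is guaranteed to have:

#!/usr/bin/env bash
# Hypothetical sketch: what "burn down tasks in Release R15" resolves to when
# every task row has id, status, dependencies, and acceptance_criteria.
# Schema names are illustrative, not the study's actual database.
RELEASE="R15"

sqlite3 tasks.db "
  SELECT t.id, t.acceptance_criteria
  FROM tasks t
  WHERE t.release = '$RELEASE'
    AND t.status = 'todo'
    AND NOT EXISTS (
      SELECT 1
      FROM dependencies d
      JOIN tasks dep ON dep.id = d.depends_on
      WHERE d.task_id = t.id
        AND dep.status != 'done'
    );"

If tasks lacked those fields, the short prompt would have nothing stable to resolve against.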

Document Loading Enables Consistent Decomposition

The "release planning" template is the most powerful example. The process docs exposed as MCP resources define:

  • Iterative approach: build a minimal feature set first, then iterate
  • Task progression: how work flows through design → implementation → review
  • Acceptance criteria format: objective verifiers that LLM agents can check

When you say "read the process docs then create the release," the LLM loads all this into the context window. The result: 165 releases decomposed exactly the same way—iterative structure, proper task dependencies, verifiable acceptance criteria.

The docs aren't just documentation. They're the playbook that context priming injects into every planning session.
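
As a small, hypothetical illustration of "objective verifiers" (this criterion is not taken from the study's docs), the difference is whether an agent can check the criterion mechanically rather than judge it:

# Vague criterion an agent can only guess at:
#   "The timeout classification should be reasonably fast."
# Verifiable criterion an agent can run and read the exit code of
# (package path is illustrative):
go test ./internal/timeout/... -run TestSizeAwareClassification -count=1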

What Stays Adaptive (58%)

The "noise" in clustering analysis—prompts that don't match templates—represents human steering:

Category                 Examples
Specific instructions    "fix the test", "fix the high priority ones"
Context questions        "What is in R6?", "Is it in the openapi spec?"
Bug investigation        "I see this error, where is the host defined"
Architecture decisions   "For the MCP, everything goes via the BFF"
Progress tracking        "Please mark 482 as completed"
Clarifications           "It's not claude desktop its claude code"

The Key Insight

The 58% "adaptive" prompts aren't failure to templatize—they're the human-in-the-loop providing:

  • Real-time steering of autonomous work
  • Context that templates can't capture
  • Business logic decisions
  • Error investigation guidance
  • Cross-session coordination

Templates create a foundation for consistent execution. Steering ensures the execution produces value.

The Release Cycle

One prompt.
13 hours later: shipped.

Analyze (40 min) → Design (40 min) → Implement (10+ hours) → Ship (2 hours)

10 agents spawned  ·  4 waves  ·  47 tasks completed  ·  66K tool calls
"Kicked it off. Went to bed. Woke up to a deployable release."

The Prompt

Please review the docs/prompts exposed as resources by the project MCP
to understand the process then spawn appropriate restricted agents
in the foreground to burn down the tasks in [Release v2.5].

47 words. But this prompt only works because it is built on a foundation that ensures consistency. The "docs/prompts" define how to examine the release, how to determine the needed agents, and how to spawn "restricted" agents.

The docs also provide the guidelines for the agent prompts themselves: each agent is given a specific workflow to follow that constrains its work to the task requirements.

Context Priming: The Critical First Step

The orchestrator's first action was to read the exposed resources: the process docs, workflow definitions, and project standards. This primes the context window with exactly the same knowledge every time.

Why this matters: The AI doesn't "remember" from last time. Every session starts fresh. Consistent results come from consistent context, not from learning. By loading the same resources first, every run operates from the same baseline.

Then: Analyze and Execute

With context primed, the orchestrator analyzed the work:

"Current v2.5 Status: 60 total tasks. 40 tasks need design (must go through design → review before implementation). 7 tasks in todo (can be claimed directly)."

Then executed in waves:

Time     Progress   Decision                                             Agents
+0h      0%         Spawn design + frontend agents (unblock the chain)   2
+0.6h    18%        Design done → spawn implementation agents            3
+11h     58%        Core complete → spawn remaining work                 3
+12.3h   68%        Almost done → final frontend + tests                 2
+12.9h   100%       "v2.5 - COMPLETE!"

What Actually Shipped

Backend (Go)
  • Domain state machine
  • New data model + repository
  • Core service with business logic
  • Pub/Sub event pipeline

Frontend + Infra
  • TypeScript interfaces + components
  • Terraform configs
  • Cloud Build pipeline
  • Integration tests

Not a Fluke: 29 Release Arcs

This wasn't a lucky one-off. Over 97 days, the same pattern ran 29 times:

  • Average duration: 10.3 hours
  • Range: 4.1 hours to 17.3 hours
  • Total: 299 of 543 autonomous hours (55%)
  • Same prompt template every time
  • Same resources loaded first every time

Stable context. Stable input. Stable output.

The Discovery Cycle

Before you can automate,
you have to explore.

46% of arcs spawn no agents: review and direct interaction. This is where decisions happen.

298 review arcs
136 interactive arcs
137 hours total
0 agents spawned
"The release cycle only works if you've done discovery first."

The Core Human-Provided Value

LLMs have evolved into enormously powerful engines that can produce vast quantities of artifacts. But they need guidance to produce output that is actually valuable; without it, they can produce a mountain of garbage just as easily as gold.

Interactive arcs are back-and-forth collaboration where the human uses the LLM as a thought partner to define both the final goal and the best path to get there. This guidance is arguably the most important contribution, because it is the fuel the autonomous work runs on.

This conversation snippet shows how a typical session evolves. After discussing the goals and implementation approach, we check whether Claude is ready to implement:

Human: Do you have enough context in the conversation to understand the final destination of this implementation journey? Please describe it to me.

Claude: Based on what I have in context, I can describe the architectural foundation but I'm missing the complete product vision.

What I understand: [Claude correctly describes the architecture]

This shows how the person checks what understanding exists in the context. Claude is clear that it doesn't know enough to complete the task. Claude wants to be helpful, so if we launched it on an implementation arc at this point it would try its best, but we would have no guarantee the final product would be useful.

However, if we ensure it has what it needs before launching an autonomous arc, the product will be what we expect. This is manual context priming.

Other examples of work done in this phase include:

  • Architectural exploration: "How should we structure this service?"
  • Research: "What's the best way to handle X in golang?"
  • Debugging hypotheses: "I think the bug is in the auth flow"
  • Scope definition: "What should v2.5 include?"
  • Trade-off analysis: "Should we use asynchronous events or direct calls?"

The Four Phases

Every major deliverable follows this sequence:

Phase              Mode              What Happens
1. Exploration     Interactive       Discover what to build. Back-and-forth discussion, research, prototyping.
2. Planning        Guided            Create tasks, capture dependencies, structure the release. AI drafts according to a defined approach. Run review_plan, then work collaboratively to refine the plan.
3. Implementation  Autonomous        Orchestrator reads the plan and executes. This is the "13 hours" part.
4. Review          Autonomous loop   Run review_code, fix issues, repeat until clean.

Exploration Prompts

These are the 58% "adaptive" prompts that can't be templated:

"What's the simplest way to add event sourcing here?"

"I'm seeing timeouts on the cloud function.
Where should I start investigating?"

"For v2.5, we need trend analysis. What data
do we already have that we could use?"

These conversations help the user refine their ideas at the outset, bringing clarity to both what they want and how to get it. They also start laying the foundation in the context: the LLM sees both the final request and the reasoning the user followed, which lets it continue along that line and fill in the details as it works.

review_plan in Action

The planning phase isn't just "write tasks and go." The review_plan tool validates release plans to ensure they follow the established process and checks them for correctness before implementation begins. Here's an example of what it found on Release R15 (Size-Aware Timeout API):

Legitimate Issues Found

Verdict: FAIL

"The core data aggregation logic in BigQuery (Task 433) is fundamentally flawed as it requires [specific data redacted] but lacks any mechanism to access that data within the data warehouse. Without a [redacted data source] in BigQuery, the 'Size-Aware' classification is impossible to implement."

This was a critical architectural blocker—the plan looked complete but couldn't actually work. Claude revised the plan to add data ingestion before re-running review_plan.

Severity: Critical

"A critical contradiction exists in Task 461 regarding the data ingestion strategy (inline constants vs. BigQuery table) which must be resolved to align with dependent tasks."

A logical inconsistency—two tasks made incompatible assumptions about how data would be stored.

Escalated to Human Judgment

"The primary operational risks involve the manual maintenance of the BigQuery reference table (Task 461) and the potential for deployment timeouts if BigQuery is unavailable during cache hydration."

This couldn't be "fixed" by code—it required an operational policy decision about who maintains the reference table and how often. Human input was needed. In this case, the underlying issue was significant enough that it had to be addressed at the architecture level, resulting in a dramatically revised plan.

These examples show the importance of collaboration: some issues could be automatically fixed, some required guidance on the correct approach, and some required full systems thinking.

review_plan ran 7 times on this release. Each iteration found issues, Claude revised, and re-ran until the plan passed. This iterative refinement happens before any implementation begins.

Why Discovery Can't Be Skipped

The autonomous implementation cycle reads from a database with:

  • Task definitions with acceptance criteria
  • Dependency chains (what blocks what)
  • Scope boundaries (what's in this release vs. future)

All of that gets created during discovery and planning. The autonomous execution is the return on that up-front investment.

Safe Autonomy

Guardrails enable speed.
Not the other way around.

The counterintuitive truth: more structure creates more autonomy. Review gates catch errors at $1, not $100.

2,974 quality gate checks
1,687 design reviews
867 code reviews
420 plan reviews

The pattern: Claude proposes → Gemini validates → Fix or proceed

Why Guardrails Enable Autonomy

We've learned to watch the LLM work because we've seen it go off track and create thousands of lines of code that are entirely the wrong thing. But when we add guardrails to the harness, the LLM becomes self-correcting: as it drifts off course, it gets nudged back.

While this comes with an up-front cost, it saves time and money in the end.

  • Bad idea at design phase: ~$1 (a few API calls to reject it)
  • Bad idea in production: ~$100+ (debugging, rollback, customer impact)

The 2,974 quality checks ensure the LLM is not only productive but also correct.

Agentic Reviews, Not Static Prompts

Each review tool is an agent with database access. It queries the task DB for acceptance criteria, pulls release details, and examines the actual files. It doesn't just check "is this good code"—it verifies "does this code do what the task said it should do."

This only works because enforced structure guarantees every task HAS acceptance_criteria. No structure → nothing to verify against.
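
A rough sketch of that verification step, using an illustrative SQLite schema (the study's actual review tools are MCP agents with direct database access, not this script; field, file, and task names are assumptions): pull the task's acceptance criteria and the changed files, and hand both to the reviewer.

# Hypothetical sketch: assemble "what the task said" plus "what the code does"
# for an independent reviewer to compare.
TASK_ID=482

CRITERIA=$(sqlite3 tasks.db \
  "SELECT acceptance_criteria FROM tasks WHERE id = $TASK_ID;")
DIFF=$(git diff main...HEAD)

printf 'Acceptance criteria for task %s:\n%s\n\nImplementation diff:\n%s\n' \
  "$TASK_ID" "$CRITERIA" "$DIFF" > review_input.txt
# review_input.txt then goes to the Gemini-backed reviewer (see Build Your Own).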

The Three Review Gates

Gate           When                        What It Verifies                                                        Calls
review_plan    Before implementation       Release scope aligns with project goals, no gaps                        420
review_design  After design, before code   Design satisfies task acceptance criteria and is not over-engineered    1,687
review_code    After implementation        Code implements what the task says + quality checks                     867

Multi-Model Verification

Claude typically acts as the orchestrator and implementer. However, these agentic review tools do not use Claude; they use Gemini. The key insight: Claude proposes, Gemini validates.

  • Different model, different biases
  • Gemini reviews against documented standards (not just "does it look right")
  • Creates an adversarial dynamic that catches more errors
  • Both models get the same standards loaded first—consistent context enables consistent judgment

The Review Loop (With Triage)

Please run the review_code tool on the artifacts from Release R17
and fix any issues it reports. Repeat until nothing valid is
reported, even suggestions. If anything needs input from me,
first check the knowledge base to see if I already provided
guidance, but if not stop and tell me what I need to decide
with three options, then record my decisions in the knowledge base.

The key: intelligent triage. The reviewer doesn't dump everything on you:

Issue Type            Action                           Human Needed?
Obviously fixable     Auto-fix immediately             No
Needs judgment        Semantic search knowledge base   Maybe
New decision needed   Escalate with 3 options          Yes (once)

Example query: "file storage links widget UUID lookup pattern, structured logging conventions" → finds relevant prior decisions or escalates with structured options.

The benefit: the human doesn't spend time researching each escalated issue. The options the AI presents usually provide enough context, and one of them is often correct, so the user can simply choose from a menu and proceed. The rare cases that need full attention are exactly the ones where that attention makes a difference.

Human time goes to decisions that matter. Decisions get recorded—so the same question auto-resolves next time.

What Makes It Safe

  • Structured task database: Work is defined with acceptance criteria before execution
  • Dependency chains: Tasks can't start until prerequisites are complete
  • Review gates: Each phase validates before the next begins
  • Restricted agents: Sandboxed execution with clear scope
  • Progress tracking: Orchestrator monitors and adapts

The Economics

Those 2,974 review calls cost roughly $50-100 total over 97 days. The bugs they caught would have cost far more to fix in production.

Quality gates are cheap. Production bugs are expensive. The math is obvious once you run it.

The Human Role

You provide judgment.
AI provides execution.

The autonomy pyramid: small fraction of arcs, majority of hours.

Value Delivery   18% arcs → 68% hours
Momentum         37% arcs → 10% hours
Steering         46% arcs → 22% hours
"Most of your time is steering. Most of the output comes from value delivery."

The Three Tiers

Tier            % of Arcs   % of Hours   Your Role
Steering        46%         22%          Make decisions, review, course-correct
Momentum        37%         10%          Kick off routine tasks, verify completion
Value Delivery  18%         68%          Set up the work, walk away

What You Actually Do

Your time goes to high-judgment activities:

  • Architecture decisions: What to build, how to structure it
  • Scope definition: What's in, what's out, what's deferred
  • Quality judgment: Is this good enough? What's missing?
  • Priority calls: What matters most right now?
  • Error interpretation: What does this failure mean?

What AI Does

AI handles the execution-heavy work:

  • Implementation: Writing the actual code
  • Testing: Creating and running tests
  • Iteration: Fix-test-fix cycles until green
  • Documentation: Writing docs that match the code
  • Coordination: Managing task dependencies

The Leverage

The pyramid shows why this scales:

  • You spend 46% of arcs on steering (decisions)
  • But those decisions shape 68% of the autonomous hours
  • Your judgment gets multiplied, not replaced

Not a Junior Engineer

The fluency of the AI makes it easy to think you should interact with it like you would a junior engineer. The best output comes from realizing that it is a machine that produces code and executes tasks, and you should treat it as such.

It is:

  • An execution layer that follows structured plans
  • A verification system that catches its own errors
  • A force multiplier for your expertise

You don't teach it to code. Instead you create the scaffolding to load the machine, then let it run.

The Economics

$500/month.
543 autonomous hours.

Across the whole period, that's roughly $2.95 per hour of AI execution. The leverage is in the volume.

~$500 monthly cost
543 autonomous hours
165 shipped releases
~$2.95 per hour
"Tokens aren't a cost to minimize. They're work getting done."

The Math

Metric             Value      Notes
Study period       97 days    Oct 2025 – Jan 2026
Total cost         ~$1,600    Two Claude Max+ subscriptions, Gemini API calls
Monthly average    ~$500      Varies by activity level
Autonomous hours   543        Agent execution time
Cost per hour      ~$2.95     $1,600 / 543 hours

The Mindset Shift

Many practitioners try to minimize token usage. This is backwards.

  • Old thinking: "Tokens cost money, use fewer"
  • New thinking: "Tokens are work happening, use more"

Every token spent on review_code is a bug caught early. Every token spent on test generation is coverage you didn't write manually.

Where the Money Goes

Activity           % of Tokens   Value
Implementation     ~60%          Code that ships
Review/validation  ~20%          Bugs caught early
Exploration        ~15%          Decisions made
Overhead           ~5%           Coordination, retries

The Real ROI

The 165 shipped releases included:

  • Backend services, APIs, data pipelines
  • Frontend components, UI features
  • Infrastructure (Terraform, CI/CD)
  • Tests, documentation

One person. 165 releases. $500/month.

Failure Modes

What goes wrong.
How to recover.

1,481 issues caught across 97 days. All contained by guardrails.

Failure                Count   Recovery
Code quality issues    867     review_code fix loop
Blocked dependencies   237     Strategic sequencing
Wrong direction        211     Human correction
Design rejections      70      Revision → re-review
Stuck debugging        61      Escalate or pivot
Scope creep            35      Agent self-corrects
"Guardrails don't prevent failures. They contain them."

Code Quality Issues (867 review_code cycles)

What happened: The agents implemented the code and marked the tasks complete. The reviewer found quality issues—bugs, missing error handling, style violations, test gaps.

Recovery: review_code runs, agent fixes issues, re-runs until clean. The loop is automatic:

  1. review_code flags issues
  2. Agent auto-fixes obvious ones
  3. Escalates judgment calls to knowledge base
  4. Re-runs until nothing reported

867 cycles × multiple issues per cycle = thousands of quality fixes that never reached production.

Blocked Dependencies (237 instances)

What happened: Tasks couldn't proceed because prerequisites weren't complete. In R5.9, 40 of 60 tasks were blocked waiting for design approval.

Recovery: Orchestrator recognized the pattern and changed strategy—spawned design agents first to unblock the chain. Progress moved in waves: 0% → 18% → 58% → 68% → 100%.

Wrong Direction (211 instances)

What happened: Agent focused on wrong signal, wrong file, or wrong approach. Example from logs: investigating local code when the real issue was deployment state.

Recovery: Human provides correction with context. Agent acknowledges and refocuses immediately. Average recovery: one prompt.

"You're right - I was looking at the wrong signal entirely."

Design Rejections (70 instances)

What happened: review_design returned NEEDS_REVISION—incomplete edge cases, architecture misalignment, missing acceptance criteria.

Recovery: Automatic status transition: design_ready → needs_design. Agent revises, re-stores, re-reviews. 100% eventually passed.

Notable Insight: The workflow specifically prevents the implementing LLM from transitioning the task to "in progress" until the reviewer approves the design.

Stuck Debugging (61 instances)

What happened: Agent encountered persistent errors that wouldn't resolve. Same fix attempted repeatedly.

Recovery: Escalation to a different model, or the human provides missing context. The 6h 42m agent hit this—it discovered the feature wasn't deployed yet and adapted the script accordingly.

Scope Creep (35 instances)

What happened: Agent recognized work was outside task boundaries as defined in the release description.

Recovery: Agent self-corrected without human intervention:

"This is out of scope for R1 - I'll focus on the AUTO_FIX item."

The Three Recovery Tiers

Tier     Mechanism                                Human Needed?
Tier 1   Build/test/fix loops                     No
Tier 2   Agent self-detection (scope, blockers)   No
Tier 3   Human correction                         Yes (~15%)

~85% of failures resolved without human intervention. The infrastructure handles it.

The Key Insight

Setting aside the 867 routine code-quality cycles, that leaves 614 failures over 97 days. Zero stopped progress entirely. The system is designed so failures are:

  • Detected early (review gates at each phase)
  • Contained (restricted agents, scoped tasks)
  • Recoverable (automatic rollback, human escalation path)

Build Your Own

The patterns are replicable.
Here's how to start.

The infrastructure that enables these patterns isn't magic. It's local tooling that encodes your workflow knowledge.

Component       Purpose                 Start Simple
Task List       Structured work queue   Markdown checklist or PLAN.md
Process Docs    Context priming         Markdown files in repo
Review Gates    Quality checkpoints     Shell script + curl + Gemini API
Knowledge Base  Capture decisions       Markdown file of rulings
"The workflow lives in the resources, not in the prompt."

The Core Insight: Structure Enables Templates

A 47-word prompt triggered 13 hours of coherent work because enforced structure enables template prompts.

The review script requires every task to have specific fields:

  • id, status, dependencies, acceptance_criteria

Because every task has those fields, you can write prompts that reference them:

  • "Implement task #X according to its acceptance_criteria"
  • "Check if dependencies are complete before starting"
  • "Verify the code does what the task says"

No enforced structure → no consistent fields → no template prompts → no automation.

The knowledge base adds learning: when the agent hits a decision point, it checks for prior guidance first. Your rulings accumulate—the system learns without the AI remembering.

Start Monday: Week 1

  1. Create a PLAN.md: Task checklist with acceptance criteria in markdown (example below)
  2. Write one process doc: "How we implement features" in markdown
  3. Reference both in prompts: "Read PLAN.md and process.md, then implement task #1"
  4. That's it. No database, no MCP—just markdown files.
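
For step 1, a minimal PLAN.md entry might look like the following (the format is a suggestion, not the study's actual file):

## Task 1: Add /health endpoint
- status: todo
- depends on: none
- acceptance criteria:
  - GET /health returns 200 with the build version in the body
  - go test ./internal/health/... passes

The acceptance criteria are things an agent can check, not judgments it has to make.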

Week 2: Add Review Gates

  1. Write a shell script: ~100 lines of bash + curl + jq (sketch below)
  2. Embed your standards: Put review criteria directly in the prompt
  3. Call Gemini API: Send Claude's output for independent review
  4. Run it manually: ./review-plan.sh PLAN.md
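
A minimal sketch of that script, assuming a Gemini API key in GEMINI_API_KEY; the model name, prompt wording, and overall shape are illustrative rather than the study's actual gate:

#!/usr/bin/env bash
# review-plan.sh - send a plan to Gemini for an independent review (sketch).
set -euo pipefail

PLAN_FILE="${1:-PLAN.md}"
MODEL="gemini-2.0-flash"   # swap in whichever Gemini model you use

PROMPT="You are a skeptical release reviewer. Check this plan for tasks without
verifiable acceptance criteria, missing dependencies, and scope gaps.
Answer PASS or FAIL, then list the issues.

$(cat "$PLAN_FILE")"

# jq builds the request body so the plan text is safely JSON-escaped.
jq -n --arg text "$PROMPT" '{contents: [{parts: [{text: $text}]}]}' |
  curl -s "https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent" \
    -H "Content-Type: application/json" \
    -H "x-goog-api-key: ${GEMINI_API_KEY}" \
    -d @- |
  jq -r '.candidates[0].content.parts[0].text'

A FAIL verdict with specific issues is the cheap, manual version of the review_plan output shown earlier.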

Week 3: Build the Loop

  1. Add status tracking: todo → in_progress → review → done
  2. Add dependency checks: Can't start until dependencies are done (see the sketch below)
  3. Create orchestration prompt: "Check task database, spawn agents for ready tasks"
  4. Test a multi-task run: Queue 3-5 tasks, let it execute
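
For the dependency check in step 2, a small jq filter is enough if tasks live in a flat tasks.json (the file layout and field names here are assumptions for the sketch):

# List tasks that are ready to work on: status "todo" and every dependency "done".
# Assumes tasks.json like: [{"id": 2, "status": "todo", "deps": [1]}, ...]
jq -r '
  . as $all
  | .[]
  | select(.status == "todo")
  | select( (([.deps[]]) - [$all[] | select(.status == "done") | .id]) == [] )
  | .id
' tasks.json

The orchestration prompt in step 3 can then say "work only on the ready tasks this check reports."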

Week 4: Add the Knowledge Base

  1. Create a rulings file: decisions.md with categories (code style, architecture, etc.)
  2. Update review prompt: "If you need a decision, check knowledge base first" (sketch below)
  3. Record new decisions: When you answer a question, add it to the file
  4. Watch it compound: Fewer questions over time as guidance accumulates
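
The knowledge-base check in step 2 can start as a plain keyword search over decisions.md; this grep sketch is a stand-in for the semantic search described earlier, and the file layout is an assumption:

# Before escalating a question, look for a prior ruling on the topic.
TOPIC="structured logging conventions"

if grep -i -A 3 "$TOPIC" decisions.md; then
  echo "Prior ruling found above - apply it instead of asking again."
else
  echo "No ruling found - escalate to the human with three options."
fi

When you do answer a new question, append the ruling to decisions.md so the same search succeeds next time.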

The Evolution

The infrastructure in this study evolved over 5 months:

  • June 2025: Shell scripts + markdown files (~700 LOC total)
  • August 2025: MCP server in Go + SQLite (same philosophy, better tooling)
  • August–January: 525 commits, 2 → 60 tools, single → multi-model

You don't need 60 tools to start. You need PLAN.md and a shell script that calls Gemini.

What You're Actually Building

This isn't about AI replacing you. It's about building scaffolding that lets AI magnify your judgment:

  • Your decisions get encoded in process docs
  • Your standards get enforced by review gates
  • Your workflow gets executed at scale

The AI does the typing. You provide the expertise.

This Scales to Teams

This study shows one person's results, but the patterns are team infrastructure. Process docs, review gates, and knowledge bases don't belong to an individual—they belong to a codebase. Once they exist, every engineer on the team benefits:

  • Shared process docs mean any team member can delegate the same way
  • Shared review gates enforce team standards automatically
  • Shared knowledge base captures decisions once, applies them everywhere

The force multiplication compounds. One person with these patterns produced 543 autonomous hours. A team of five, with shared infrastructure, doesn't get 5× — they get the compounding effect of shared context, shared standards, and shared learning.

The Payoff

543 autonomous hours
165 shipped releases
~$500 monthly cost
1 person

Analyze Your Own Logs

The analysis tools used in this research are open source: github.com/mrothroc/claude-code-log-analyzer

Measure autonomous work hours, detect work arcs, and cluster prompts from your Claude Code logs at ~/.claude/projects/.