The Divide
One camp reports dramatic productivity gains from autonomous coding agents. The other reports that the agents cost more time than they save. Both are true. The difference is technique.
I'm one of those top performers: I get sustained productivity gains from autonomous coding agents. I worked with Claude Code to examine my actual logs and see how.
97 days. 543 autonomous hours. Here's the data.
14,926 prompts · 2,314 agent sessions · 165 shipped releases
Online forums have always been rife with arguments springing from strong opinions, and AI coding productivity is no exception. The conversations follow the same basic pattern: the same two camps, the same talking points, and no resolution.
Both perceptions are grounded in individual truth. High performers aren't lying. Skeptics aren't wrong. The gap exists because the two groups are working with very different techniques and infrastructure.
Those are big words, but the real proof is in the data. This is a data-driven, high-level view of how the work gets done. Examining actual session logs reveals the patterns behind typical work cycles: the setup, the tools called, and the guardrails that let the agents run without constant supervision.
Finally, it lays out the concrete steps to replicate this yourself.
Everything here comes from one developer's Claude Code logs:
| Metric | Value |
|---|---|
| Date range | Oct 2, 2025 – Jan 2026 (97 days) |
| Total prompts | 14,926 |
| Autonomous agent sessions | 2,314 |
| Autonomous hours | 543 |
| Concurrent workstreams | 6 |
| Shipped releases | 165 |
| Monthly cost | ~$500 |
The practitioner has 35 years of professional SaaS and software engineering experience. The processes encode engineering management best practices.
What the Data Shows
650 work arcs clustered into distinct types. 5% of arcs produce 48% of autonomous hours.
| Pattern | % Arcs | % Hours | Avg Duration |
|---|---|---|---|
| Release | 4.5% | 48% | 10.3 hours |
| Feature | 11.8% | 23% | 112 min |
| Build | 14.5% | 8% | 33 min |
| Review | 24.9% | 10% | 23 min |
| Interactive | 20.9% | 12% | 33 min |
| Quick | 22% | 2% | 5 min |
| Debug | 1.4% | 3% | 118 min |
The leverage is in the long arcs. The short ones enable them.
The logs show "arcs" of productivity: groups of related prompts with a natural beginning, middle, and end. The arcs naturally cluster into three tiers by autonomy level:
Steering: human-in-the-loop collaboration. No agents spawned.
This is where decisions happen and human input is most valuable. The human uses the LLM as a thought partner to explore architectural choices, examine debugging hypotheses, and define the scope of work.
Momentum: short autonomous bursts for routine tasks.
Work here is about breaking through barriers to keep work moving. The LLM is an assistant that runs tests, fixes linting, and deploys changes.
Value delivery: extended autonomous execution. This is where the output happens.
The distribution is not uniform. Each pattern has a recognizable trigger:
| Pattern | Trigger | What Happens |
|---|---|---|
| Release | "burn down tasks in Release X" | Orchestrator reads task graph, spawns waves of agents, monitors to completion |
| Feature | "implement X" (multi-task) | 2-5 agents work through related tasks over 1-4 hours |
| Build | "run tests" / "fix the build" | Iterative fix-test-fix cycles until green |
| Review | "review this code" | Human-guided review, AI executes checks |
| Interactive | Discussion, questions | Back-and-forth exploration, no agents |
| Quick | Single small task | Fast execution, minimal coordination |
| Debug | "investigate X" / "fix this bug" | Hypothesis testing, trace analysis, systematic investigation |
The short arcs (steering, momentum) create the conditions for long arcs (value delivery). You can't skip to release arcs—you need the planning, review, and debugging cycles to set them up.
The Prompt Split
Templates enable autonomy. But human judgment still drives the majority of interaction.
"The templates handle the routine. You handle the decisions."
These are the repeatable commands that trigger autonomous work:
| Pattern | Count | Example |
|---|---|---|
| Release planning | 165 | "Read the process docs, create Release X with tasks that have verifiable acceptance criteria" |
| Task delegation | 403 | "burn down tasks in Release X" |
| Confirmations | 376 | "Yes please" |
| Review requests | 324 | "run review_code on Release X" |
| Build/test checks | 358 | "Does it build? Do tests pass?" |
| Deploy commands | 215 | "commit and push" |
Note the ratio: 403 delegation prompts ÷ 165 releases = ~2.4x. Each release is planned once, but execution spans multiple sessions. Context fills up, you step away, you come back—each restart requires re-delegating. This is the natural rhythm: stable planning units, chunked execution.
Templates aren't just shortcuts—they work because enforced structure gives them consistent fields to reference:
The same prompt triggers the same workflow because the underlying data has the same structure.
The "release planning" template is the most powerful example. The process docs exposed as MCP resources define the release structure, how work is decomposed into tasks, how dependencies are captured, and what counts as a verifiable acceptance criterion.
When you say "read the process docs then create the release," the LLM loads all this into the context window. The result: 165 releases decomposed exactly the same way—iterative structure, proper task dependencies, verifiable acceptance criteria.
The docs aren't just documentation. They're the playbook that context priming injects into every planning session.
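To make "enforced structure" concrete, here is a minimal sketch of what such a task record could look like. The field names are illustrative assumptions, not the author's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative task record. Field names are assumptions, not the author's schema;
# the point is that every task carries the same machine-checkable fields.
@dataclass
class Task:
    task_id: int                          # stable ID a human can reference ("mark 482 as completed")
    release: str                          # e.g. "R15"
    title: str
    status: str                           # "todo", "needs_design", "design_ready", "in_progress", "done"
    depends_on: list[int] = field(default_factory=list)            # tasks that must finish first
    acceptance_criteria: list[str] = field(default_factory=list)   # verifiable, checked by review gates

    def is_unblocked(self, completed: set[int]) -> bool:
        """A task can be claimed only once every dependency is complete."""
        return all(dep in completed for dep in self.depends_on)
```

Because every task carries the same fields, a template prompt like "burn down tasks in Release X" always has something concrete to resolve against.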
The "noise" in clustering analysis—prompts that don't match templates—represents human steering:
| Category | Examples |
|---|---|
| Specific instructions | "fix the test", "fix the high priority ones" |
| Context questions | "What is in R6?", "Is it in the openapi spec?" |
| Bug investigation | "I see this error, where is the host defined" |
| Architecture decisions | "For the MCP, everything goes via the BFF" |
| Progress tracking | "Please mark 482 as completed" |
| Clarifications | "It's not claude desktop its claude code" |
The 58% "adaptive" prompts aren't failure to templatize—they're the human-in-the-loop providing direction, context, and judgment.
Templates create a foundation for consistent execution. Steering ensures the execution produces value.
The Release Cycle
"Kicked it off. Went to bed. Woke up to a deployable release."
Please review the docs/prompts exposed as resources by the project MCP to understand the process then spawn appropriate restricted agents in the foreground to burn down the tasks in [Release v2.5].
47 words. But this prompt only works because it is built on a foundation that ensures consistency. The "docs/prompts" define how to examine the release, how to determine the needed agents, and how to spawn "restricted" agents.
The docs also give the guideline for the agent prompts themselves. They are given a specific workflow to follow that constrains their work to the task requirements.
The orchestrator's first action was to read the exposed resources: the process docs, workflow definitions, and project standards. This primes the context window with exactly the same knowledge every time.
Why this matters: The AI doesn't "remember" from last time. Every session starts fresh. Consistent results come from consistent context, not from learning. By loading the same resources first, every run operates from the same baseline.
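As a minimal sketch of that priming step, assuming the process docs were plain files on disk rather than MCP resources and using hypothetical paths, the idea is simply to load the same documents, in the same order, before any work starts:

```python
from pathlib import Path

# Minimal sketch of context priming: load the same process docs, in the same order,
# at the start of every session. The paths are hypothetical placeholders.
PROCESS_DOCS = [
    "docs/process/release-workflow.md",
    "docs/process/agent-guidelines.md",
    "docs/process/coding-standards.md",
]

def primed_prompt(assignment: str) -> str:
    """Prepend the standard process docs so every run starts from the same baseline."""
    docs = "\n\n".join(Path(p).read_text() for p in PROCESS_DOCS)
    return f"{docs}\n\n---\n\nYour assignment:\n{assignment}"
```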
With context primed, the orchestrator analyzed the work:
"Current v2.5 Status: 60 total tasks. 40 tasks need design (must go through design → review before implementation). 7 tasks in todo (can be claimed directly)."
Then executed in waves:
| Time | Progress | Decision | Agents |
|---|---|---|---|
| +0h | 0% | Spawn design + frontend agents (unblock the chain) | 2 |
| +0.6h | 18% | Design done → spawn implementation agents | 3 |
| +11h | 58% | Core complete → spawn remaining work | 3 |
| +12.3h | 68% | Almost done → final frontend + tests | 2 |
| +12.9h | 100% | "v2.5 - COMPLETE!" | — |
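The wave behavior in the table reduces to a simple loop: find every task whose dependencies are satisfied, dispatch agents for that wave, wait, and re-plan. A rough sketch, with the agent spawning and the task store left as stand-ins rather than the author's actual orchestrator:

```python
import time

def run_release(tasks: dict[int, dict], spawn_agent, poll_seconds: int = 300) -> None:
    """Rough sketch of wave-based orchestration. Each task is a dict with "status"
    and "depends_on" keys; spawn_agent(task_id) is assumed to start one restricted
    agent and return, with agents updating task status in the shared store."""
    def done(task_id: int) -> bool:
        return tasks[task_id]["status"] == "done"

    while not all(done(tid) for tid in tasks):
        # A wave is every task that isn't done and whose dependencies all are.
        wave = [tid for tid, t in tasks.items()
                if not done(tid) and all(done(dep) for dep in t["depends_on"])]
        if not wave:
            raise RuntimeError("Nothing claimable: dependency cycle or external blocker")
        for tid in wave:
            spawn_agent(tid)                       # dispatch the whole wave
        while not all(done(tid) for tid in wave):  # wait for the wave, then re-plan
            time.sleep(poll_seconds)
```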
This wasn't a lucky one-off. Over 97 days, the same pattern ran 29 times:
Stable context. Stable input. Stable output.
The Discovery Cycle
46% of arcs are interactive—no agents spawned. This is where decisions happen.
"The release cycle only works if you've done discovery first."
LLMs have evolved into enormously powerful engines that can produce vast quantities of output. However, they need to be guided to produce output that is actually valuable. Without that guidance, they can produce a mountain of garbage just as easily as gold.
Interactive arcs are the back-and-forth collaboration where the human uses the LLM as a thought partner to define both the final goal and the best path to get there. This guidance is arguably the most important part of the process: it is the fuel that powers the autonomous arcs.
The snippet below shows how a typical session evolves. After a discussion of the goals and implementation approach, we check whether Claude is ready to implement:
Human: Do you have enough context in the conversation to understand the final destination of this implementation journey? Please describe it to me.
Claude: Based on what I have in context, I can describe the architectural foundation but I'm missing the complete product vision.
What I understand: [Claude correctly describes the architecture]
This is the human checking what understanding actually exists in the context. Claude is clear that it doesn't know enough to complete the task. Claude wants to be helpful, so if we launched it on an implementation arc at this point it would try its best, but we would have no guarantee that the final product would be useful.
However, if we ensure it has what it needs before launching an autonomous arc, the product will be what we expect. This is manual context priming.
Other work in this phase includes research, prototyping, and scoping discussions that shape what the eventual release will contain.
Every major deliverable follows this sequence:
| Phase | Mode | What Happens |
|---|---|---|
| 1. Exploration | Interactive | Discover what to build. Back-and-forth discussion, research, prototyping. |
| 2. Planning | Guided | Create tasks, capture dependencies, structure the release. AI drafts according to a defined approach. Run review_plan then work collaboratively to refine the plan. |
| 3. Implementation | Autonomous | Orchestrator reads the plan and executes. This is the "13 hours" part. |
| 4. Review | Autonomous loop | Run review_code, fix issues, repeat until clean. |
These are the 58% "adaptive" prompts that can't be templated:
"What's the simplest way to add event sourcing here?" "I'm seeing timeouts on the cloud function. Where should I start investigating?" "For v2.5, we need trend analysis. What data do we already have that we could use?"
These conversations help the user refine their ideas at the outset, bringing clarity to both what they want and how to get it. They also start laying the foundation in the context: the LLM sees both the final request and the reasoning the user followed, allowing it to follow that lead and fill in the details as it works.
The planning phase isn't just "write tasks and go." The review_plan tool validates release plans to ensure they follow the established process. It also checks for correctness before implementation begins. Here's an example of what it found on Release R15 (Size-Aware Timeout API):
Verdict: FAIL
"The core data aggregation logic in BigQuery (Task 433) is fundamentally flawed as it requires [specific data redacted] but lacks any mechanism to access that data within the data warehouse. Without a [redacted data source] in BigQuery, the 'Size-Aware' classification is impossible to implement."
This was a critical architectural blocker—the plan looked complete but couldn't actually work. Claude revised the plan to add data ingestion before re-running review_plan.
Severity: Critical
"A critical contradiction exists in Task 461 regarding the data ingestion strategy (inline constants vs. BigQuery table) which must be resolved to align with dependent tasks."
A logical inconsistency—two tasks made incompatible assumptions about how data would be stored.
"The primary operational risks involve the manual maintenance of the BigQuery reference table (Task 461) and the potential for deployment timeouts if BigQuery is unavailable during cache hydration."
This couldn't be "fixed" by code—it required an operational policy decision about who maintains the reference table and how often. Human input was needed. In this case, a significant architectural issue had to be addressed at the system level, resulting in a dramatically revised plan.
These examples show the importance of collaboration: some issues could be automatically fixed, some required guidance on the correct approach, and some required full systems thinking.
review_plan ran 7 times on this release. Each iteration found issues, Claude revised, and re-ran until the plan passed. This iterative refinement happens before any implementation begins.
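That refinement cycle is mechanical enough to sketch. Assuming a review_plan call that returns a verdict plus findings, and a revise step driven by Claude (both stand-ins for the real tooling), the loop is just:

```python
def refine_until_pass(plan: str, review_plan, revise, max_rounds: int = 10) -> str:
    """Sketch of the plan-refinement loop: review, revise, re-review until PASS.
    review_plan(plan) -> (verdict, findings) and revise(plan, findings) -> plan
    are stand-ins for the real tools described in the text."""
    for round_number in range(1, max_rounds + 1):
        verdict, findings = review_plan(plan)
        print(f"review_plan round {round_number}: {verdict}")
        if verdict == "PASS":
            return plan
        plan = revise(plan, findings)   # Claude revises; escalations go to the human
    raise RuntimeError("Plan still failing after max_rounds reviews")
```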
The autonomous implementation cycle reads from a database of tasks with statuses, explicit dependencies, and verifiable acceptance criteria. All of that gets created during discovery and planning. The autonomous execution is the return on that up-front investment.
Safe Autonomy
The counterintuitive truth: more structure creates more autonomy. Review gates catch errors at $1, not $100.
The pattern: Claude proposes → Gemini validates → Fix or proceed
We've learned to watch the LLM work because we've seen it go off track and create thousands of lines of code that are entirely the wrong thing. But when we add guardrails to the harness, the LLM becomes self-correcting: as it drifts off course, it gets nudged back.
While this comes with an up-front cost, it saves time and money in the end.
The 2,974 quality checks ensure that not only is the LLM productive, but that it is correct.
Agentic Reviews, Not Static Prompts
Each review tool is an agent with database access. It queries the task DB for acceptance criteria, pulls release details, and examines the actual files. It doesn't just check "is this good code"—it verifies "does this code do what the task said it should do."
This only works because enforced structure guarantees every task HAS acceptance_criteria. No structure → nothing to verify against.
| Gate | When | What It Verifies | Calls |
|---|---|---|---|
| review_plan | Before implementation | Release scope aligns with project goals, no gaps | 420 |
| review_design | After design, before code | Design satisfies task acceptance criteria and is not over-engineered | 1,687 |
| review_code | After implementation | Code implements what the task says + quality checks | 867 |
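The real review tools are agents with database access, as described above. A stripped-down sketch of the core idea is to pull the task's own acceptance criteria, pull the diff, and ask the second model for a verdict. The SQLite schema and the ask_gemini helper here are hypothetical stand-ins:

```python
import sqlite3
import subprocess

def review_code(db_path: str, task_id: int, ask_gemini) -> str:
    """Sketch of an agentic review gate: check the diff against the task's own
    acceptance criteria, not just generic code quality. The SQLite schema and the
    ask_gemini(prompt) helper are hypothetical stand-ins."""
    con = sqlite3.connect(db_path)
    criteria = [row[0] for row in con.execute(
        "SELECT criterion FROM acceptance_criteria WHERE task_id = ?", (task_id,))]
    diff = subprocess.run(["git", "diff", "main...HEAD"],
                          capture_output=True, text=True, check=True).stdout
    prompt = (
        "You are reviewing code for a specific task. Verify that the diff satisfies "
        "every acceptance criterion below, then check general quality. Start your "
        "reply with PASS or FAIL, followed by findings.\n\n"
        "Acceptance criteria:\n" + "\n".join(f"- {c}" for c in criteria) +
        "\n\nDiff:\n" + diff
    )
    return ask_gemini(prompt)
```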
Claude typically acts as the orchestrator and implementer. These agentic review tools, however, don't use Claude; they use Gemini. The key insight: Claude proposes, Gemini validates.
Please run the review_code tool on the artifacts from Release R17 and fix any issues it reports. Repeat until nothing valid is reported, even suggestions. If anything needs input from me, first check the knowledge base to see if I already provided guidance, but if not stop and tell me what I need to decide with three options, then record my decisions in the knowledge base.
The key: intelligent triage. The reviewer doesn't dump everything on you:
| Issue Type | Action | Human Needed? |
|---|---|---|
| Obviously fixable | Auto-fix immediately | No |
| Needs judgment | Semantic search knowledge base | Maybe |
| New decision needed | Escalate with 3 options | Yes (once) |
Example query: "file storage links widget UUID lookup pattern, structured logging conventions" → finds relevant prior decisions or escalates with structured options.
The benefit: the human doesn't spend time researching each escalated issue. The options presented by the AI usually carry enough context, and one of them is usually right, so the user can simply choose from a menu and move on. The rare cases that demand full attention are exactly the ones where that attention makes a difference.
Human time goes to decisions that matter. Decisions get recorded—so the same question auto-resolves next time.
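A sketch of that triage logic, with the auto-fix, knowledge-base search, and escalation steps left as stand-ins for the real tooling:

```python
def triage(findings, auto_fix, search_kb, escalate) -> None:
    """Sketch of review-finding triage. Each finding is assumed to carry a
    machine-readable "kind" and "summary"; the three handlers are stand-ins
    for the real tooling described in the text."""
    for finding in findings:
        if finding["kind"] == "mechanical":            # lint, missing test, obvious bug
            auto_fix(finding)                          # no human needed
            continue
        ruling = search_kb(finding["summary"])         # search prior decisions first
        if ruling is not None:
            auto_fix(finding, guidance=ruling)         # a past ruling already answers this
        else:
            escalate(finding, options=3)               # stop and ask, with three options
```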
Those 2,974 review calls cost roughly $50-100 total over 97 days. The bugs they caught would have cost far more to fix in production.
Quality gates are cheap. Production bugs are expensive. The math is obvious once you run it.
The Human Role
The autonomy pyramid: small fraction of arcs, majority of hours.
"Most of your time is steering. Most of the output comes from value delivery."
| Tier | % of Arcs | % of Hours | Your Role |
|---|---|---|---|
| Steering | 46% | 22% | Make decisions, review, course-correct |
| Momentum | 37% | 10% | Kick off routine tasks, verify completion |
| Value Delivery | 18% | 68% | Set up the work, walk away |
Your time goes to high-judgment activities: architecture decisions, release planning, plan and code review, and course corrections.
AI handles the execution-heavy work: implementation, test runs, build fixes, review loops, and deployment.
The pyramid shows why this scales: human attention concentrates in the small, frequent arcs that need judgment, while most of the hours accumulate in the long autonomous arcs those decisions set up.
The fluency of the AI makes it easy to think you should interact with it like you would a junior engineer. The best output comes from realizing that it is a machine that produces code and executes tasks, and you should treat it as such.
You don't teach it to code. Instead, you create the scaffolding to load the machine, then let it run.
The Economics
That works out to roughly $2.95 per hour of AI execution. The leverage is in the volume.
"Tokens aren't a cost to minimize. They're work getting done."
| Metric | Value | Notes |
|---|---|---|
| Study period | 97 days | Oct 2025 – Jan 2026 |
| Total cost | ~$1,600 | Two Claude Max+ subscriptions, Gemini API calls |
| Monthly average | ~$500 | Varies by activity level |
| Autonomous hours | 543 | Agent execution time |
| Cost per hour | ~$2.95 | ~$1,600 / 543 hours |
Many practitioners try to minimize token usage. This is backwards.
Every token spent on review_code is a bug caught early. Every token spent on test generation is coverage you didn't write manually.
| Activity | % of Tokens | Value |
|---|---|---|
| Implementation | ~60% | Code that ships |
| Review/validation | ~20% | Bugs caught early |
| Exploration | ~15% | Decisions made |
| Overhead | ~5% | Coordination, retries |
The 165 shipped releases spanned six concurrent workstreams.
One person. 165 releases. $500/month.
Failure Modes
1,481 issues caught across 97 days. All contained by guardrails.
| Failure | Count | Recovery |
|---|---|---|
| Code quality issues | 867 | review_code fix loop |
| Blocked dependencies | 237 | Strategic sequencing |
| Wrong direction | 211 | Human correction |
| Design rejections | 70 | Revision → re-review |
| Stuck debugging | 61 | Escalate or pivot |
| Scope creep | 35 | Agent self-corrects |
"Guardrails don't prevent failures. They contain them."
Code quality issues (867)
What happened: The agents implemented the code and marked the tasks complete. The reviewer found quality issues—bugs, missing error handling, style violations, test gaps.
Recovery: review_code runs, agent fixes issues, re-runs until clean. The loop is automatic:
867 cycles × issues per cycle = thousands of quality fixes that never reached production.
Blocked dependencies (237)
What happened: Tasks couldn't proceed because prerequisites weren't complete. In R5.9, 40 of 60 tasks were blocked waiting for design approval.
Recovery: Orchestrator recognized the pattern and changed strategy—spawned design agents first to unblock the chain. Progress moved in waves: 0% → 18% → 58% → 68% → 100%.
Wrong direction (211)
What happened: Agent focused on the wrong signal, wrong file, or wrong approach. Example from logs: investigating local code when the real issue was deployment state.
Recovery: Human provides correction with context. Agent acknowledges and refocuses immediately. Average recovery: one prompt.
"You're right - I was looking at the wrong signal entirely."
Design rejections (70)
What happened: review_design returned NEEDS_REVISION—incomplete edge cases, architecture misalignment, missing acceptance criteria.
Recovery: Automatic status transition: design_ready → needs_design. Agent revises, re-stores, re-reviews. 100% eventually passed.
Notable insight: The workflow specifically prevents the implementing LLM from transitioning the task to "in progress" until the reviewer approves the design.
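A sketch of that guard as a tiny state machine. The status names appear in the text; the transition table itself is an illustrative assumption about how the workflow could be encoded:

```python
# Illustrative status guard. "in_progress" is only reachable once the design review
# has approved the design; the implementing agent cannot skip the gate.
ALLOWED_TRANSITIONS = {
    ("todo", "needs_design"),
    ("todo", "in_progress"),             # small tasks that need no design can be claimed directly
    ("needs_design", "design_ready"),    # design stored, awaiting review
    ("design_ready", "needs_design"),    # review_design returned NEEDS_REVISION
    ("design_ready", "in_progress"),     # review_design approved
    ("in_progress", "done"),
}

def transition(task: dict, new_status: str, design_approved: bool = False) -> None:
    """Refuse any move the workflow doesn't allow."""
    move = (task["status"], new_status)
    if move not in ALLOWED_TRANSITIONS:
        raise ValueError(f"Illegal transition {move}")
    if move == ("design_ready", "in_progress") and not design_approved:
        raise ValueError("Design must pass review_design before implementation starts")
    task["status"] = new_status
```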
Stuck debugging (61)
What happened: Agent encountered persistent errors that wouldn't resolve. Same fix attempted repeatedly.
Recovery: Escalation to a different model, or the human supplies the missing context. The 6h 42m agent hit this—it discovered the feature wasn't deployed yet and adapted its script accordingly.
Scope creep (35)
What happened: Agent recognized work was outside the task boundaries defined in the release description.
Recovery: Agent self-corrected without human intervention:
"This is out of scope for R1 - I'll focus on the AUTO_FIX item."
| Tier | Mechanism | Human Needed? |
|---|---|---|
| Tier 1 | Build/test/fix loops | No |
| Tier 2 | Agent self-detection (scope, blockers) | No |
| Tier 3 | Human correction | Yes (~15%) |
~85% of failures resolved without human intervention. The infrastructure handles it.
Setting aside the routine code-quality loop, that leaves 614 failures over 97 days. Zero stopped progress entirely. The system is designed so failures are contained, not prevented.
Build Your Own
The infrastructure that enables these patterns isn't magic. It's local tooling that encodes your workflow knowledge.
| Component | Purpose | Start Simple |
|---|---|---|
| Task List | Structured work queue | Markdown checklist or PLAN.md |
| Process Docs | Context priming | Markdown files in repo |
| Review Gates | Quality checkpoints | Shell script + curl + Gemini API |
| Knowledge Base | Capture decisions | Markdown file of rulings |
"The workflow lives in the resources, not in the prompt."
A 47-word prompt triggered 13 hours of coherent work because enforced structure enables template prompts.
The review script requires every task to have specific fields: an ID, a status, explicit dependencies, and verifiable acceptance criteria.
Because every task has those fields, you can write prompts that reference them: "burn down tasks in Release X" works because every task in the release carries the same structure the prompt relies on.
No enforced structure → no consistent fields → no template prompts → no automation.
The knowledge base adds learning: when the agent hits a decision point, it checks for prior guidance first. Your rulings accumulate—the system learns without the AI remembering.
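Starting simple, the knowledge base can literally be a markdown file of rulings that gets checked before anything is escalated. A naive keyword-match sketch follows; the file name and format are assumptions, and a real setup might use semantic search instead:

```python
from pathlib import Path

def prior_ruling(question: str, kb_path: str = "docs/decisions.md") -> str | None:
    """Sketch of the "check the knowledge base first" step: naive keyword overlap
    against a markdown file of past rulings, one "## heading" section per decision."""
    sections = Path(kb_path).read_text().split("\n## ")
    keywords = {word.lower() for word in question.split() if len(word) > 3}
    for section in sections:
        text = section.lower()
        if sum(1 for word in keywords if word in text) >= 2:
            return section.strip()    # enough overlap: reuse the prior decision
    return None                       # nothing on file, so escalate to the human
```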
./review-plan.sh PLAN.md
The infrastructure in this study evolved over five months.
You don't need 60 tools to start. You need PLAN.md and a shell script that calls Gemini.
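Here is what that minimal review gate could look like, sketched in Python rather than shell for consistency with the other examples. The Gemini REST endpoint and model name follow the public API at the time of writing and may need adjusting; the prompt wording is an assumption, not the author's script:

```python
import os
import sys
import requests

# Minimal plan-review gate. GEMINI_API_KEY is assumed to be set in the environment;
# the endpoint and model name may need updating for your account.
API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-1.5-pro:generateContent")

def review_plan(path: str) -> str:
    plan = open(path, encoding="utf-8").read()
    prompt = ("Review this release plan. Check that every task has verifiable "
              "acceptance criteria and explicit dependencies, and that the plan has "
              "no gaps or contradictions. Start your reply with PASS or FAIL.\n\n" + plan)
    resp = requests.post(
        API_URL,
        params={"key": os.environ["GEMINI_API_KEY"]},
        json={"contents": [{"parts": [{"text": prompt}]}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

if __name__ == "__main__":
    print(review_plan(sys.argv[1]))    # usage: python review_plan.py PLAN.md
```

Run it before letting any agent start implementing, and read the verdict the same way review_plan output is read in the examples above.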
This isn't about AI replacing you. It's about building scaffolding that lets AI magnify your judgment:
The AI does the typing. You provide the expertise.
This study shows one person's results, but the patterns are team infrastructure. Process docs, review gates, and knowledge bases don't belong to an individual—they belong to a codebase. Once they exist, every engineer on the team benefits from them.
The force multiplication compounds. One person with these patterns produced 543 autonomous hours. A team of five, with shared infrastructure, doesn't get 5× — they get the compounding effect of shared context, shared standards, and shared learning.
The analysis tools used in this research are open source: github.com/mrothroc/claude-code-log-analyzer
Measure autonomous work hours, detect work arcs, and cluster prompts from your Claude Code logs at ~/.claude/projects/.
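If you want a rough version of the arc detection without the full analyzer, the core idea is to sessionize prompt timestamps by gap. A sketch that treats the log field names ("type", "timestamp") as assumptions about the JSONL format:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def detect_arcs(log_dir: str = "~/.claude/projects", gap_minutes: int = 30):
    """Rough sketch of work-arc detection: collect user-prompt timestamps from the
    JSONL session logs and split them wherever the gap exceeds gap_minutes.
    The field names ("type", "timestamp") are assumptions about the log format."""
    stamps = []
    for path in Path(log_dir).expanduser().rglob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            if entry.get("type") == "user" and "timestamp" in entry:
                stamps.append(datetime.fromisoformat(
                    entry["timestamp"].replace("Z", "+00:00")))
    stamps.sort()
    arcs: list[list[datetime]] = []
    for ts in stamps:
        if arcs and ts - arcs[-1][-1] <= timedelta(minutes=gap_minutes):
            arcs[-1].append(ts)     # same arc: gap is small enough
        else:
            arcs.append([ts])       # gap too large: start a new arc
    return [(arc[0], arc[-1], len(arc)) for arc in arcs]   # (start, end, prompt count)
```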