Writing

Measuring Engines with Horses

An OpenAI engineering post from February has resurfaced and is making quite the waves on social media. People are focusing on one number from the article: 3.5 pull requests per engineer per day.

Everyone is arguing about that number. Is it impressive? Is it reckless? Personally, I think the entire argument is in the wrong domain.

I spend a lot of time studying agent reliability and I run my own autonomous coding pipeline, so I dove into the article. Most of it is about the scaffolding: enforced architecture, feedback loops, the guardrails. Out of all of that, the only part that made it into the social media discourse was the PR number, because that's what everyone understands.

And that's the problem. PRs per engineer per day are just another way of counting lines of code as output. That was always a questionable metric, but at least it was connected to what engineers actually did: they wrote code.

But the typing isn't the work anymore. The coding agents are taking over that part, and are well on their way to doing all of it. So we're trying to apply yesterday's term to the new thing that replaced it.

It's the horsepower problem. Watt sold steam engines in horsepower because that's what people understood. No one had a unit for engines. That made sense then, but it really seems strange now, so far from the world of horse-driven machines.

We don't have a good unit for the part that actually matters: whether the agents' output is any good and if we can trust it to work as we expect. Until we do, the old term wins by default, and we're just going to be measuring engines with horses.

The framework behind this: Trust Topology →