Biz: Stop the Theater
Back in the day—before agile, or what I now call “post-agile”—software projects began with multi-page RFPs: 300-page tomes listing every imagined requirement for sprawling software suites.
We put in enormous effort to estimate these beasts. Looking back, it seems absurd, but that was the game—and the undisputed champion was my friend, let’s call him Tuomas1.
Tuomas was an estimation wizard, but not because he read specs. In fact, he never read them. We’d print the whole requirements document, hand him the stack, and he’d literally weigh it in his hands. He’d sniff it. He’d examine the binding.
Then, he’d toss out a number: “350 days,” or “720 days.”
After weeks of “rigorous” analysis, an entire team would often return with a figure eerily close to his.
The Heuristic Engine
It became an office joke, but the truth ran deeper. Tuomas was our most seasoned engineer2, having seen this particular brand of noise so many times that he no longer needed to read the details.
He was running a pattern-matching heuristic in his brain. By assessing the folder’s thickness, the client’s name, and the layout of the cover page, he could sniff out the uncertainty. He wasn’t estimating features; he was estimating complexity and chaos against a massive internal corpus of past projects.
He was right3. And it proved that most of our “estimation theater”—the spreadsheets, the breakdowns, the debates—was just that: theater.
Tuomas eventually left for a product company to escape the B2B madness. The tragedy was that his skill couldn’t be scaled. You can’t document “sniffing the binder.”
The Failure of RICE
Most of us aren’t Tuomas. So, we invent frameworks to compensate.
Recently, I’ve been using RICE (Reach, Impact, Confidence, Effort). On paper, it’s a great way to communicate prioritization. In practice, though, it falls apart.
Impact turns into a stand-in for the scorer’s personal product vision. Confidence becomes a measure of ego. Effort? That’s just vibes4.
Even with nifty tricks like Planning Poker and group estimates5, the noise persists. Different person, different day—different score.
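For reference, the RICE arithmetic itself is trivial; the hard part is the inputs. A minimal sketch in Python (the field names and scales are my own rendering of the standard Intercom-style formula):

```python
from dataclasses import dataclass

@dataclass
class RiceInput:
    reach: float       # people or events per quarter
    impact: float      # 0.25 (minimal) .. 3 (massive)
    confidence: float  # 0.0 .. 1.0
    effort: float      # person-months

def rice_score(item: RiceInput) -> float:
    # Standard RICE: (Reach * Impact * Confidence) / Effort.
    # Every input except reach is a judgment call, which is
    # exactly where the noise described above creeps in.
    return item.reach * item.impact * item.confidence / item.effort

print(rice_score(RiceInput(reach=800, impact=2, confidence=0.8, effort=4)))  # 320.0
```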
Enter the LLM: The Scalable Tuomas
Here’s the uncomfortable truth: if we want the consistency of Tuomas—without the burnout and cynicism—we need a new approach.
My argument is simple: replace the estimation meeting with a stochastic parrot6.
Feed a Large Language Model the same vision document, the same backlog, and the same customer data, and you get a strictly consistent lens. It doesn’t have bad days. It’s immune to office politics and the “HiPPO”7 effect.
Is it perfect? No. But its bias is at least predictable—meaning you can measure, calibrate, and correct for it. Human bias is chaos: different person, different day, different score. You can’t calibrate chaos.
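What "measure, calibrate, and correct" can mean concretely: if the bias is stable, a single correction factor fitted against whatever actuals you have goes a long way. A hypothetical sketch with made-up numbers:

```python
# Hypothetical: calibrate a consistent estimator against actuals.
estimated = [10, 40, 25, 60]   # model effort estimates (days), made up
actual    = [13, 50, 33, 78]   # what the work really took, made up

# Least-squares factor for actual ≈ k * estimated (no intercept).
k = sum(e * a for e, a in zip(estimated, actual)) / sum(e * e for e in estimated)

def calibrated(estimate: float) -> float:
    # A stable bias shows up as a stable k; chaotic scoring would not
    # fit any single factor, which is the point made above.
    return k * estimate

print(round(k, 2))  # consistent underestimation shows up as k > 1
```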
Why Not Proper ML?
Here’s the obvious objection: why not use actual data science? Monte Carlo simulations, reference class forecasting, AutoML on your historical data—if you have it, by all means, use it.
But most organizations don’t. What they have is a few years of Jira tickets riddled with inconsistent story points, half-filled priorities, and five different estimation scales—making the data unusable.
LLMs excel precisely because they don’t require clean, structured data. They can interpret messy specs, rambling prose, and ambiguous context, cross-referencing8 patterns from thousands of projects. For the 95% of teams that will never have pristine datasets, this is the only realistic option.
Tuomas didn’t rely on clean data either. He thrived on experience with chaos.
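For comparison, the "proper statistics" route the objection points at, such as a Monte Carlo roll-up of per-task ranges, is only a few lines when you actually have usable estimates to feed it (the task ranges below are invented):

```python
import random

random.seed(42)

# Per-task (low, likely, high) effort estimates in days -- made up.
tasks = [(2, 4, 9), (1, 3, 5), (5, 8, 20)]

def simulate(n: int = 10_000) -> list[float]:
    # Triangular distributions are the classic cheap choice here.
    return sorted(
        sum(random.triangular(lo, hi, mode) for lo, mode, hi in tasks)
        for _ in range(n)
    )

totals = simulate()
p50 = totals[len(totals) // 2]
p90 = totals[int(len(totals) * 0.9)]
print(f"P50 ~ {p50:.1f} days, P90 ~ {p90:.1f} days")
```

The catch is the input: without credible per-task ranges, the simulation is garbage in, garbage out, which is exactly the situation most backlogs are in.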
End the Estimation Theater
By using LLMs to generate RICE scores grounded in real context and product vision, we finally unlock the framework’s true value. No more meetings where we act out the ritual of data-driven decision-making while relying on gut instinct.
This could be our way past estimation theater—not by discarding structure, but by running it on a system that doesn’t get tired, play politics, or get anchored by the loudest voice in the room.
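One way this could look in practice. `call_llm` is a stand-in for whatever model client you use; the function, prompt, and canned reply are illustrative assumptions, not a real API:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your model client. Replaced here with a
    canned JSON reply so the sketch runs without a network call."""
    return '{"reach": 500, "impact": 1, "confidence": 0.7, "effort": 3}'

def score_ticket(vision: str, ticket: str) -> float:
    # Same vision, same prompt, same lens for every backlog item.
    prompt = (
        "Given this product vision:\n" + vision +
        "\nScore the following backlog item with RICE. "
        "Reply as JSON with keys reach, impact, confidence, effort.\n" + ticket
    )
    s = json.loads(call_llm(prompt))
    return s["reach"] * s["impact"] * s["confidence"] / s["effort"]

print(round(score_ticket("Self-serve onboarding for SMBs", "Add SSO login"), 2))
```

The scores still need human review; the win is that every item is scored through the same lens, so disagreements are about the inputs, not the scorer.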
The Graveyard of Frameworks
The bigger revelation is this: RICE isn’t the only casualty. Countless frameworks met the same fate—not because their logic was flawed, but because humans couldn’t execute them consistently.
Cost of Delay—a brilliant idea, rendered impossible by subjective scoring. WSJF9? Same story. Reference class forecasting? Useless without meticulous, maintained records.
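WSJF itself is just a ratio: Cost of Delay divided by job duration. The arithmetic was never the problem; the subjective inputs were. A sketch with an invented backlog:

```python
def wsjf(cost_of_delay: float, duration: float) -> float:
    # SAFe-style WSJF: Cost of Delay / job duration.
    # CoD is itself usually a sum of subjective scores
    # (user value + time criticality + risk reduction).
    return cost_of_delay / duration

# Made-up backlog: (name, CoD, duration).
backlog = [("checkout fix", 21, 3), ("dark mode", 8, 5), ("billing v2", 13, 8)]
ranked = sorted(backlog, key=lambda item: wsjf(item[1], item[2]), reverse=True)
print([name for name, *_ in ranked])
```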
There’s a graveyard of methodologies that were theoretically sound but practically doomed, all tripped up by a missing ingredient: scalable, unbiased human judgment.
If LLMs can unlock RICE, they can resurrect all of them. The frameworks weren’t the problem.
We simply weren’t the right runtime.
Footnotes
1. As a funny sidenote: there is an uncanny correlation in the Finnish tech scene between technical prowess and being named Tuomas. ↩
2. Or the second most seasoned. There is little point in a tit-for-tat over who was king of the hill. ↩
3. In a statistical sense. If his guesswork produced numbers that were close enough, with little or no effort, there was really no point in wasting a week's worth of engineering time to produce almost the same result. ↩
4. Yes, you can use reference class forecasting, as I have mostly done, or rely on experience and intuition. But I'd argue very few of us have the experience and intuition to make accurate predictions, never mind the case files required for reference class forecasting. ↩
5. OK, group estimates could work. But see my arguments on proper statistics below. ↩
6. Yes, this is a bit blunt, but it drives the point home. My intention is not to claim LLMs have intelligence, or that they can analyze data better than humans, but to show that in this specific case, LLMs can provide a more consistent and scalable approach. ↩
7. HiPPO stands for Highest Paid Person's Opinion: a decision made based on the opinion of the person with the highest position or authority, rather than on objective data or analysis. ↩
8. The model does not really cross-reference anything, but the statistical approximation of cross-referencing is enough for our use case here. ↩
9. WSJF, or Weighted Shortest Job First, is perhaps the most absurd example of over-engineering a patch for the "human factor." It is algorithmically neat, though. ↩