The AI Boom’s Multi-Billion Dollar Blind Spot

Everyone’s betting on AI getting smarter. The amazing thing is they can reason. We’re just at the beginning of the reasoning AI era. Smarter models, sharper intuition, superintelligence. I think we’ll get superintelligence, and I would guess that it will be a continuation of this trend that humanity has been on for 100 plus years.

Fueling explosive new demand for compute. The amount of computation necessary to do that reasoning process is 100 times more than what we used to do. And companies going all in, spending billions just to avoid falling behind. We now spend about $2 billion on AI. We have something like 600 actual in-use cases.

That number will probably double next year and maybe triple. But what if it’s all been over-promised and we’re heading straight into a bottomless money pit? New research is casting doubt on the hype. In a new paper, they’re throwing cold water on the growing trend of reasoning models and whether they’re really more accurate. Is the AI actually getting smarter or is it just an illusion?

I’m Deirdre Bosa with the Tech Check Take: AI’s reasoning blind spot. AI reasoning: it’s the new frontier. The next supposed leap towards superintelligence. Forget chatbots that just answer questions in a black box. These models, they think, they show their work, they break problems into steps.

And that’s a shift from predicting words to planning actions. The reason it’s hard for GPT-4o is it has to get it correct on the first try. It can’t check that it meets the constraints and then revise the poem.

Now let’s try the same poem with o1-preview, and we’ll see that, differing from GPT-4o, o1-preview starts thinking before giving the final answer. The idea is that the more it thinks, the smarter it gets: just like a human.

We know that thinking is oftentimes more than just one shot, and thinking requires us to maybe do multi-plans, multiple potential answers that we choose the best one from: just like when we’re thinking. We might reflect on the answer before we deliver the answer: reflection. We might take a problem and break it down into step by step by step: chain of thought.
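
To make that “multiple plans, pick the best” idea concrete, here is a minimal sketch of best-of-n sampling with a separate scoring pass, one common way the pattern is implemented. The generate_candidate and score_candidate functions are hypothetical stand-ins for illustration, not any vendor’s actual API:

```python
import random

def generate_candidate(problem: str, seed: int) -> str:
    """Stand-in for one model call; a real system would sample a full answer here."""
    random.seed(seed)
    return f"candidate answer #{seed} for: {problem}"

def score_candidate(problem: str, answer: str) -> float:
    """Stand-in for the 'reflection' step: a verifier or reward model that
    checks whether the candidate actually meets the constraints."""
    return random.random()  # placeholder score

def best_of_n(problem: str, n: int = 5) -> str:
    """Sample several candidate answers, score each one, and return the best:
    the 'multiple potential answers that we choose the best one from' loop."""
    candidates = [generate_candidate(problem, seed) for seed in range(n)]
    return max(candidates, key=lambda c: score_candidate(problem, c))

if __name__ == "__main__":
    print(best_of_n("write a six-line poem where every line ends in -ing"))
```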

OpenAI, Anthropic, Google, DeepSeek: they all jumped on this trend, model after model, drop after drop. o1, o1 Pro, o3, o3 Mini, Sonnet 4, Opus 4, R1, Gemini 2.0 Flash: each claiming exceptional reasoning capabilities, each promising to be the most powerful yet.

Except a string of research papers is calling that promise into question. The highest-profile one, from a team at Apple, is bluntly titled “The Illusion of Thinking,” and it concludes that once problems get hard enough, reasoning models stop working. Let’s break it down with a logic puzzle, as Apple does. The Towers of Hanoi: three rods, a stack of discs.

The goal is to move all the discs to another rod: big on bottom, small on top, in the fewest possible moves.

It’s something you might give a toddler to build cognitive skills, but it’s also a stress test for reasoning. In the simplest version of the game, just three discs, Anthropic’s model performs the same with or without reasoning. Add a few more discs and reasoning outperforms, but then comes the twist: after seven discs, performance collapses to zero accuracy. And not just for Anthropic’s model, but for DeepSeek’s and OpenAI’s as well.
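
One reason the puzzle gets hard so fast: the optimal solution for n discs takes 2^n − 1 moves, so seven discs already require 127 perfectly ordered moves and ten require 1,023. A minimal Python sketch of the classic recursive solution (illustrative only, not the evaluation harness Apple used) makes the blow-up easy to see:

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Optimal move list for n discs: shift n-1 discs to the spare rod,
    move the largest disc to the target, then restack the n-1 discs on top."""
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi(n - 1, spare, target, source))

# The optimal solution roughly doubles with every extra disc: 2**n - 1 moves.
for n in (3, 5, 7, 10):
    print(f"{n} discs -> {len(hanoi(n))} moves")  # 7, 31, 127, 1023
```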

The harder the task was supposed to get, the worse the models performed. Same thing happened with other puzzles: checkers, river crossing problems. These are basic logic games. So what’s the real takeaway here? Well, these models, they look like they’re thinking, but what they actually may be doing is pattern matching.

When the puzzle is familiar, when it’s something that they’ve seen in training, they do fine, but throw them something new, complex, untested, they fail.

What looks like intelligence might just be memorization at scale. There’s a lot of debate, at least in terms of how quickly things will be reached and how fast they’re actually progressing, certainly when it comes to the reasoning side. It’s not just Apple either. Salesforce calls it “jagged intelligence” and finds that there’s a significant gap between current LLM capabilities and real-world enterprise demand.

Anthropic, the maker of one of the most advanced reasoning models, raised red flags in a recent paper of its own titled “Reasoning Models Don’t Always Say What They Think.” And the Chinese research lab LEAP, well, it found that today’s AI training methods haven’t been able to elicit genuinely novel reasoning abilities. We can make it do really well on benchmarks. We can make it do really well on specific tasks, and that’s valuable: like, you know, we have a lot of agents who want to do specific tasks. I think the thing that’s not well understood, and that’s what some of those papers you allude to show, is that it doesn’t generalize.

So, yeah, while it might be really good at this task, it’s awful at very common-sense things that you and I would do in our sleep. And that’s, I think, a fundamental limitation of the reasoning models right now. In other words, these models aren’t just limited: they don’t generalize. That means they’re just learning how to perform on tests that they’ve seen but are unable to handle real-world tests. We know how to build one that’s really good at Towers of Hanoi, but then there’s a million problems like that, and this is the problem.

If it doesn’t generalize, you have to train them again and again and again. So I think we’re going into an era where we’re going to see much more diversity of AI models. That might work for now, specialized AI built for narrow jobs or built to beat benchmarks, but that is not the goal. The Holy Grail is superintelligence, a system that can reason, adapt, think beyond what it was trained on: a system that is smarter than us.

And on that front... I think the sort of artificial superintelligence is much farther away than we thought. That’s the key takeaway. I think we can still have a big impact with AI in, you know, everyday life, but I think the superintelligence as the thing that’s all-knowing and can do everything, that’s many, many more years out.

Probably, we need major breakthroughs that we don’t have yet to get there. The industry promised superintelligence. What we may have gotten instead are narrow benchmarks and shallow reasoning. So is the industry chasing the wrong kind of intelligence? And what would that mean for investors and the AI trade at large?

Because if reasoning models do work, they would in theory need way more compute, and that could extend the infrastructure boom that’s driving stocks like Nvidia. Jensen Huang himself has said that reasoning models will require mountains more compute than previous ones. The amount of computation we need at this point, as a result of agentic AI, as a result of reasoning, is easily a hundred times more than we thought we needed this time last year.

But if they don’t scale, well, that raises deeper questions about how far today’s AI can really go and whether enterprises are pouring billions of dollars into AI just to keep up with no guarantee of payoff. The AI industry is built on a simple idea: scale works.

The bigger the model, the more data it’s fed, the smarter it gets. It’s what experts call “the scaling law.” One of the properties of machine learning, of course, is that the larger the brain, the more data we can teach it, the smarter it becomes. We call it “the scaling law.” There’s every evidence that as we scale up the size of the models, the amount of training data, the effectiveness, the quality, the performance of the intelligence improves.
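
For context, the “scaling law” being invoked here is an empirical curve fit, not a law of nature. One widely cited form, from Hoffmann et al.’s 2022 “Chinchilla” paper and offered only as an illustration of the idea rather than the exact formula the speakers have in mind, models training loss as

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where N is the number of model parameters, D the number of training tokens, E an irreducible loss floor, and A, B, α, β fitted constants. As long as that fit keeps holding, growing N and D keeps pushing loss down, which is exactly the bet described above.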

But when that starts to break, when models stop improving, it shakes the foundation. AI hits a wall. And the last time that happened, around November of 2024, the industry entered a full-blown existential crisis. Debate around progress stalling and China catching up hit the public AI trade, with Nvidia falling into correction territory early in 2025.

And everyone weighed in, from OpenAI CEO Sam Altman saying “There is no wall,” to Anthropic CEO Dario Amodei and Nvidia’s Jensen Huang.

People call them “scaling laws.” That’s a misnomer like Moore’s Law is a misnomer. Moore’s law, scaling laws, they’re not laws of the universe: they’re empirical regularities. I am going to bet in favor of them continuing, but I’m not certain of that. This is where almost the entire world got it wrong.

As pre-training progress stalled, reasoning was supposed to be the escape hatch, a new kind of intelligence, harder to measure, but full of potential. If it panned out, it would justify the next wave of spending and keep the AI trade alive: chipmakers, hyperscalers, they rebounded on that narrative. We intend fully to commit ourselves deeply to making sure you all, as builders of these foundation models, have not only the best systems for training and inference, but the most compute so that you can keep pushing.

Our seventh-generation TPU, Ironwood, is the first designed to power thinking and inference at scale. In fact, it is the ultimate extreme computing problem, and it’s called “inference.”

But if that’s no longer a given, it could break this renewed momentum and once again make investors question their return on AI spend. Besides the massive investments in reasoning by major players like Google, OpenAI, and Anthropic, corporate America at large has started to bet big on it too. The number of businesses adopting AI has accelerated since the start of this year, hinging on the belief that the technology will truly transform and revolutionize their businesses.

Even if, as JPMorgan’s CEO Jamie Dimon says, “The benefit isn’t immediately clear.” AI has become table stakes for enterprises, but if the payoff doesn’t materialize, that whole premise could be rethought.

And that is why Apple’s white paper landed like cold water. Some viewed the red flags as Apple moving the goalposts, shifting the conversation because it’s playing catch-up. Apple’s putting out papers right now saying that LLMs and reasoning don’t really work, and what they said wasn’t entirely wrong about the challenges.

But having that come out and then having a WWDC like they had, and not really addressing the fact that intelligence has been a total flop, it more sounds like, “Oops, look over here, because we don’t know exactly what we’re doing while all these other MAG 7 companies are building humanoid robots, are building massive data centers in the future, are building the actual networking and GPUs.” Anthropic ended up releasing a response, another paper titled “The Illusion of the Illusion of Thinking,” taking issue with some of the technical methods Apple used to run those logic puzzles.

But the wave of researchers sounding the alarm, it’s hard to ignore. I think what Apple was doing was basically starting to set the narrative that there still is a long way to go for models to be intelligent.

And that could be the biggest curveball from all of this, pushing back the timeline to artificial general intelligence, or AGI. It’s a dream being expensively pursued by Sam Altman, SoftBank’s Masayoshi Son and Mark Zuckerberg. It has huge implications for what was once the AI partnership, widely seen as the most strategic and well-aligned in the industry: OpenAI and Microsoft.

Once OpenAI declares AGI, according to their agreement that was inked six years ago, the partnership ends, which means the definition of intelligence, and who gets to call it, could determine who controls the future of AI. Reasoning was supposed to be AI’s next great leap. Instead, it may just be the step that reminds us how far we still have to go.
