Mathematicians and Fruit

A modern LLM recently disproved an Erdős conjecture that specialists had been unable to crack despite decades of effort.¹ Mathematicians around the world sat in awe as their careers flashed before their eyes, wondering if their job might be the next one taken by AI. Interestingly, that same model still struggles when asked how many times the letter r appears in the word “strawberries”, sometimes telling you with full confidence that there are ‘two’ when there are three.² Models which have been RLHF’ed to death have learned to always say ‘three’. In that case, you can ask for the number of rs in ‘blueberry’, and it will continue to return ‘three’, which is incorrect because ‘blueberry’ has only two. Training with more parameters, more data, or more layers does not fix this. If anything, the failure mode tends to become worse, as the souped-up model simply wraps more elegant prose around the answer without actually changing its stance.³

The failure is structural rather than incidental: the LLM architecture has no means to perform the kind of computation the question requires. In the specific case of letter counting, the technical reason has to do with tokenization, since the model reads sub-word tokens rather than individual letters. That detail matters less than the broader pattern, though, because the fix here is simply to wrap the model in a harness that has access to a calculator. The real issue arises when the mismatch between the model’s skills and the demands of the real world cannot be solved by handing the agent a tool.
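To make the tool-harness point concrete, here is a minimal sketch of the kind of deterministic helper a harness might expose to the model. The function name `count_letter` is hypothetical and chosen only for illustration; it does not come from any particular agent framework.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    # The two words used in the essay's example.
    for word in ("strawberries", "blueberry"):
        print(word, count_letter(word, "r"))
```

The helper operates on characters, which is exactly the view of the word that a token-based model does not have. That is why a question that trips up the model is trivial once the tool is in the loop.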

These systems were not trained to think for themselves in any rigorous sense. The apparent reasoning we observe is largely an emergent property that mimics reasoning rather than performing it.⁴ Shojaee and colleagues showed that Claude 3.7, DeepSeek-R1, and o3-mini all dropped to zero accuracy on Tower of Hanoi once the puzzle exceeded roughly eight disks, and that the models actually spent fewer reasoning tokens as the problems grew harder, even with budget remaining.⁴ On the one hand, the fact that this works at all is genuinely magical. On the other hand, the magic itself is what should worry us, since we have no settled account of why it works or when it will break on any given query. The system that is confidently wrong when counting letters is also the system being sold as the foundation of AGI. Given the billions of real dollars being poured in, don’t such outsized claims deserve a bit more scrutiny?
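For a sense of scale on the Tower of Hanoi result, the puzzle has a well-known recursive solution that fits in a few lines, and the minimum number of moves for n disks is 2^n − 1. The sketch below is the textbook recursion, not the evaluation code used by Shojaee and colleagues; the function name `hanoi` and the peg labels are illustrative choices.

```python
from typing import List, Tuple

Move = Tuple[str, str]


def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> List[Move]:
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [(source, target)]                   # move the largest disk to the target
        + hanoi(n - 1, spare, target, source)  # bring the smaller disks back on top
    )


if __name__ == "__main__":
    for disks in (3, 8, 10):
        moves = hanoi(disks)
        # The optimal solution always has 2^n - 1 moves.
        assert len(moves) == 2**disks - 1
        print(disks, len(moves))
```

At eight disks the optimal solution is 255 moves long. The procedure never changes; only the length of the output does.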

The Jagged Frontier of AI

At first glance, the growth of AI looks phenomenal. Frontier models have effectively saturated MMLU and SWE-bench, with the leaderboards bunched together in the high eighties and low nineties, while GPQA still offers some discrimination between models at the top.⁵ ARC-AGI-2 scores climbed from 0% at the March 2025 launch to past 85% on frontier systems by April 2026, surpassing the threshold François Chollet originally proposed when he introduced the benchmark.⁶ And let’s not forget METR’s benchmark, which found that the length of task a model can complete on its own has been doubling roughly every seven months over the last six years.⁷ Taken together, these results seem to portend an AGI future that is right around the corner.
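To see what a seven-month doubling time implies over six years, here is a back-of-the-envelope sketch. It assumes a perfectly smooth exponential, which is a simplification of my own rather than METR’s fitted model, and the function name `horizon_growth` is hypothetical.

```python
def horizon_growth(months_elapsed: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth in task horizon under a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    months = 6 * 12  # the six-year window mentioned above
    print(f"doublings: {months / 7.0:.1f}")
    print(f"growth factor: about {horizon_growth(months):,.0f}x")
```

Six years is a little over ten doublings, which works out to growth on the order of a thousandfold. That is the headline number the paragraph above is reacting to.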

Upon inspection, though, we find widespread benchmark contamination, with GPT-4 exposed to roughly 4.7 million samples across 263 benchmarks during its first year of release alone.⁸ When models have seen the answers during training, leaderboards measure memorization more than generalization. ARC-AGI tasks are about moving around in a grid world, which bears no resemblance to the work agents actually need to do in production. ARC-AGI-3 continues this toy-task trend.⁹ The caveat with the METR result is that the time horizon is measured at fifty percent reliability. Can you imagine deploying any app or service which only does the right thing 50% of the time?! At eighty percent reliability the time horizon is much shorter, and the doubling curve is considerably less flattering. A separate randomized trial from METR also found that a group of test subjects consistently predicted speed-ups of around 20% with AI tooling, when in fact their work had slowed down by about that amount! This cuts against the narrative, so it didn’t make quite the splash that the other results did.¹⁰
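One way to feel why the reliability threshold matters is to chain several tasks together, as any real workflow does. The sketch below assumes each task succeeds independently with the same probability. That is an assumption of mine for illustration, not something METR measures, and `chain_success` is a hypothetical name.

```python
def chain_success(per_task_reliability: float, tasks: int) -> float:
    """Probability that every task in a chain succeeds, assuming independence."""
    if not 0.0 <= per_task_reliability <= 1.0:
        raise ValueError("per_task_reliability must be between 0 and 1")
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    return per_task_reliability ** tasks


if __name__ == "__main__":
    for reliability in (0.5, 0.8):
        chance = chain_success(reliability, tasks=5)
        print(f"{reliability:.0%} per task, 5 tasks in a row: {chance:.1%}")
```

Under that assumption, five tasks in a row at fifty percent reliability all succeed only about 3% of the time, and even eighty percent per task gets you to roughly one in three.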

AI capability is sharply uneven across tasks that look superficially similar. The picture worth holding in mind is a map rather than a single number on a leaderboard. Some regions sit above sea level as islands of competence, where the model performs at expert human level or better. Other regions sit underwater, where the same model fails at tasks a child could solve. The coastline between the two is irregular and impossible to predict from first principles, which is the unevenness that the term ‘jagged frontier’ conveys. Islands surfacing in seemingly random order break the conventional wisdom that harder problems get solved only after easier ones. At the moment, the islands get demoed during the sales call, but real user requests in production often land below sea level.

The amount of money flowing into AI data centers and other infrastructure is enough to rebuild the entire US interstate highway system, and these bets largely rest on the assumption that the underwater regions will surface into islands in just a few years. Companies are cutting jobs with explicit reference to AI as the cause and to the spending it has unlocked.¹¹ MIT’s State of AI in Business report tracked hundreds of enterprise deployments and found that 95% of Gen AI pilots delivered no measurable profit-and-loss impact, against an enterprise AI spend that the same report estimated at $30 to $40 billion.¹² Real money is being spent and real jobs cut for a return that the most rigorous measurement available cannot yet find.

Are You Thinking What I’m Thinking?

Researchers tracking the structural limits of current systems point to two architectural capabilities that LLMs lack and that scaling alone will not deliver. The first is Theory of Mind, the ability to reason about the user’s beliefs and intentions as distinct from one’s own. The standard test is the Sally-Anne false belief task, where Sally puts a marble in a basket, leaves the room, and Anne moves it to a box; the question is where Sally will look when she returns. Children pass this task around age four. On ToMBench, an evaluation framework of eight tasks covering thirty-one social cognition abilities, even GPT-4 lags adult human performance by more than ten percentage points,¹³ and Ullman’s variations show that models which pass the classic formulation often fail when minor surface details are changed, which is the signature of pattern matching rather than the underlying capability.¹⁴
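The bookkeeping behind the Sally-Anne task is small enough to write down. The toy below keeps the true location of the marble separate from what each agent believes, and only updates the beliefs of agents who are in the room when it moves. It is an illustration of the concept with hypothetical names (`FalseBeliefWorld`, `where_will_look`), not how ToMBench or Ullman’s variations score models.

```python
from typing import Dict, Set


class FalseBeliefWorld:
    """Tracks where an object really is and where each agent believes it is."""

    def __init__(self, agents: Set[str], location: str) -> None:
        self.location = location
        self.present: Set[str] = set(agents)
        self.beliefs: Dict[str, str] = {agent: location for agent in agents}

    def leave(self, agent: str) -> None:
        self.present.discard(agent)

    def enter(self, agent: str) -> None:
        # Walking back in does not reveal the hidden object, so beliefs are unchanged.
        self.present.add(agent)

    def move_object(self, new_location: str) -> None:
        self.location = new_location
        # Only agents who witness the move update their belief.
        for agent in self.present:
            self.beliefs[agent] = new_location

    def where_will_look(self, agent: str) -> str:
        return self.beliefs[agent]


if __name__ == "__main__":
    world = FalseBeliefWorld({"Sally", "Anne"}, "basket")
    world.leave("Sally")
    world.move_object("box")  # Anne moves the marble while Sally is away
    world.enter("Sally")
    print("Sally will look in the", world.where_will_look("Sally"))
    print("The marble is actually in the", world.location)
```

Passing the task means answering from Sally’s belief (the basket) rather than from the true state of the world (the box). The explicit split between `location` and `beliefs` is the thing the essay argues models do not reliably maintain.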

The second is tracking a belief about the self, as distinct from a belief about the user. Today this concept goes by the name ‘world model’: a persistent internal representation of how the environment behaves. This state is distinct from the context window, and the agent can query it, update it, and use it to plan future actions. Yann LeCun has spent several years arguing that autoregressive token prediction cannot get there, captured in his line “we need world models, not word predictors”.¹⁵ His current efforts at AMI Labs are a bet on the claim that current architectures cannot get to AGI no matter how much compute they consume.¹⁶

Catastrophic forgetting, where a model loses what it learned earlier when it is trained on something new, wouldn’t be such a big deal if we could throw everything into the context window, but the measurements here aren’t reassuring. Studies of effective context length find that frontier models often use only a fraction of their advertised window before performance collapses, with accuracy dropping by 30% or more on long prompts. Effective context length on hard tasks is even worse.¹⁷ While the current hype cycle is on agent while-loops and token-maxxing,¹⁸ the system being sold remains the same autoregressive predictor underneath, with scaffolding standing in for a shiny, new architecture. An LLM that fails to understand users, fails to understand itself, and fails to learn will not generalize any better than a sufficiently advanced autocomplete.

The Show Must Go On

Model labs, hyperscalers, and analysts all share an incentive to sustain the hype and keep the AGI clock running. The roughly $600 billion in hyperscaler capex committed for 2026 needs a story that justifies it, and the story has to be told until either the spending pays off or the music stops.¹⁹ Goldman Sachs’s own research shows the infrastructure-stock basket up around 44% YTD against a 9% rise in 2-year forward earnings estimates for the same group. When stock prices are rising nearly 5x faster than the earnings beneath them, you can be sure that the balloon is full of hot air.

The dot-com era promised that the internet would change everything by the turn of the century, and while this did come to fruition, the timeline was closer to 15 years than 15 months. Tesla has loudly proclaimed that full self-driving (FSD) autonomy is coming in the next few years … every year since 2015. The reality in 2026 is geofenced robotaxis operating in a handful of cities.²⁰ Meta renamed the company around the metaverse, sold ~20 million headsets, and then watched the whole business line fade into the sunset, with the term barely registering in recent earnings calls.²¹ Let’s not even get started on Web3 and crypto. The movie script is the same each time: a flashy demo, a proposal to scale to infinity, capex needed for growth, and then the discovery that the narrow case shown in the demo is the only case that has been solved.

Like the World Wide Web before it, Gen AI will produce real and durable gains. ChatGPT, though, is not so much the iPhone moment as it is the Crackberry phenomenon.²² We are nearing the peak of the hype, which means the next step in the cycle is the inevitable trough of disillusionment. By staying grounded about what these systems can actually do, we can come out ahead through the turbulent period when the bubble gives back some of its altitude on the way to the plateau where the real value finally gets built.

¹ AI Just Solved an 80-Year-Old ‘Erdős Problem,’ and Mathematicians Are Amazed, Scientific American
² The ‘strawberry’ problem: How to overcome AI’s limitations, VentureBeat. The standard ‘strawberry’ has been fixed in 2026, but the example above uses ‘strawberries’ instead.
³ Especially on complex topics outside your own domain, the model’s answer is given with such confidence that you start to question your own conviction.
⁴ (Shojaee et al., 2025) The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Apple Machine Learning Research
⁵ State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed, BenchLM
⁶ (Chollet et al., 2026) ARC Prize 2025: Technical Report
⁷ (METR, 2025) Measuring AI Ability to Complete Long Tasks
⁸ (Balloccu et al., 2024) Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
⁹ ARC-AGI In 2026: Why Frontier Models Still Don’t Generalize, Adaline Labs
¹⁰ (METR, 2025) Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
¹¹ Tech layoffs have already passed 100,000 in 2026 as the industry cuts jobs to fund AI, TechSpot
¹² MIT report: 95% of generative AI pilots at companies are failing, Fortune
¹³ (Chen et al., 2024) ToMBench: Benchmarking Theory of Mind in Large Language Models
¹⁴ Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?
¹⁵ (LeCun, 2022) A Path Towards Autonomous Machine Intelligence
¹⁶ Yann LeCun’s AMI Raises $1BN Seed Round, Is the World Model Era Finally Here?, Futurum Group
¹⁷ Why Your LLM Only Uses 10-20% of Its Context Window (And How TITANS Fixes It), rewire.it
¹⁸ Stop ‘tokenmaxxing’ and deploy AI sensibly instead, Nature Machine Intelligence
¹⁹ Tracking Trillions: The Assumptions Shaping the Scale of the AI Build-Out, Goldman Sachs
²⁰ List of predictions for autonomous Tesla vehicles by Elon Musk, Wikipedia
²¹ Meta Has Sold Nearly 20 Million Quest Headsets, But Retention Struggles Remain, Road to VR
²² Crackberry: 2006 Word of the Year, CrackBerry.com. The term captured the addictive peak of BlackBerry adoption a year before the iPhone began the displacement that would erase BlackBerry’s market share over the next several years.