According to Apple researchers, the race to develop artificial general intelligence (AGI) still has a long way to go, as leading AI models continue to struggle with reasoning.
In a June paper titled “The Illusion of Thinking,” the researchers noted that recent generations of large language models (LLMs) such as OpenAI’s ChatGPT and Anthropic’s Claude have given rise to large reasoning models (LRMs), but that the core capabilities, scaling behavior, and limitations of these models remain poorly understood.
They pointed out that current assessments mainly focus on established benchmarks in math and coding, which prioritize the accuracy of final answers.
However, these evaluations fall short of revealing the true reasoning skills of the AI systems.
This research challenges the common belief that AGI is only a few years away.
Apple researchers evaluate AI models claiming to “think”
The researchers created various puzzle games to evaluate both “thinking” and “non-thinking” versions of Claude Sonnet, OpenAI’s o3-mini and o1, and DeepSeek-R1 and V3 chatbots, going beyond standard mathematical benchmarks.
They found that leading large reasoning models (LRMs) suffer a complete collapse in accuracy once problem complexity passes a certain threshold, struggle to generalize their reasoning, and lose their advantage over standard models as tasks grow more complex—contradicting expectations for AGI-level performance.
“We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.”
Verification of final answers and intermediate reasoning traces (top chart), and charts showing non-thinking models are more accurate at low complexity (bottom charts). Source: Apple Machine Learning Research
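Tower of Hanoi, one of the four puzzle environments in the study, illustrates what an “explicit algorithm” means here: the puzzle has a short recursive solution that scales to any number of disks, yet the researchers reported model accuracy collapsing as disk counts grew. A minimal Python sketch of that classic algorithm (peg names are illustrative, not taken from the paper):

```python
def hanoi(n: int, src: str, dst: str, aux: str) -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks from peg src to peg dst."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack the rest.
    return (hanoi(n - 1, src, aux, dst)
            + [(src, dst)]
            + hanoi(n - 1, aux, dst, src))

# The optimal solution always takes 2**n - 1 moves.
print(len(hanoi(10, "A", "C", "B")))  # 1023
```

A solver that simply executes this recursion never loses accuracy as `n` grows—which is exactly the contrast the paper draws with LRMs, whose reasoning degrades on larger instances of the same puzzle.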
Researchers say AI chatbots are overcomplicating things
They observed that the models displayed inconsistent and superficial reasoning, often finding the correct answer early in their reasoning traces but then overthinking and drifting toward incorrect conclusions.
The researchers concluded that LRMs replicate reasoning patterns without genuinely understanding or generalizing them, which falls short of achieving AGI-level reasoning.
“These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.”
Illustration of the four puzzle environments. Source: Apple
The pursuit of AGI
AGI is considered the ultimate goal of AI development—a stage at which machines can think, reason, and generalize across tasks at a human level.
In January, OpenAI CEO Sam Altman stated that the company is closer than ever to achieving AGI. “We are now confident we know how to build AGI as we have traditionally understood it,” he said.
Similarly, in November, Anthropic CEO Dario Amodei predicted that AGI would surpass human capabilities within the next year or two. “If you look at the pace at which these abilities are advancing, it suggests we could reach AGI by 2026 or 2027,” he said.