OpenAI’s newest AI models, o3 and o4-mini, are being hailed as cutting-edge advancements in artificial intelligence. Designed specifically for enhanced reasoning tasks, these models outperform their predecessors in many key areas, including math, coding, and logical problem-solving. However, beneath their impressive capabilities lies a persistent and growing problem—they hallucinate more than earlier models.
Hallucinations: A Growing Concern in AI
In the AI world, a “hallucination” refers to a model generating false or misleading information that it presents as factual. Despite years of improvements, hallucinations remain one of the most difficult challenges in AI development. Traditionally, each generation of models has seen gradual progress in reducing hallucinations. But the release of o3 and o4-mini appears to buck that trend.
Internal tests conducted by OpenAI show that both o3 and o4-mini hallucinate more frequently not only than their reasoning predecessors (o1, o1-mini, and o3-mini) but also than OpenAI's more general-purpose models such as GPT-4o. And perhaps more troubling: OpenAI isn't entirely sure why.
Stronger Performance, But At What Cost?
In its technical documentation, OpenAI acknowledges this unexpected setback, stating that “more research is needed” to fully understand why these advanced reasoning models are hallucinating more often. The models are undoubtedly stronger in certain domains. They excel at coding tasks, complex math problems, and intricate reasoning exercises. However, their tendency to generate more output overall may be leading to a rise in both correct and incorrect claims.
For example, on one of OpenAI's internal benchmarks, PersonQA, which evaluates a model's accuracy when answering questions about individuals, o3 hallucinated on 33% of questions. That's more than double the hallucination rates of o1 (16%) and o3-mini (14.8%). The smaller o4-mini fared even worse, hallucinating on 48% of questions in the same benchmark.
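OpenAI has not published the full grading pipeline behind PersonQA, but the headline metric itself is simple. The sketch below, using purely illustrative labels and sample data, shows one way a hallucination rate could be computed from graded answers; it is not OpenAI's actual evaluation code.

```python
# Minimal sketch: computing a hallucination rate from graded answers.
# The labels ("correct", "hallucinated") and the sample data are
# illustrative; this is not OpenAI's PersonQA grading code.
from collections import Counter

def hallucination_rate(graded_answers: list[str]) -> float:
    """Fraction of answers graded as containing fabricated claims."""
    counts = Counter(graded_answers)
    total = sum(counts.values())
    return counts["hallucinated"] / total if total else 0.0

# Example: 33 hallucinated answers out of 100 questions -> 33%,
# roughly the rate reported for o3 on PersonQA.
sample = ["hallucinated"] * 33 + ["correct"] * 67
print(f"{hallucination_rate(sample):.0%}")  # prints 33%
```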
When Models Make Things Up
Independent researchers have also begun scrutinizing these models, uncovering further evidence of their unpredictability. In some instances, o3 has fabricated detailed processes, such as claiming it ran code on a physical 2021 MacBook Pro “outside of ChatGPT,” then copied results into its response. While these kinds of imaginative claims might make AI sound smarter or more human-like, they’re entirely false. The model simply can’t perform such actions.
Some experts speculate that the issue might stem from the reinforcement learning techniques used in training the o-series models. These techniques are designed to enhance reasoning and decision-making but may unintentionally amplify behaviors typically mitigated during post-training fine-tuning.
Real-World Impacts of Hallucinations
While hallucinations may occasionally produce creative or novel ideas, their presence can be problematic—especially in contexts where factual accuracy is critical. In legal, medical, academic, and enterprise settings, even a single incorrect statement could lead to serious consequences.
One example of practical frustration comes from coding professionals who have adopted o3 into their workflows. While the model is generally more capable at coding tasks, it has been observed to generate broken or non-existent links, which disrupts productivity and erodes trust in its reliability.
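One pragmatic, if partial, defense is to check the model's links automatically before relying on them. The sketch below is a generic illustration of that idea rather than a tool any particular team is known to use: it pulls URLs out of a model's response and flags any that fail to resolve.

```python
# Minimal sketch: flag URLs in a model's output that do not resolve.
# A production tool would also handle redirects, rate limits, servers
# that reject HEAD requests, and paywalled pages.
import re
import urllib.error
import urllib.request

# A deliberately simple URL pattern; real extraction needs more care.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")

def find_dead_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return URLs from the model's output that could not be fetched."""
    dead = []
    for url in URL_PATTERN.findall(model_output):
        try:
            request = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(request, timeout=timeout)
        except (urllib.error.URLError, ValueError):
            dead.append(url)
    return dead

# Usage: flag fabricated references before they land in a code review.
# dead = find_dead_links(response_text)
```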
A Possible Solution? Web Search Integration
One potential way to reduce hallucinations is to give AI models access to real-time web search capabilities. OpenAI’s GPT-4o, when paired with web browsing, achieves 90% accuracy on SimpleQA, another internal benchmark that measures factual correctness. By grounding model outputs in verified, up-to-date information from the internet, hallucination rates could be reduced—though this comes with trade-offs such as exposing user prompts to third-party services.
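The general shape of such a grounding pipeline is worth spelling out. In the sketch below, web_search and ask_model are hypothetical stand-ins for whatever search API and model client an application actually uses; the point is the retrieve-then-answer structure, not any specific vendor integration.

```python
# Minimal sketch of grounding a model's answer in retrieved sources.
# `web_search` and `ask_model` are hypothetical callables supplied by
# the application; this is not OpenAI's browsing implementation.

def grounded_answer(question: str, web_search, ask_model, top_k: int = 3) -> str:
    # 1. Retrieve up-to-date snippets relevant to the question.
    snippets = web_search(question)[:top_k]

    # 2. Ask the model to answer using only the retrieved material,
    #    citing a source for each claim so a human can verify it.
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using only the sources below. "
        "Cite the source number for every factual claim, and say "
        "'not found in sources' if the answer is not supported.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)
```

The trade-off mentioned above applies directly here: every call to web_search sends the user's question, or some derivative of it, to an external service.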
The Road Ahead
As the AI industry pivots increasingly toward reasoning models, the hallucination problem becomes more pressing. These models show immense promise—offering higher-level cognitive abilities and stronger performance without necessarily increasing training data or compute costs. But if enhanced reasoning consistently brings with it a spike in hallucinations, developers will need to act fast to find a balance between intelligence and accuracy.
OpenAI has confirmed that addressing hallucinations remains a top priority, and ongoing research is underway to tackle the issue across all current and future models. Until then, users—especially in high-stakes environments—must tread carefully, verifying AI-generated content where possible.