Generative AI models are powerful tools, but they share a common flaw: they hallucinate. From Google’s Gemini to Anthropic’s Claude to OpenAI’s GPT-4o, these models can be unreliable narrators, sometimes with amusing results but often with serious implications.
But how do these hallucinations vary across different models? And what factors influence the kinds of errors they make? A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the Allen Institute for AI (AI2) aimed to answer these questions by benchmarking hallucination rates across various AI models. Their findings paint a sobering picture: even the best AI models struggle to produce factually accurate text consistently, with only about 35% of their outputs being entirely free from hallucinations.
All AI Models Hallucinate, But Some More Than Others
It’s no secret that generative AI models like GPT-4o, Meta’s Llama 3, and Cohere’s Command R+ have become essential tools in various industries. However, the study found that none of these models performed exceptionally well across all domains, such as law, health, history, and geography. Even the most advanced models only managed to generate accurate information a fraction of the time.
Interestingly, models that hallucinated less often did so not because they were more knowledgeable but because they chose not to answer questions they might have answered incorrectly. As Wenting Zhao, a doctoral student at Cornell and a co-author of the study, puts it, “The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations.”
Benchmarking the Unreliable: A Tougher Test for AI
Previous studies on AI hallucinations often relied on questions with easily verifiable answers, typically sourced from Wikipedia. This approach, while useful, doesn’t reflect the more complex queries users often pose to AI models. To create a more challenging benchmark, Zhao and her team devised questions that couldn’t be answered using Wikipedia alone. These questions spanned topics as diverse as culture, finance, medicine, and pop culture: fields where information isn’t always neatly packaged or widely available.
In this more rigorous test, over a dozen popular AI models were evaluated, including newer releases like GPT-4o and Meta’s Llama 3. The results were telling: while OpenAI’s models, including GPT-4o and the older GPT-3.5, were among the least likely to hallucinate, they still struggled with questions outside their usual training data. This suggests that many AI models are heavily reliant on sources like Wikipedia and falter when required to source information elsewhere.
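To make the setup concrete, a hallucination benchmark of this kind boils down to a simple loop: pose each question to each model, classify the response as correct, hallucinated, or an abstention, and tally the rates. The sketch below is a hypothetical harness, not the study’s code; query_model and judge_response are stand-ins for whatever model API and fact-checking step an evaluator actually uses.

```python
from collections import Counter

# Hypothetical harness, not the study's actual evaluation code.
# query_model and judge_response stand in for a real model API and
# a real fact-checking step (automated or human).

def evaluate(model_name, questions, query_model, judge_response):
    """Tally correct answers, hallucinations, and abstentions for one model."""
    tally = Counter()
    for question in questions:
        answer = query_model(model_name, question)
        # judge_response returns one of: "correct", "hallucination", "abstention"
        tally[judge_response(question, answer)] += 1
    total = len(questions)
    return {label: count / total for label, count in tally.items()}
```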
The Persistent Problem of Hallucinations
Despite industry claims of reduced hallucination rates, the study’s results indicate that we haven’t seen significant improvements in this area. Models like GPT-4o and GPT-3.5 performed similarly, with only marginal differences in how often their answers were factually correct. Moreover, the study found that even models equipped to search the web, like Cohere’s Command R and Perplexity’s Sonar, struggled with non-Wikipedia-sourced questions, revealing a widespread issue that transcends model size and capability.
The difficulty AI models face in providing accurate information on certain topics—particularly those related to celebrities and finance—underscores a broader issue: the limitations of their training data. When tasked with answering questions in areas less represented in their training sets, the models often falter, generating less reliable outputs.
A Glimmer of Hope: Abstaining from Answers
One intriguing finding from the study was that abstaining from questions can reduce a model’s hallucination rate. For instance, Claude 3 Haiku answered only 72% of the questions it was asked, choosing to abstain from the rest. When factoring in these abstentions, Claude 3 Haiku emerged as the most factual model, in the sense that it produced the fewest incorrect answers.
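To see why abstentions change the ranking, consider how a response rate and a per-answer error rate combine into the share of all prompts that end up with a wrong answer. The sketch below uses made-up figures (only the 72% response rate echoes the article); it illustrates the arithmetic, not the study’s scoring method.

```python
# Illustrative arithmetic only; the error rates below are made up,
# not figures from the study.

def incorrect_answer_rate(response_rate: float, error_rate_when_answering: float) -> float:
    """Share of all prompts that receive a factually wrong answer.

    response_rate: fraction of questions the model chooses to answer.
    error_rate_when_answering: fraction of those answers that contain a hallucination.
    """
    return response_rate * error_rate_when_answering

# A model that answers every question but hallucinates in 40% of its answers...
always_answers = incorrect_answer_rate(1.00, 0.40)   # 0.40

# ...versus one that abstains on 28% of questions and hallucinates in 30% of its answers.
abstains_often = incorrect_answer_rate(0.72, 0.30)   # ~0.22

print(f"Always answers: {always_answers:.0%} of prompts get a wrong answer")
print(f"Abstains often: {abstains_often:.0%} of prompts get a wrong answer")
```

Measured this way, declining to answer trades coverage for fewer outright falsehoods.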
However, there’s a catch: users may be less inclined to use a model that frequently refuses to provide answers. As Zhao notes, while this approach might reduce hallucinations, it could also diminish the model’s usefulness. Instead, Zhao advocates for continued research into reducing hallucinations through methods such as human-in-the-loop fact-checking and enhanced citation during model development.
Looking Forward: The Need for Human Oversight and Better Fact-Checking Tools
Zhao emphasizes that while eliminating hallucinations entirely may not be feasible, they can be mitigated through more rigorous fact-checking and the involvement of human experts. “Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” she says. This approach could help to ensure that the outputs of AI models are more reliable and trustworthy.
As the AI industry continues to evolve, the findings from this study serve as a reminder that, despite real progress, we are still far from truly reliable AI. Reducing hallucinations will require a concerted effort from researchers, developers, and policymakers alike, combining better fact-checking tools, stronger citation practices, and sustained human oversight, before AI-generated content can be considered consistently trustworthy.