Study suggests that even the best AI models hallucinate a bunch

Usama
Last updated: August 14, 2024 7:32 pm

Generative AI models are powerful tools, but they share a common flaw: they hallucinate. From Google’s Gemini to Anthropic’s Claude to the latest stealth release of OpenAI’s GPT-4o, these models can sometimes be unreliable narrators—occasionally with amusing results, but often with serious implications.

Contents
  • All AI Models Hallucinate, But Some More Than Others
  • Benchmarking the Unreliable: A Tougher Test for AI
  • The Persistent Problem of Hallucinations
  • A Glimmer of Hope: Abstaining from Answers
  • Looking Forward: The Need for Human Oversight and Better Fact-Checking Tools

But how do these hallucinations vary across different models? And what factors influence the kinds of errors they make? A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the Allen Institute for AI (AI2) aimed to answer these questions by benchmarking hallucination rates across various AI models. Their findings paint a sobering picture: even the best AI models struggle to produce factually accurate text consistently, with only about 35% of their outputs being entirely free from hallucinations.
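
To make that figure concrete, here is a minimal sketch of how a hallucination-free rate could be tallied once each claim in a response has been fact-checked. It is an illustration only, not the researchers’ evaluation code, and the function name and sample data are made up.

```python
from typing import List

def hallucination_free_rate(responses: List[List[bool]]) -> float:
    """Fraction of responses in which every factual claim checks out.

    Each response is a list of booleans, one per claim, where True
    means the claim was verified as factually correct.
    """
    if not responses:
        return 0.0
    clean = sum(1 for claims in responses if claims and all(claims))
    return clean / len(responses)

# Hypothetical fact-check results for three model outputs: the first is
# fully correct, the other two each contain at least one false claim.
example = [[True, True], [True, False], [False]]
print(f"{hallucination_free_rate(example):.0%} of outputs are hallucination-free")  # 33%
```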

All AI Models Hallucinate, But Some More Than Others

It’s no secret that generative AI models like GPT-4o, Meta’s Llama 3, and Cohere’s Command R+ have become essential tools in various industries. However, the study found that none of these models performed exceptionally well across all domains, such as law, health, history, and geography. Even the most advanced models only managed to generate accurate information a fraction of the time.

Interestingly, models that hallucinated less often did so not because they were more knowledgeable but because they chose not to answer questions they might have answered incorrectly. As Wenting Zhao, a doctoral student at Cornell and a co-author of the study, puts it, “The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations.”

Benchmarking the Unreliable: A Tougher Test for AI

Previous studies on AI hallucinations often relied on questions with easily verifiable answers, typically sourced from Wikipedia. This approach, while useful, doesn’t reflect the more complex queries users often pose to AI models. To create a more challenging benchmark, Zhao and her team devised questions that couldn’t be answered using Wikipedia alone. These questions spanned topics as diverse as culture, finance, medicine, and pop culture—fields where information isn’t always neatly packaged or widely available.

In this more rigorous test, over a dozen popular AI models were evaluated, including newer releases like GPT-4o and Meta’s Llama 3. The results were telling: while OpenAI’s models, including GPT-4o and the older GPT-3.5, were among the least likely to hallucinate, they still struggled with questions outside their usual training data. This suggests that many AI models are heavily reliant on sources like Wikipedia and falter when required to source information elsewhere.

The Persistent Problem of Hallucinations

Despite industry claims of reduced hallucination rates, the study’s results indicate that we haven’t seen significant improvements in this area. Models like GPT-4o and GPT-3.5 performed similarly, with only marginal differences in how often their answers were factually correct. Moreover, the study found that even models equipped to search the web, like Cohere’s Command R and Perplexity’s Sonar, struggled with non-Wikipedia-sourced questions, revealing a widespread issue that transcends model size and capability.

The difficulty AI models face in providing accurate information on certain topics—particularly those related to celebrities and finance—underscores a broader issue: the limitations of their training data. When tasked with answering questions in areas less represented in their training sets, the models often falter, generating less reliable outputs.

A Glimmer of Hope: Abstaining from Answers

One intriguing finding from the study was that letting models abstain from questions they are unsure about can meaningfully reduce hallucinations. For instance, Claude 3 Haiku answered only 72% of the questions it was asked, choosing to abstain from the rest. When those abstentions are factored in, Claude 3 Haiku emerged as the most factual model, in the sense that it produced the fewest incorrect answers.
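
To see why abstentions matter when ranking models, consider a rough illustration (the numbers below are hypothetical, not figures from the study): a model that answers every question can end up giving more wrong answers overall than one that declines a sizeable share of them.

```python
def factuality_summary(correct: int, incorrect: int, abstained: int) -> dict:
    """Summarize a model's behaviour over a fixed set of benchmark questions."""
    total = correct + incorrect + abstained
    answered = correct + incorrect
    return {
        "response_rate": answered / total,
        "accuracy_when_answering": correct / answered if answered else 0.0,
        "wrong_answers_per_question": incorrect / total,
    }

# Hypothetical numbers: model A answers everything, model B abstains often.
print(factuality_summary(correct=55, incorrect=45, abstained=0))
print(factuality_summary(correct=50, incorrect=22, abstained=28))
# Model B responds to only 72% of questions, yet produces far fewer wrong
# answers overall (0.22 vs 0.45 per question asked).
```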

However, there’s a catch: users may be less inclined to use a model that frequently refuses to provide answers. As Zhao notes, while this approach might reduce hallucinations, it could also diminish the model’s usefulness. Instead, Zhao advocates for continued research into reducing hallucinations through methods such as human-in-the-loop fact-checking and enhanced citation during model development.

Looking Forward: The Need for Human Oversight and Better Fact-Checking Tools

Zhao emphasizes that while eliminating hallucinations entirely may not be feasible, they can be mitigated through more rigorous fact-checking and the involvement of human experts. “Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models,” she says. This approach could help to ensure that the outputs of AI models are more reliable and trustworthy.

As the AI industry continues to evolve, the findings from this study serve as a crucial reminder that, despite the progress made, we are still far from achieving truly reliable AI. The journey to reducing hallucinations is ongoing, and it will require a concerted effort from researchers, developers, and policymakers alike. By focusing on these areas, we can hope to see significant improvements in the accuracy and reliability of AI-generated content, making these models more trustworthy tools in the future.
