Times Catalog

AI · News · OpenAI · Tech
Did xAI lie about Grok 3’s benchmarks?

By Usama · Last updated: February 23, 2025 5:10 pm

Debates over AI benchmarks — and the way companies present them — are erupting across the tech world, with accusations of misrepresentation flying between some of the industry’s biggest players. This week, a public spat emerged when an OpenAI employee accused Elon Musk’s AI startup, xAI, of publishing misleading performance results for its latest language model, Grok 3. In response, Igor Babushkin, one of xAI’s co-founders, defended the company’s claims, asserting that their data presentation was valid. But as with most controversies, the truth lies somewhere in the middle.

Contents
  • The Benchmark That Sparked the Debate
  • What Is “cons@64,” and Why Does It Matter?
  • xAI’s Defense — and the Counterarguments
  • The Missing Piece: Computational and Financial Costs
  • The Bigger Picture: Transparency and Accountability in AI

Let’s break it all down.

The Benchmark That Sparked the Debate

In a post on xAI’s official blog, the company showcased a graph illustrating Grok 3’s performance on AIME 2025 — a collection of complex mathematical questions derived from a prestigious invitational exam. While some researchers have criticized AIME as a flawed benchmark for assessing general intelligence, it remains widely used to evaluate an AI model’s mathematical reasoning abilities.

According to xAI’s graph, Grok 3 Reasoning Beta and Grok 3 mini Reasoning outperformed OpenAI’s best publicly available model, o3-mini-high, on AIME 2025. The implication? Grok 3 was smarter — at least when it came to solving challenging math problems. But it didn’t take long for OpenAI employees to challenge the validity of these claims on social media.

What Is “cons@64,” and Why Does It Matter?

The crux of the dispute revolves around something called “consensus@64” (or cons@64). This metric allows a model 64 attempts to solve each problem, ultimately selecting the most frequent response as its final answer. As you might imagine, this process can significantly boost a model’s score by ironing out random mistakes and capitalizing on the model’s probabilistic nature.

The catch? xAI’s graph didn’t include OpenAI’s o3-mini-high score at cons@64 — a notable omission. Without that data point, the graph made it seem like Grok 3 was outperforming OpenAI’s models across the board. But when comparing the models’ performance at “@1” (the score from the first attempt), both Grok 3 Reasoning Beta and Grok 3 mini Reasoning fell short of o3-mini-high. Grok 3 Reasoning Beta even trailed slightly behind OpenAI’s o1 model at its “medium” computing setting.
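The difference between the two scoring rules can be made concrete with a short sketch. This is a minimal illustration, not xAI’s or OpenAI’s actual evaluation code; the answer strings and sample counts are invented for the example.

```python
from collections import Counter

def cons_at_k(sampled_answers: list[str], correct: str) -> bool:
    """consensus@k: tally the model's k sampled answers and take the
    most frequent one as its final answer."""
    final, _count = Counter(sampled_answers).most_common(1)[0]
    return final == correct

def score_at_1(sampled_answers: list[str], correct: str) -> bool:
    """@1: only the first sampled answer counts."""
    return sampled_answers[0] == correct

# Hypothetical problem where "42" is the modal answer across 64
# samples but is not the first answer drawn.
samples = ["7"] + ["42"] * 40 + ["13"] * 23  # 64 samples total
print(cons_at_k(samples, "42"))   # True  -- majority vote succeeds
print(score_at_1(samples, "42"))  # False -- first attempt was wrong
```

The same set of samples yields a pass under cons@64 and a failure under @1, which is exactly why comparing one model’s cons@64 score against another model’s @1 score is misleading.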

Despite this, xAI boldly advertised Grok 3 as the “world’s smartest AI.”

xAI’s Defense — and the Counterarguments

Igor Babushkin, xAI’s co-founder, pushed back against the criticism, claiming that OpenAI has been guilty of similarly selective data presentations in the past — though usually when comparing its own internal models. In Babushkin’s view, the competitive nature of the AI industry means everyone is prone to a little bit of selective framing.

Meanwhile, an independent AI researcher attempted to settle the matter by compiling a more comprehensive graph that included nearly every model’s performance at cons@64. The result? A much more nuanced picture, showing that while Grok 3 did hold its own, the outright dominance suggested by xAI’s original graph was overstated.

“Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok,” wrote the researcher, who goes by the handle @teortaxesTex, in a viral post. “In reality, it’s DeepSeek propaganda. (I actually believe Grok looks good there, and OpenAI’s TTC chicanery behind o3-mini-high-pass@‘1’ deserves more scrutiny.)”

The Missing Piece: Computational and Financial Costs

As the debate raged on, AI researcher Nathan Lambert highlighted a critical piece of the puzzle that remains opaque: the computational — and monetary — cost required for each model to achieve its best score. After all, a model that achieves state-of-the-art results but requires astronomical resources to do so might be less practical than a slightly less capable model that runs efficiently.

This underscores one of the core problems with AI benchmarks: while they can offer valuable insights into a model’s capabilities, they rarely tell the whole story. Factors like inference speed, generalization ability, and real-world applicability are often overshadowed by headline-grabbing leaderboard scores.

The Bigger Picture: Transparency and Accountability in AI

The xAI-Grok 3 controversy is just the latest example of how opaque the AI industry’s benchmarking practices can be. With companies racing to outdo one another in what sometimes feels like an AI arms race, the pressure to present models in the best possible light is intense. But selective data presentation — even when technically accurate — can erode public trust and make it harder for researchers and developers to assess a model’s true capabilities.

If anything, this incident highlights the need for more standardized, transparent benchmarking practices across the AI landscape. Clearer reporting of metrics like cons@64, more context around computational costs, and a commitment to showing the good, the bad, and the ugly of each model’s performance would go a long way toward fostering a more honest and collaborative AI ecosystem.

For now, the rivalry between xAI and OpenAI is sure to continue heating up. But as flashy marketing claims and Twitter spats grab headlines, one question remains: How much do these benchmarks really tell us — and how much do they obscure?
