Debates over AI benchmarks — and the way companies present them — are erupting across the tech world, with accusations of misrepresentation flying between some of the industry’s biggest players. This week, a public spat emerged when an OpenAI employee accused Elon Musk’s AI startup, xAI, of publishing misleading performance results for its latest language model, Grok 3. In response, Igor Babushkin, one of xAI’s co-founders, defended the company’s claims, asserting that their data presentation was valid. But as with most controversies, the truth lies somewhere in the middle.
Let’s break it all down.
The Benchmark That Sparked the Debate
In a post on xAI’s official blog, the company showcased a graph illustrating Grok 3’s performance on AIME 2025, a set of challenging questions drawn from the American Invitational Mathematics Examination, a prestigious high school math competition. While some researchers have questioned AIME’s validity as a measure of general intelligence, it remains widely used to evaluate an AI model’s mathematical reasoning abilities.
According to xAI’s graph, Grok 3 Reasoning Beta and Grok 3 mini Reasoning outperformed OpenAI’s best publicly available model, o3-mini-high, on AIME 2025. The implication? Grok 3 was smarter — at least when it came to solving challenging math problems. But it didn’t take long for OpenAI employees to challenge the validity of these claims on social media.
What Is “cons@64,” and Why Does It Matter?
The crux of the dispute revolves around a metric called “consensus@64” (or cons@64). Under this scheme, a model gets 64 attempts at each problem, and its most frequent response is taken as the final answer. Because repeated sampling smooths over the randomness of any single generation, majority voting can significantly boost a model’s score relative to a one-shot attempt.
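To make the metric concrete, here is a minimal sketch of how a cons@64-style score could be computed. The `model.generate` interface is hypothetical, invented purely for illustration, and a real evaluation harness would also normalize answers (stripping whitespace, canonicalizing formatting) before counting votes:

```python
from collections import Counter

def consensus_at_k(model, problem, k=64):
    """Sample k answers and return the most frequent one (majority vote)."""
    # k independent attempts at the same problem
    answers = [model.generate(problem) for _ in range(k)]
    best_answer, _votes = Counter(answers).most_common(1)[0]
    return best_answer

def pass_at_1(model, problem):
    """A single attempt, no voting: the "@1" number in the charts."""
    return model.generate(problem)
```

Because AIME answers are integers from 0 to 999, exact-match voting works cleanly here; the vote washes out low-probability slips that would sink any single attempt.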
The catch? xAI’s graph didn’t include OpenAI’s o3-mini-high score at cons@64 — a notable omission. Without that data point, the graph made it seem like Grok 3 was outperforming OpenAI’s models across the board. But when comparing the models’ performance at “@1” (the score from the first attempt), both Grok 3 Reasoning Beta and Grok 3 mini Reasoning fell short of o3-mini-high. Grok 3 Reasoning Beta even trailed slightly behind OpenAI’s o1 model at its “medium” computing setting.
Despite this, xAI boldly advertised Grok 3 as the “world’s smartest AI.”
xAI’s Defense — and the Counterarguments
Igor Babushkin, xAI’s co-founder, pushed back against the criticism, claiming that OpenAI has been guilty of similarly selective data presentations in the past — though usually when comparing its own internal models. In Babushkin’s view, the competitive nature of the AI industry means everyone is prone to a little bit of selective framing.
Meanwhile, an independent AI researcher attempted to settle the matter by compiling a more comprehensive graph that included nearly every model’s performance at cons@64. The result? A much more nuanced picture, showing that while Grok 3 did hold its own, the outright dominance suggested by xAI’s original graph was overstated.
“Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok,” wrote the researcher, who goes by the handle @teortaxesTex, in a viral post. “In reality, it’s DeepSeek propaganda. (I actually believe Grok looks good there, and OpenAI’s TTC chicanery behind o3-mini-high-pass@‘1’ deserves more scrutiny.)”
The Missing Piece: Computational and Financial Costs
As the debate raged on, AI researcher Nathan Lambert highlighted a critical piece of the puzzle that remains opaque: the computational — and monetary — cost required for each model to achieve its best score. After all, a model that achieves state-of-the-art results but requires astronomical resources to do so might be less practical than a slightly less capable model that runs efficiently.
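Some back-of-the-envelope arithmetic shows why that opacity matters: a cons@64 run consumes roughly 64 times the inference compute of a single attempt per problem. The prices in this sketch are invented purely for illustration, not actual figures for Grok 3 or o3-mini:

```python
# Hypothetical numbers only: assume one reasoning attempt on one
# AIME problem costs $0.05 of inference compute.
cost_per_attempt = 0.05  # dollars, assumed
num_problems = 15        # AIME poses 15 questions

pass_at_1_cost = num_problems * cost_per_attempt        # one attempt each
cons_at_64_cost = num_problems * cost_per_attempt * 64  # 64 attempts each

print(f"pass@1:  ${pass_at_1_cost:.2f}")   # $0.75
print(f"cons@64: ${cons_at_64_cost:.2f}")  # $48.00
```

Whatever the real per-token prices are, the multiplier is the point: a headline score bought with 64 samples is not the same product as a score earned in one shot.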
This underscores one of the core problems with AI benchmarks: while they can offer valuable insights into a model’s capabilities, they rarely tell the whole story. Factors like inference speed, generalization ability, and real-world applicability are often overshadowed by headline-grabbing leaderboard scores.
The Bigger Picture: Transparency and Accountability in AI
The xAI-Grok 3 controversy is just the latest example of how opaque the AI industry’s benchmarking practices can be. With companies racing to outdo one another in what sometimes feels like an AI arms race, the pressure to present models in the best possible light is intense. But selective data presentation — even when technically accurate — can erode public trust and make it harder for researchers and developers to assess a model’s true capabilities.
If anything, this incident highlights the need for more standardized, transparent benchmarking practices across the AI landscape. Clearer reporting of metrics like cons@64, more context around computational costs, and a commitment to showing the good, the bad, and the ugly of each model’s performance would go a long way toward fostering a more honest and collaborative AI ecosystem.
For now, the rivalry between xAI and OpenAI is sure to continue heating up. But as flashy marketing claims and Twitter spats grab headlines, one question remains: How much do these benchmarks really tell us — and how much do they obscure?