If you’ve ever attempted to use ChatGPT or similar AI models as a calculator, you’ve likely run into their frequent mathematical missteps. It’s not just a quirk of OpenAI’s chatbot; many state-of-the-art AI systems struggle with basic math. Anthropic’s Claude stumbles on simple word problems. Google’s Gemini has a hard time grasping quadratic equations. And Meta’s LLaMA? It finds even basic addition challenging.
How is it possible that these advanced AI models, capable of composing essays and mimicking Shakespearean soliloquies, still trip over grade-school-level arithmetic?
The answer is multifaceted, but let’s start with one of the key culprits: tokenization.
Tokenization: Where Numbers Break Down
At the heart of any language model is a process called tokenization, which breaks down data into smaller chunks, or tokens. For words, this might mean splitting something like “fantastic” into “fan,” “tas,” and “tic.” This approach is highly effective for understanding language in a compressed way, allowing AI to store and retrieve linguistic information efficiently.
However, numbers pose a unique problem. Tokenizers have no built-in notion of what a number is, so they don’t split digits consistently. For example, while “380” might be treated as a single token, the number “381” could be split into “38” and “1.” Two numbers that sit side by side on the number line can end up with unrelated token sequences, which obscures the digit-by-digit relationships that arithmetic depends on. As a result, basic math operations often go awry.
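To make the “380” versus “381” example concrete, here is a minimal sketch of a toy tokenizer. The vocabulary `TOY_VOCAB` and the function `greedy_tokenize` are hypothetical, invented for this illustration; production tokenizers learn their vocabularies from data and typically use merge rules rather than the simple longest-match search shown here.

```python
# Hypothetical vocabulary: "380" happens to be an entry, "381" does not.
TOY_VOCAB = {"380", "38", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches {text[i]!r}")
    return tokens


print(greedy_tokenize("380"))  # ['380']
print(greedy_tokenize("381"))  # ['38', '1']
```

The two inputs differ by one, yet the model would see one token in the first case and two in the second, with nothing in the token sequence itself signaling that the values are neighbors.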
But tokenization isn’t the only challenge.
AI Models: Statistical Machines, Not Mathematicians
At their core, AI models like ChatGPT are statistical prediction engines. Trained on vast amounts of data, they learn patterns to generate likely outcomes based on input. When crafting sentences or predicting phrases like “to whom it may concern,” they excel because language follows recognizable patterns.
Mathematics, on the other hand, requires precise, step-by-step reasoning. Let’s say you ask ChatGPT to solve the multiplication problem 57,897 × 12,832. The model might infer that multiplying numbers ending in “7” and “2” will result in a product ending in “4.” But the steps in between? That’s where things get fuzzy.
In one test, ChatGPT returned the answer as 742,021,104, when the correct result was actually 742,934,304. The leading digits and the final digit matched, but the digits in the middle did not.
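The figures above are easy to check by hand or in a few lines of Python. The sketch below reads the operands as 57,897 and 12,832, the pair whose product equals the correct result quoted in the text. The variable names are illustrative only.

```python
a, b = 57_897, 12_832
claimed = 742_021_104  # the answer the article reports from ChatGPT
exact = a * b          # Python integers are arbitrary precision, so this is exact

# The last-digit shortcut: only the final digits of the operands matter.
last_digit_guess = (a % 10) * (b % 10) % 10  # 7 * 2 = 14, so the product ends in 4

print(exact)                            # 742934304
print(last_digit_guess, claimed % 10)   # 4 4
print(abs(exact - claimed))             # 913200
```

The shortcut gets the final digit right for both the exact product and the chatbot’s answer, which is consistent with the idea that surface patterns are easy to match while the carries in the middle are not.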
Benchmarking AI’s Math Skills
Dr. Yuntian Deng, an assistant professor at the University of Waterloo and an expert in AI, recently published a comprehensive study on ChatGPT’s multiplication abilities. His findings were revealing: GPT-4o, the default model, struggled significantly with multi-digit multiplication problems, achieving less than 30% accuracy for numbers with more than four digits.
“Multi-digit multiplication is challenging for language models because a mistake in any intermediate step can compound, leading to incorrect final results,” Deng told TechCrunch.
For example, multiplying 3,459 by 5,284 was enough to stump GPT-4o, and the more digits involved, the worse it performed.
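One way to see why more digits hurt so much is a back-of-the-envelope model, sketched below. It assumes that every single-digit multiplication is an independent step with the same chance of being right, which is a simplification and not a description of how any real model works. The function name and the 0.99 per-step figure are made up for illustration and are not taken from Deng’s study.

```python
def chance_all_steps_correct(digits_a, digits_b, per_step_accuracy):
    """Probability that every digit-by-digit product is right, assuming independent steps."""
    steps = digits_a * digits_b  # one single-digit multiplication per digit pair
    return per_step_accuracy ** steps


# 0.99 is a hypothetical per-step accuracy, not a measured one.
for n in (2, 4, 6, 9):
    print(f"{n} x {n} digits: {chance_all_steps_correct(n, n, 0.99):.3f}")
```

Even with a high per-step success rate, the chance of a fully correct answer falls as the number of steps grows with the product of the operand lengths.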
OpenAI’s o1: A Glimmer of Hope
But there’s hope. In the same study, Deng also tested OpenAI’s newer “reasoning” model, o1, which approaches problems in a more methodical, step-by-step manner before offering a solution. The results were promising. Unlike GPT-4o, o1 handled problems as large as nine digits by nine digits, getting them right approximately 50% of the time.
“The model might be solving the problem in ways that differ from how we solve it manually,” Deng said. “It makes us curious about the model’s internal approach and how it differs from human reasoning.”
While far from perfect, o1’s performance marks significant progress over its predecessor, suggesting that ChatGPT-like systems might eventually master at least some types of mathematical challenges.
The Future of Math in AI: Will It Ever Be as Good as a Calculator?
Deng is optimistic. He believes certain types of math problems — particularly those with clear, defined steps like multiplication — could eventually be fully solvable by AI models. “This is a well-defined task with known algorithms,” he said. “We’re already seeing significant improvements from GPT-4o to o1, so it’s clear that enhancements in reasoning capabilities are happening.”
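For reference, the “known algorithm” for multiplication is the schoolbook method: multiply digit by digit, carry, and add the shifted rows. The sketch below is one straightforward way to write it, working on strings of digits without commas or signs. The function name `schoolbook_multiply` is an illustrative choice, and nothing here is meant to suggest how a language model computes internally.

```python
def schoolbook_multiply(x: str, y: str) -> str:
    """Multiply two non-negative decimal strings the way it is taught on paper."""
    a = [int(ch) for ch in reversed(x)]  # least significant digit first
    b = [int(ch) for ch in reversed(y)]
    result = [0] * (len(a) + len(b))     # a product never needs more digits than this

    for i, da in enumerate(a):
        carry = 0
        for j, db in enumerate(b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10   # keep one digit in this column
            carry = total // 10          # pass the rest to the next column
        result[i + len(b)] += carry      # leftover carry from this row

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(schoolbook_multiply("3459", "5284"))
print(schoolbook_multiply("57897", "12832"))
```

Every step is mechanical, which is why a conventional program gets it right every time; the open question in the article is whether a language model can learn to follow the same steps as reliably.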
That said, don’t toss out your calculator just yet. While AI is making strides, it’s still far from matching the precision and reliability of dedicated mathematical tools. Even with models like o1 showing promise, AI is still learning the ropes of complex arithmetic.
Why AI Arithmetic Isn’t the Same as Human Arithmetic
One of the more intriguing aspects of this AI math issue is that the way models like ChatGPT approach math might be fundamentally different from how humans do. Rather than working through an explicit procedure one step at a time, as we do when solving problems manually, a model without a dedicated reasoning stage relies heavily on recognizing patterns it has encountered during training.
For problems that fall outside these patterns or require a degree of creative reasoning, AI can quickly falter. This is especially true for math, where a single error can snowball into an incorrect solution. In contrast, humans can usually catch and correct mistakes mid-process.
Conclusion: Progress, but Patience Required
As AI systems continue to evolve, their math capabilities are likely to improve. The development of reasoning models like o1 shows that AI is gradually learning to “think” through problems, bringing it closer to mastering complex tasks like multi-digit multiplication. But for now, expect some stumbles along the way.
So, while ChatGPT might one day become as proficient with numbers as a high-end calculator, we’re not there yet. In the meantime, for reliable math solutions, your best bet is to stick with tried-and-true tools — or double-check your chatbot’s answers.
After all, it’s still learning.