How many times does the letter “r” appear in the word “strawberry”? According to advanced AI systems like GPT-4o and Claude, the answer is twice.
Large language models (LLMs) can churn out essays, solve complex equations in seconds, and process vast amounts of data faster than any human could. Yet, despite their extraordinary capabilities, these seemingly omniscient AIs sometimes fail in such spectacularly simple ways that their mishaps become viral memes. We chuckle and breathe a sigh of relief, reminding ourselves that perhaps we still have some time before bowing to our new AI overlords.
Just take a look at this blunder: pic.twitter.com/K2Lr9iVkjQ
— Rob DenBleyker (@RobDenBleyker) August 26, 2024
The struggle of large language models to grasp basic concepts like counting letters is a telling reminder of a larger truth we often overlook: these systems don’t think like us. They don’t think at all. They aren’t human, nor are they particularly human-like.
Most LLMs are built on transformer architecture, a type of deep learning model. Transformers don’t read text the way we do. Instead, they break down text into tokens, which can be entire words, syllables, or even individual letters, depending on the model’s design.
“LLMs operate on this transformer architecture, which isn’t actually reading text,” explained Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta. “When it encounters the word ‘the,’ it processes it as a single encoding of meaning. It doesn’t perceive the individual letters ‘T,’ ‘H,’ ‘E.’”
This is because transformers are designed to handle numerical representations of text rather than the actual text itself. These numerical codes are then contextualized to generate a logical response. So while the AI might recognize the tokens “straw” and “berry” as making up “strawberry,” it may not grasp that “strawberry” is spelled with the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific sequence. Thus, when asked how many “r”s appear in the word, it might not get it right.
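To make this concrete, here’s a small sketch using OpenAI’s open-source tiktoken tokenizer. The exact way “strawberry” splits into tokens depends on the vocabulary, so treat the output as illustrative rather than a claim about any particular chatbot:

```python
import tiktoken  # open-source BPE tokenizer library from OpenAI

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

word = "strawberry"
token_ids = enc.encode(word)

# What the model "sees": a short list of integer IDs, not ten letters.
print(token_ids)
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))

# What the letter-counting question actually requires:
print(word.count("r"))  # 3 -- trivial on the raw string, invisible in the IDs
```

The model never gets to run anything like `word.count("r")`; it only ever receives the integer IDs, which is why a question about individual letters is harder than it looks.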
Fixing this issue is no simple task, as it’s embedded deep within the architecture that powers these LLMs.
I thought Dune 2 was the best movie of 2024 until I watched this masterpiece (sound on). pic.twitter.com/W9WRhq9WuW
— Peter Yang (@petergyang) March 7, 2024
Sheridan Feucht, a PhD student at Northeastern University who studies LLM interpretability, explained why. “It’s tricky to define what exactly a ‘word’ is for a language model,” Feucht noted. “Even if human experts could agree on the perfect token vocabulary, models would still find value in breaking things down further.”
The complexity deepens as LLMs learn more languages. For instance, some tokenization methods assume that spaces always indicate new words, but languages like Chinese, Japanese, Thai, Lao, Korean, and Khmer do not use spaces to separate words. A 2023 study by Google DeepMind AI researcher Yennie Jun found that some languages require up to ten times more tokens than English to convey the same meaning.
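As a rough illustration of that effect, the sketch below uses the same open-source tiktoken tokenizer to count tokens for one short greeting in a few languages. The translations and the specific counts are illustrative assumptions; the exact numbers depend on the tokenizer vocabulary:

```python
import tiktoken  # open-source BPE tokenizer library from OpenAI

enc = tiktoken.get_encoding("cl100k_base")

# The same greeting in several languages (translations are illustrative).
samples = {
    "English":  "Hello, how are you today?",
    "Japanese": "こんにちは、今日はお元気ですか？",
    "Thai":     "สวัสดี วันนี้คุณสบายดีไหม",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    # Non-Latin scripts tend to split into more tokens per character,
    # because BPE vocabularies are dominated by English training text.
    print(f"{language:9s} {len(text):3d} chars -> {len(tokens):3d} tokens")
```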
“It might be ideal to let models process characters directly without tokenization,” Feucht added, “but that’s currently computationally impractical for transformers.”
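A back-of-the-envelope calculation suggests why. Self-attention cost in a transformer grows roughly with the square of the sequence length, and feeding raw characters instead of subword tokens makes sequences several times longer. The per-word averages below are assumptions for illustration, not measurements:

```python
# Illustrative arithmetic (assumed averages, not measured values):
# transformer self-attention scales roughly with sequence_length ** 2.
chars_per_word = 6      # assumption: average English word, including the space
tokens_per_word = 1.3   # assumption: typical subword-tokenization rate for English
words = 1_000

char_len = words * chars_per_word    # ~6,000 positions if we feed raw characters
token_len = words * tokens_per_word  # ~1,300 positions with subword tokens

blowup = (char_len / token_len) ** 2  # quadratic growth in attention compute
print(f"~{blowup:.0f}x more attention compute")  # roughly 21x under these assumptions
```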
Interestingly, image generators like Midjourney and DALL-E don’t run on the transformer architecture that underpins text generators. Instead, they typically rely on diffusion models, which reconstruct images from noise. These models are trained on extensive databases of images, and they’re optimized to recreate visuals similar to those in their training data.
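The loop below is a toy numerical sketch of that idea, not how Midjourney or DALL-E actually work: it starts from pure noise and repeatedly removes a little of the “predicted” noise. The noise predictor here cheats by knowing the target, whereas a real diffusion model learns to predict the noise from training data:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 64)  # stand-in for a "training image" (a 1-D ramp)
x = rng.standard_normal(64)         # step 0: pure Gaussian noise

steps = 50
for t in range(steps):
    predicted_noise = x - target            # cheating stand-in for the learned noise predictor
    x = x - predicted_noise / (steps - t)   # remove a small fraction of the predicted noise
    x = x + 0.01 * rng.standard_normal(64)  # small stochastic term, as in many samplers

print(np.abs(x - target).mean())  # the error shrinks as the denoising loop runs
```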
“Image generators excel at creating artifacts like cars and human faces, but they struggle with finer details like fingers or handwriting,” said Asmelash Teka Hadgu, co-founder of Lesan and a fellow at the DAIR Institute.
This difficulty likely arises because smaller details don’t appear as prominently in training sets as more general concepts, such as trees typically having green leaves. However, the issues with diffusion models might be easier to address than those affecting transformers. Some image generators have already improved their representation of hands by training on more images of real human hands.
“Not long ago, these models were notoriously bad at rendering fingers, and that’s similar to the challenges with text,” Guzdial pointed out. “They’re getting better at the details. So, if you see a hand with six or seven fingers, each individual finger may look convincing at first glance. Similarly, with generated text, an ‘H’ might look like an ‘H,’ but the model struggles to piece these elements together correctly.”
That’s why asking an AI image generator to create a Mexican restaurant menu might yield items like “Tacos,” alongside oddities like “Tamilos,” “Enchidaa,” and “Burhiltos.”
As memes about AI’s inability to spell “strawberry” spread across the internet, OpenAI is hard at work on a new product code-named Strawberry, promising even better reasoning capabilities. The advancement of LLMs has been limited by the finite amount of training data available globally. However, Strawberry is reportedly capable of generating accurate synthetic data, potentially boosting OpenAI’s LLMs’ accuracy. According to reports from The Information, Strawberry can solve the New York Times’ Connections word puzzles, which require creative thinking and pattern recognition, and even tackle math problems it hasn’t encountered before.
Meanwhile, Google DeepMind has unveiled AlphaProof and AlphaGeometry 2, AI systems designed for formal mathematical reasoning. Google claims these systems have solved four out of six problems from the International Math Olympiad, a performance that would earn a silver medal at the prestigious competition.
It’s almost poetic that memes mocking AI for misspelling “strawberry” are trending just as OpenAI is preparing to release its new Strawberry AI. But OpenAI CEO Sam Altman seems to be enjoying the moment, sharing a bountiful harvest from his own garden.