Artificial intelligence has made impressive strides in fields like coding, content creation, and even podcast generation. However, when it comes to mastering complex historical knowledge, AI still has a long way to go. A new study sheds light on just how far large language models (LLMs) need to improve before they can pass advanced history exams. The findings, presented last month at the prestigious NeurIPS AI conference, reveal significant limitations in three leading AI models: OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini.
Testing AI’s Historical IQ: Introducing Hist-LLM
To measure how well these models handle history, researchers created a specialized benchmark called Hist-LLM. It checks AI answers against the Seshat Global History Databank, a comprehensive database of historical facts and patterns named after the ancient Egyptian goddess of wisdom. Rather than testing surface-level trivia, Hist-LLM probes nuanced historical questions that demand expert-level understanding.
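Conceptually, an evaluation like this reduces to a scoring loop: pull question-answer pairs derived from the databank, put each question to a model, and compare its answer with the historians' ground truth. The sketch below shows that shape only; the record format and the query_model helper are hypothetical stand-ins, not the authors' actual code or data schema.

```python
# Minimal sketch of a Hist-LLM-style evaluation loop. The record
# format and query_model() are hypothetical stand-ins, not the
# benchmark's actual code or API.

def query_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to an LLM and return its answer."""
    raise NotImplementedError

def evaluate(model_name: str, records: list[dict]) -> float:
    """Return the fraction of databank questions the model answers correctly."""
    correct = 0
    for record in records:
        prediction = query_model(model_name, record["question"])
        # Normalize phrasing before comparing to the ground-truth label.
        if prediction.strip().lower() == record["answer"].strip().lower():
            correct += 1
    return correct / len(records)

# Illustrative records in the spirit of the questions described below.
records = [
    {"question": "Was scale armor present in ancient Egypt in this period?",
     "answer": "no"},
    {"question": "Did ancient Egypt have a professional standing army then?",
     "answer": "no"},
]
```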
The results? Disappointing, to say the least. The best-performing model, GPT-4 Turbo, managed only 46% accuracy—barely better than random guessing. For Maria del Rio-Chanona, a co-author of the study and associate professor of computer science at University College London, the findings are clear: “The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task.”
Historical Blind Spots: Where AI Goes Wrong
The study highlights several examples of how these models falter. One glaring error involved a question about whether scale armor was present in ancient Egypt during a specific period. GPT-4 Turbo confidently answered yes, but historians know the technology did not appear in Egypt until some 1,500 years later. Another question asked whether ancient Egypt had a professional standing army during a particular period. The correct answer is no, yet GPT-4 said one existed, likely conflating Egypt with other ancient empires such as Persia.
So why do LLMs stumble over these technical historical questions? According to del Rio-Chanona, the issue lies in how AI models process and prioritize data. “If you get told A and B 100 times, and C only once, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” she explained. In other words, LLMs struggle with obscure or less prominent historical information, often defaulting to broader trends or more widely available data.
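Her point can be made concrete with a toy illustration. The snippet below is a deliberate caricature, not how production LLMs actually work, but it shows how a purely frequency-driven predictor keeps returning the common answers A and B even when asked about the rare case C.

```python
from collections import Counter

# Toy illustration of frequency bias, not how real LLMs work: a
# "model" that saw claims A and B 100 times each and claim C once.
training_claims = ["A"] * 100 + ["B"] * 100 + ["C"]
counts = Counter(training_claims)

def answer(query: str) -> str:
    # A purely frequency-driven predictor ignores the query and
    # falls back on whatever dominated its training data.
    return counts.most_common(1)[0][0]

print(answer("Tell me about C"))  # prints "A": the rare fact is drowned out
```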
Regional Biases and Underrepresented Histories
The study also uncovered regional biases in AI performance. OpenAI’s GPT-4 and Meta’s Llama models, for instance, scored worse on historical questions about sub-Saharan Africa, suggesting that their training data underrepresents certain regions and risks perpetuating existing knowledge gaps.
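Surfacing such a gap is mechanically simple once each benchmark question carries a region tag: compute accuracy per region rather than a single overall score. A minimal sketch, with made-up records and field names rather than the study's actual data:

```python
from collections import defaultdict

# Break accuracy down by region to surface gaps. The `results`
# records and their fields are hypothetical, for illustration only.
results = [
    {"region": "Europe", "correct": True},
    {"region": "Europe", "correct": True},
    {"region": "Sub-Saharan Africa", "correct": False},
    {"region": "Sub-Saharan Africa", "correct": True},
]

totals = defaultdict(lambda: [0, 0])  # region -> [correct, total]
for r in results:
    totals[r["region"]][0] += int(r["correct"])
    totals[r["region"]][1] += 1

for region, (correct, total) in sorted(totals.items()):
    print(f"{region}: {correct / total:.0%} ({correct}/{total})")
```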
Peter Turchin, who led the study and is a faculty member at the Complexity Science Hub (CSH) in Austria, noted that these limitations underscore why LLMs cannot yet replace human expertise in fields like history. “The results show that LLMs still aren’t a substitute for humans when it comes to certain domains,” he said.
Why AI Shines in Coding but Stumbles in History
The disparity between AI’s prowess in technical fields like coding and its struggles with history is striking. Unlike historical data, which is often fragmented and context-dependent, coding relies on structured logic and a well-documented knowledge base. LLMs thrive in such environments, where clear patterns and rules guide their outputs. History, on the other hand, demands a nuanced understanding of context, culture, and causality—areas where AI still falls short.
A Glimmer of Hope: Refining AI for Historical Research
Despite these shortcomings, the researchers remain optimistic about AI’s potential to assist historians. They are working on improving the Hist-LLM benchmark by incorporating more data from underrepresented regions and designing increasingly complex questions.
“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research,” the paper reads. With further development, LLMs could serve as valuable tools for historians, offering insights and connections that might otherwise go unnoticed.
The Road Ahead for AI in History
This study serves as a reminder that while AI is a powerful tool, it’s not infallible. For tasks requiring deep contextual knowledge and critical thinking, human expertise remains irreplaceable. However, as researchers refine benchmarks like Hist-LLM and address biases in training data, we may one day see AI models that can complement, rather than compete with, human historians.
For now, though, the verdict is clear: when it comes to understanding the complexities of history, AI has plenty of homework to do.