The Arc Prize Foundation, a nonprofit dedicated to advancing artificial intelligence research, has introduced a groundbreaking test designed to push AI models to their cognitive limits. Named ARC-AGI-2, this new benchmark aims to evaluate an AI’s ability to think abstractly, adapt to new problems, and solve complex puzzles without relying on brute-force computation.

The results so far? A stark reality check for AI development. Even the most advanced models have struggled to achieve meaningful scores, suggesting that true artificial general intelligence (AGI) is still a distant goal.
AI Models Stumble on ARC-AGI-2
Despite the rapid progress in AI, leading models have performed dismally on ARC-AGI-2. According to the Arc Prize Foundation’s leaderboard, OpenAI’s o1-pro and DeepSeek’s R1—both designed with reasoning capabilities—scored between 1% and 1.3%. Other cutting-edge models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, hovered around just 1%.
In contrast, human participants fared significantly better. The foundation gathered data from over 400 people, forming test panels that achieved an average score of 60%—vastly outperforming AI competitors.
What Makes ARC-AGI-2 Different?
ARC-AGI-2 builds on its predecessor, ARC-AGI-1, with a more rigorous evaluation of intelligence. The test consists of puzzle-like problems that require AI models to recognize visual patterns in grids of colored squares and generate the correct answer grid. Unlike traditional benchmarks, these problems are designed to be unfamiliar, so models cannot rely on memorization or pre-existing data.
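To make the puzzle format concrete, here is a minimal Python sketch. It assumes a task is a handful of example input/output grid pairs plus a test input, that each grid is a list of rows of integer color codes, and that an answer counts only if it matches the expected grid exactly. The toy rule (mirror each row), the helper names, and the scoring function are illustrative assumptions, not the foundation's actual tasks or evaluation code.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each integer stands for one cell color

# A toy task (hypothetical): the hidden rule is "mirror each row left to right".
task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]}],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def rule_fits_examples(rule: Callable[[Grid], Grid], task: Dict[str, list]) -> bool:
    """True if the rule reproduces every demonstration pair exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


def score(rule: Callable[[Grid], Grid], task: Dict[str, list]) -> float:
    """Fraction of test grids the rule gets exactly right (no partial credit)."""
    tests = task["test"]
    correct = sum(rule(t["input"]) == t["output"] for t in tests)
    return correct / len(tests)
```

In this sketch, `score(mirror_rows, task)` is 1.0, while a rule that returns the grid unchanged scores 0.0. The hard part for a model is inferring a rule like `mirror_rows` from only a few examples it has never seen before.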
François Chollet, a prominent AI researcher and co-founder of the Arc Prize Foundation, explained that ARC-AGI-2 is designed to address the shortcomings of its predecessor. A major flaw of ARC-AGI-1 was that models could lean on brute-force computation, using massive processing power to search for answers rather than exhibiting true intelligence. To counter this, the new test adds a metric that ARC-AGI-1 lacked: efficiency.
“In intelligence, the ability to acquire and deploy capabilities efficiently is just as important as problem-solving itself,” Arc Prize Foundation co-founder Greg Kamradt emphasized in a blog post. “The key question is not just whether AI can solve a task, but how efficiently it can learn and apply new skills.”
Efficiency as the New Benchmark
One of the most revealing aspects of ARC-AGI-2 is its efficiency requirement. Unlike ARC-AGI-1, where OpenAI’s o3 model achieved a breakthrough score of 75.7%—albeit at significant computational expense—the new test demands that AI models achieve high accuracy with minimal processing cost.

To put this in perspective, the first AI model to excel at ARC-AGI-1, OpenAI’s o3 (low), needed approximately $200 worth of computing power per task. When applied to ARC-AGI-2, that same model managed only a 4% success rate.
The shift in focus from raw computational power to cost-effective intelligence aligns with growing concerns in the AI industry. As models become increasingly expensive to train and deploy, researchers and developers are seeking new ways to measure progress without unsustainable resource consumption.
The Race to 85% Accuracy: A New Challenge for AI Developers
With ARC-AGI-2 establishing itself as a formidable test, the Arc Prize Foundation has launched a new competition for 2025. The challenge? Achieve at least 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task.
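The contest criterion combines two numbers, accuracy and cost per task, and a short Python sketch shows how they interact. The 85% and $0.42 thresholds come from the article. The `RunResult` class, its field names, and the sample runs are hypothetical and exist only to illustrate the arithmetic.

```python
from dataclasses import dataclass

TARGET_ACCURACY = 0.85    # at least 85% of tasks solved (from the article)
MAX_COST_PER_TASK = 0.42  # at most $0.42 spent per task (from the article)


@dataclass
class RunResult:
    """Hypothetical summary of one model's run over a set of tasks."""
    tasks_attempted: int
    tasks_solved: int
    total_cost_usd: float

    @property
    def accuracy(self) -> float:
        return self.tasks_solved / self.tasks_attempted

    @property
    def cost_per_task(self) -> float:
        return self.total_cost_usd / self.tasks_attempted


def meets_target(run: RunResult) -> bool:
    """Both conditions must hold: accurate enough AND cheap enough."""
    return (run.accuracy >= TARGET_ACCURACY
            and run.cost_per_task <= MAX_COST_PER_TASK)


# Illustrative run using the o3 (low) figures quoted in this article:
# a 4% score at roughly $200 of compute per task.
expensive_run = RunResult(tasks_attempted=100, tasks_solved=4, total_cost_usd=20_000.0)

# Made-up run that would clear both bars: 86% at $0.40 per task.
efficient_run = RunResult(tasks_attempted=100, tasks_solved=86, total_cost_usd=40.0)
```

Here `meets_target(expensive_run)` is False and `meets_target(efficient_run)` is True. Under this reading of the rules, a highly accurate system that overspends fails just as surely as a cheap one that is inaccurate.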
This ambitious goal presents a unique opportunity for AI developers to showcase true innovation. Rather than relying on massive datasets or overwhelming computing power, the contest encourages researchers to build more efficient, adaptable, and truly intelligent systems.
As the AI industry continues its quest for artificial general intelligence, ARC-AGI-2 serves as a reminder that intelligence is not just about solving problems but also about how efficiently solutions are learned and applied. With today's models scoring in the low single digits on the new benchmark, the road to AGI remains long, and it demands not just greater power but smarter, more adaptable thinking.