Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI
Rolf Pfister, Hansueli Jud
TL;DR
The paper critiques ARC-AGI as a measure of general intelligence, arguing that its tasks and evaluation reward the exploitation of existing skills rather than true generalisation. It adopts a foundational view of intelligence as the ability to create new skills for unknown conditions, framed in the context of the No Free Lunch theorems. The analysis shows that ARC-AGI's simple problem structure and its tolerance of massive trial and error undermine its validity as an AGI benchmark, and it highlights the risk of Goodhart effects. The paper advocates a next-generation benchmark built on diverse, unknown worlds that assesses how efficiently an agent solves problems, independent of human-centric assumptions, to guide progress toward robust AGI. It concludes that advancing toward AGI requires shifting emphasis from data and compute to intelligence-centric algorithm design, with a focus on abstraction and reasoning.
Abstract
OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI). Building on the distinction between skills and intelligence made by François Chollet, the creator of ARC-AGI, a new understanding of intelligence is introduced: an agent is more intelligent the more efficiently it can achieve more diverse goals in more diverse worlds with less knowledge. An analysis of the ARC-AGI benchmark shows that its tasks represent a very specific type of problem, one that can be solved by massive trialling of combinations of predefined operations. o3 applies the same method and achieves its high score through extensive use of computing power. However, for most problems in the physical world and in the human domain, solutions cannot be tested in advance and predefined operations are not available. Consequently, massive trialling of predefined operations, as o3 does, cannot be a basis for AGI; instead, new approaches are required that can reliably solve a wide variety of problems without existing skills. To support this development, a new benchmark for intelligence is outlined that covers a much higher diversity of unknown tasks, thus enabling a comprehensive assessment of intelligence and of progress towards AGI.
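To make "massive trialling of combinations of predefined operations" concrete, the following is a minimal sketch of a brute-force program search over grid transformations. It is an illustration only, not o3's actual mechanism and not an official ARC-AGI solver: the operation set (`flip_h`, `flip_v`, `transpose`), the helper names `apply_program` and `search`, and the toy 2x2 task are all assumptions introduced here.

```python
from itertools import product

# A grid is an immutable tuple of rows, so grids can be compared with ==.
Grid = tuple[tuple[int, ...], ...]

def flip_h(g: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return tuple(row[::-1] for row in g)

def flip_v(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return g[::-1]

def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return tuple(zip(*g))

# Hypothetical set of predefined operations; a real solver would use far more.
OPS = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def apply_program(program, grid: Grid) -> Grid:
    """Apply a sequence of operation names to a grid, left to right."""
    for name in program:
        grid = OPS[name](grid)
    return grid

def search(examples, max_len: int = 3):
    """Try every operation sequence up to max_len and return the first
    one that maps every example input to its output, or None."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(apply_program(program, x) == y for x, y in examples):
                return program
    return None

# Toy task: the output is the input rotated 90 degrees clockwise.
examples = [(((1, 2), (3, 4)), ((3, 1), (4, 2)))]
program = search(examples)
print(program)
```

The search space grows exponentially with program length (here 3 + 9 + 27 = 39 candidates for `max_len=3`), and each candidate can be checked against the example pairs before answering. Both properties depend on operations being given in advance and on solutions being testable, which is the limitation the abstract points to.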
