The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning

Sahar Vahdati, Andrei Aioanei, Haridhra Suresh, Jens Lehmann

Abstract

The Abstraction and Reasoning Corpus (ARC-AGI) has become a key benchmark for fluid intelligence in AI. This survey presents the first cross-generation analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. Our central finding is that performance degradation across versions is consistent across all paradigms: program synthesis, neuro-symbolic, and neural approaches all exhibit 2-3x drops from ARC-AGI-1 to ARC-AGI-2, indicating fundamental limitations in compositional generalization. Systems now reach 93.0% on ARC-AGI-1 (Opus 4.6), but performance falls to 68.8% on ARC-AGI-2 and 13% on ARC-AGI-3, whereas humans maintain near-perfect accuracy across all versions. Cost fell 390x in one year (from o3's $4,500/task to GPT-5.2's $12/task), although this largely reflects reduced test-time parallelism. Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries (660M-8B) achieve competitive results, aligning with Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved. ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, confirming that reasoning remains knowledge-bound. This first release of the ARC-AGI Living Survey captures the field as of February 2026, with updates at https://nimi-ai.com/arc-survey/

Paper Structure (44 sections, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Example task from each ARC-AGI version and the performance cliff across them. On the left side, we show one example from each ARC-AGI version, illustrating the increased complexity and requirements. On the right side, horizontal stacked bars show best AI performance (darker color) and the gap to human baseline (lighter color) for each benchmark version.
  • Figure 2: Representative ARC-AGI tasks illustrating two core reasoning categories. Left: Object-centric reasoning (Task f76d97a5, 3 training examples). The transformation extracts the colored checkerboard pattern from a gray background, requiring the system to identify "object" versus "background" without explicit segmentation cues. Right: Geometric transformation (Task c97c0139, 2 training examples). Red line segments define reflection axes around which cyan diamond shapes must be generated symmetrically. Both tasks require inferring abstract rules from minimal demonstrations and generalizing to novel configurations.
  • Figure 3: ARC-AGI performance across benchmark versions. Public leaderboard (solid bars) allows unconstrained compute and API access; Kaggle competition (hatched bars) is constrained to $50 compute budget with no internet. Arrows show gap to human baseline. The human baseline of 100% represents task-level solvability (every task solved by at least one person); average individual accuracy is 76% on ARC-AGI-1 and 60% on ARC-AGI-2. Key findings: (1) Land's cross-model ensemble (land2025_arc_solver) sets public SOTA at 94.5% (ARC-AGI-1) and 72.9% (ARC-AGI-2); (2) Opus 4.6 nearly matches Land at 93.0% and 68.8% respectively, at a fraction of the cost ($1.88 and $3.64/task vs. $11.40 and $38.90/task); (3) Public scores exceed Kaggle by 30-70 percentage points, reflecting unconstrained vs. constrained compute regimes; (4) The performance cliff persists across all systems: even the best public entry drops 23% in relative terms from ARC-AGI-1 to ARC-AGI-2; (5) Kaggle winners achieve better cost-efficiency: NVARC scores 24% at $0.20/task (120 pts/$) vs. Land's 72.9% at $38.90/task (2 pts/$), a 60x efficiency gap.
  • Figure 4: Cross-generation performance cliff: seven systems evaluated on both ARC-AGI-1 and ARC-AGI-2. Land's cross-model ensemble (land2025_arc_solver) achieves the highest scores on both benchmarks but still drops 23%. Notably, Opus 4.6 shows the smallest single-model drop (-26%, from 93.0% to 68.8%), compared to its predecessor Opus 4.5 (-53%) and GPT-5.2 Pro (-40%). Other single frontier models (solid bars) drop 40-63%, while Kaggle competition winners (hatched bars) drop 70-77%. The human baseline of 100% represents task-level solvability (every task solved by at least one person); average individual accuracy is 76% on ARC-AGI-1 and 60% on ARC-AGI-2 (chollet_arc-agi-2_2025).
  • Figure 5: Temporal evolution of ARC-AGI performance (2019-2025). Blue: ARC-AGI-1; Amber: ARC-AGI-2. The 2024 phase transition shows performance improving more in six months than in the previous five years. The consistent 2.5-3x degradation from ARC-AGI-1 to ARC-AGI-2 across all approaches indicates fundamental compositional limitations.
  • ...and 7 more figures
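
The drop and cost-efficiency figures quoted in the Figure 3 and Figure 4 captions can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and not taken from the survey, and it assumes that "drop" means relative decline from the ARC-AGI-1 score and that "pts/$" means score in percent divided by per-task cost. The input numbers are copied from the captions.

```python
# Scores (%) and per-task costs (USD) copied from the Figure 3 and 4 captions.
# Function names are illustrative assumptions, not part of the survey.


def relative_drop(score_v1: float, score_v2: float) -> float:
    """Relative decline from ARC-AGI-1 to ARC-AGI-2, as a percent of the v1 score."""
    return 100.0 * (score_v1 - score_v2) / score_v1


def points_per_dollar(score: float, cost_per_task: float) -> float:
    """Score (in percent) obtained per dollar of per-task cost."""
    return score / cost_per_task


land_drop = relative_drop(94.5, 72.9)   # Land's ensemble
opus_drop = relative_drop(93.0, 68.8)   # Opus 4.6
nvarc_eff = points_per_dollar(24.0, 0.20)
land_eff = points_per_dollar(72.9, 38.90)

print(f"Land drop:  {land_drop:.1f}%")
print(f"Opus drop:  {opus_drop:.1f}%")
print(f"NVARC:      {nvarc_eff:.0f} pts/$")
print(f"Land:       {land_eff:.2f} pts/$")
print(f"Efficiency gap: {nvarc_eff / land_eff:.0f}x")
```

Under these assumptions the drops come out at about 22.9% and 26.0%, matching the captions' 23% and 26%. The unrounded efficiency ratio is about 64x; the caption's 60x follows from first rounding Land's efficiency to 2 pts/$.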