ARC Is a Vision Problem!
Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang, Yeyin Eva Zhu, Jacob Andreas, Kaiming He
TL;DR
This work reframes the Abstraction and Reasoning Corpus (ARC) as a vision-centric image-to-image translation problem, in a framework called Vision ARC (VARC). VARC places inputs on a flexible canvas, processes them with a Vision Transformer (patch-based tokens with 2D positional embeddings), and combines translation and scale priors with test-time training. It reaches 54.5% accuracy on ARC-1 with a single model and 60.4% with ensembling, approaching average human performance. The results suggest that visual priors, a scalable architecture, and multi-view inference can support strong cross-task generalization from ARC data alone, and that image-based priors may apply more broadly to reasoning tasks.
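To make "patch-based tokens with 2D positional embeddings" concrete, here is a minimal sketch of how a canvas of color indices could be cut into non-overlapping patches, each tagged with a (row, column) position. The function name `patchify`, the patch size of 2, and the use of raw color indices as token contents are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def patchify(canvas, patch=2):
    """Split a 2D canvas into non-overlapping patch tokens with 2D positions.

    Hypothetical helper for illustration; `patch=2` is an assumed value.
    Returns (tokens, positions): tokens has shape (num_patches, patch * patch),
    positions has shape (num_patches, 2) holding (patch_row, patch_col).
    """
    canvas = np.asarray(canvas)
    h, w = canvas.shape
    if h % patch or w % patch:
        raise ValueError("canvas height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> one flat row per patch
    tokens = (
        canvas.reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gh * gw, patch * patch)
    )
    # A 2D position per token: the patch's row and column in the patch grid.
    rows, cols = np.divmod(np.arange(gh * gw), gw)
    positions = np.stack([rows, cols], axis=1)
    return tokens, positions
```

In a ViT, each token would then pass through a learned embedding, and the (row, column) pair would select or compute a 2D positional embedding, so the model knows where each patch sits on the canvas rather than only its index in a flattened sequence.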
Abstract
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. This makes it natural to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark with ensembling (54.5% with a single model), substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
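As a rough illustration of the "canvas" idea, the sketch below upscales a small ARC grid by an integer factor and pastes it at a chosen offset onto a fixed-size canvas, which is one simple way to expose scale and translation variation to a vision model. The name `place_on_canvas`, the 64x64 canvas size, and the extra background index 10 are assumptions made for this example, not specifics confirmed by the text above.

```python
import numpy as np

BACKGROUND = 10  # assumed extra index for empty canvas cells (ARC colors are 0-9)

def place_on_canvas(grid, canvas_size=64, scale=1, offset=(0, 0)):
    """Upscale `grid` by an integer factor and paste it onto a square canvas.

    Hypothetical helper: `canvas_size`, `scale`, and `offset` are illustrative
    parameters; varying `scale` and `offset` mimics scale/translation augmentation.
    """
    grid = np.asarray(grid, dtype=np.int64)
    # Nearest-neighbor upscaling: each cell becomes a scale x scale block.
    scaled = np.repeat(np.repeat(grid, scale, axis=0), scale, axis=1)
    h, w = scaled.shape
    top, left = offset
    if top < 0 or left < 0 or top + h > canvas_size or left + w > canvas_size:
        raise ValueError("scaled grid does not fit on the canvas at this offset")
    canvas = np.full((canvas_size, canvas_size), BACKGROUND, dtype=np.int64)
    canvas[top:top + h, left:left + w] = scaled
    return canvas

# Usage: a 2x2 grid, doubled in size and shifted on an 8x8 canvas.
example = place_on_canvas([[1, 2], [3, 4]], canvas_size=8, scale=2, offset=(1, 3))
```

Because every task ends up on a canvas of the same shape, inputs of different grid sizes can be fed to a single fixed-resolution image model.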
