Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

TL;DR

The paper tackles ARC-AGI abstract reasoning by arguing that vision and language offer complementary strengths. It introduces Vision-Language Synergy Reasoning (VLSR), which visually summarizes rules from example grids, and Modality-Switch Self-Correction (MSSC), which verifies textual outputs via visual checks to trigger corrective iterations. Across multiple models and ARC-AGI benchmarks, the combined approach yields an average improvement of about $4.3\%$ over text-only reasoning, with larger gains on certain tasks and models (up to $7.25\%$). The authors also demonstrate that training with vision-language cues further enhances performance beyond text-only fine-tuning, underscoring the practical value of visual information in abstract reasoning. Overall, the work argues that a principled fusion of visual abstraction and linguistic precision is a crucial step toward generalizable, human-like intelligence in future foundation models.

Abstract

Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.

Paper Structure

This paper contains 18 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We propose vision-text co-reasoning in abstract reasoning tasks. It integrates the unique advantages of visual and textual thinking, thereby outperforming uni-modal reasoning. All methods use o4-mini as the base model.
  • Figure 2: Textual (left) vs. Visual (right) Thinking in the ARC-AGI Task. Previous work treats ARC-AGI as a pure text task for training and reasoning, as text allows for a precise representation of each element. However, this approach loses the intuitiveness of visual thinking and 2D structural information. In contrast, we organically integrate visual thinking and textual thinking into the ARC-AGI reasoning process, using the complementary strengths of different modalities.
  • Figure 3: Overview of our method. a) Vision-Language Synergy Reasoning decomposes ARC-AGI into two subtasks: rule summarization and rule application. The former visualizes the provided example matrices as images, using global visual perception and 2D structure to summarize the rule. The latter requires element-wise processing, so rule application is carried out in the textual modality. b) Modality-Switch Self-Correction visualizes the output matrix to judge rule consistency. The results are fed back to implement the self-correction strategy if necessary. As visual information is more informative in rule verification, the model can repeatedly refine its answers without relying on additional inputs.
  • Figure 4: Qualitative comparison of text-only vs. vision-language synergy reasoning on GPT-4o. Text-only reasoning processes elements without spatial context, leading to an incorrect rule. Vision-language synergy reasoning uses global 2D perception in the rule-summarization phase to identify the correct spatial pattern ("retain large connected color blocks").
  • Figure 5: Visual reasoning possesses a global perspective, enabling it to better capture the most critical feature (the colored cross) in the entire image and subsequently summarize the correct underlying rule. The base model for both settings is GPT-4o.
  • ...and 3 more figures
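
The two-stage flow described in the Figure 3 caption (summarize the rule from the examples, apply it element-wise in text, then verify the output and retry if needed) can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: `solve_with_synergy`, `grid_to_text`, the three callables, the feedback string, and `max_rounds` are hypothetical names introduced here. In a real pipeline `summarize_rule` and `verify` would presumably be vision-language model calls on rendered grid images, and `apply_rule` a text-only model call; the `toy_*` functions below are deterministic stand-ins so the loop can be run without any model.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, output grid)


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid row by row; the textual form used for element-wise rule application."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def solve_with_synergy(
    examples: List[Example],
    test_input: Grid,
    summarize_rule: Callable[[List[Example]], str],
    apply_rule: Callable[[str, str, Optional[str]], Grid],
    verify: Callable[[List[Example], Grid, Grid], bool],
    max_rounds: int = 3,
) -> Tuple[Grid, int]:
    """Return (candidate grid, rounds used). All three callables are hypothetical stand-ins."""
    # Stage 1 (visual in the paper's framing): summarize the rule once from the examples.
    rule = summarize_rule(examples)
    feedback: Optional[str] = None
    candidate: Grid = []
    for round_index in range(1, max_rounds + 1):
        # Stage 2 (textual): apply the rule to the serialized test grid.
        candidate = apply_rule(rule, grid_to_text(test_input), feedback)
        # Self-correction check (visual in the paper's framing): accept or retry.
        if verify(examples, test_input, candidate):
            return candidate, round_index
        feedback = "The previous answer was judged inconsistent with the examples."
    return candidate, max_rounds


# Toy stand-ins (assumptions for illustration only; no model is called).
def toy_summarize(examples: List[Example]) -> str:
    return "mirror each row left to right"


def toy_apply(rule: str, grid_text: str, feedback: Optional[str]) -> Grid:
    rows = [[int(tok) for tok in line.split()] for line in grid_text.splitlines()]
    if feedback is None:
        return rows  # deliberately wrong first attempt, to exercise the retry path
    return [row[::-1] for row in rows]


def toy_verify(examples: List[Example], test_input: Grid, candidate: Grid) -> bool:
    return candidate == [row[::-1] for row in test_input]


if __name__ == "__main__":
    demo_examples = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
    answer, rounds = solve_with_synergy(
        demo_examples, [[2, 0, 0], [0, 3, 0]], toy_summarize, toy_apply, toy_verify
    )
    print(answer, rounds)
```

Passing the stages in as callables keeps the modality choice outside the loop, so a text-only baseline and a vision-assisted variant can share the same control flow and differ only in which functions are supplied.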