How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI

Bo Wen, Chen Wang, Erhan Bilal

TL;DR

This work investigates how input modality shapes perception and reasoning in ARC-like tasks by isolating perception from execution across nine text/image modalities and using a weighted set-disagreement metric plus a two-stage reasoning pipeline. It finds that structured text encodings provide precise coordinates for sparse features, while image encodings preserve 2D structure but suffer from patch-size aliasing; combining modalities enables cross-validation that improves both perception and execution (perception gains of about 8 points and execution gains of about 0.20 in median similarity). The study offers concrete guidance for selecting context encodings (e.g., json/ascii for coordinates, row/col for directional patterns) and shows that multi-modal inputs can boost robustness without altering the underlying model. Overall, aligning representations with transformer inductive biases and enabling cross-modal checks emerges as a practical strategy to enhance instruction quality and execution reliability in spatial reasoning tasks.

Abstract

ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks -- text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing -- thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline. We find that structured text yields precise coordinates on sparse features, that images capture 2D shapes yet are resolution-sensitive, and that combining the two improves both perception (by about 8 points) and execution (by about 0.20 in median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.
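To make the "text flattens 2D structure into 1D tokens" point concrete, the following is a minimal sketch of three text serializations of the same grid. The modality names row_only, ascii, and json come from the paper; the exact string formats, the function names, and the sample grid are assumptions chosen to match the Figure 3 description, not the authors' implementation.

```python
import json

# Hypothetical 3x3 grid of color indices (0-9), as in ARC-style tasks.
GRID = [
    [0, 0, 3],
    [0, 3, 0],
    [3, 0, 0],
]


def encode_row_only(grid):
    """Assumed format: digits of each row concatenated, one row per line."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def encode_ascii(grid):
    """Assumed format: digits separated by single spaces, one row per line."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def encode_json(grid):
    """Assumed format: nested JSON lists with explicit brackets and commas."""
    return json.dumps(grid)


if __name__ == "__main__":
    for name, encoder in [
        ("row_only", encode_row_only),
        ("ascii", encode_ascii),
        ("json", encode_json),
    ]:
        print(f"--- {name} ---")
        print(encoder(GRID))
```

In all three cases the diagonal of 3s survives only as positions within a character stream, so a model has to recover vertical adjacency from row offsets rather than reading it directly from the layout.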

Paper Structure

This paper contains 118 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Example of image modality representation showing 2×2 grid regions from the upper left (columns A--B, rows 1--2) and lower right (columns AC--AD, rows 29--30) corners of a 30×30 grid from challenge 0934a4d8. Each cell displays its spreadsheet coordinate label (e.g., "A1", "B2", "AC29") along with its color value. The coordinate labels use a vertical layout to fit within the cell boundaries, enabling precise spatial grounding for vision models. This example uses the image_16x16 modality (16×16 pixels per cell).
  • Figure 2: Score distributions by modality (sorted by median) pooled across experiment/test and context variants on challenge 13e47133. Each violin plot shows the probability density of similarity scores, with horizontal lines indicating medians and individual gray dots representing specific data points. Modalities are sorted left-to-right by increasing median score. The plot includes scores from all three training examples (via held-one-out validation) and two test cases, revealing systematic differences in execution accuracy across modality combinations.
  • Figure 3: Visual comparison of tokenization for row_only, ascii, and json formats using the same 3×10 grid example, generated using OpenAI's tokenizer. Each token is highlighted with a distinct color, showing how the same grid data is segmented differently across modalities. Note that different LLMs may tokenize the same input differently depending on their tokenizer implementation. The row_only format often groups consecutive digits into multi-digit tokens (behavior is tokenizer-specific), ascii separates each digit and space, and json treats brackets and commas as distinct structural tokens.
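The spreadsheet-style coordinate labels described in the Figure 1 caption (e.g., "A1", "B2", "AC29") can be generated with a short helper. This is an illustrative sketch only: the function names and the 0-based indexing convention are assumptions, and the paper's rendering code may differ.

```python
def column_label(col_index):
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    label = ""
    n = col_index + 1  # work in 1-based "bijective base 26"
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def cell_label(row_index, col_index):
    """Build a label such as 'AC29' from 0-based row and column indices."""
    return f"{column_label(col_index)}{row_index + 1}"


if __name__ == "__main__":
    # Corners of a 30x30 grid, matching the regions shown in Figure 1.
    print(cell_label(0, 0))    # upper-left cell
    print(cell_label(29, 29))  # lower-right cell
```

For a 30×30 grid the columns run from A to AD, so the lower-right 2×2 region covers columns AC and AD and rows 29 and 30, consistent with the caption.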