EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Mert Unsal, Aylin Akkus

TL;DR

EasyARC introduces a true visual reasoning benchmark for vision-language models, addressing gaps in prior tasks that emphasize extraction over multi-step reasoning. By procedurally generating scalable, verifiable, multi-image datasets with progressive difficulty inspired by ARC, it enables rigorous evaluation and RL-friendly training. The evaluation reveals that current SoTA VLMs largely struggle to perform true visual reasoning, with Claude 3.7 Sonnet showing the strongest but still limited performance and clear failure modes across tasks. The work provides open-source data and prompts to foster research into true visual reasoning and test-time scaling, offering a new standard for multimodal evaluation.

Abstract

Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models. We open-source our benchmark dataset and evaluation code.

Paper Structure

This paper contains 24 sections, 11 figures.

Figures (11)

  • Figure 1: Example Task from EasyARC: The transformation is to identify the largest connected component of non-background color and fill the answer with the component flattened. All SoTA VLMs struggle to understand or solve this example.
  • Figure 2: Example ARC task from the public evaluation set: Visually, this task is simple as it resembles stacking rectangles in a three-dimensional manner.
  • Figure 3: Success Rate of VLMs on EasyARC
  • Figure 4: Claude 3.7 success rate across problem types.
  • Figure 5: Example Input-Output for Counting Cells Task
  • ...and 6 more figures
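
To make the transformation described in the Figure 1 caption concrete, here is a minimal sketch of one possible reading of it. It is not the authors' generator or reference solution. The grid encoding (lists of integer colors), background color 0, 4-connectivity, same-color components, and the choice to emit the component as a single row are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import deque


def largest_component(grid, background=0):
    """Return the cells of the largest same-color, 4-connected component.

    Assumptions (not from the paper): background is color 0, components
    are same-colored, and ties go to the component found first in
    row-major scan order.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] == background:
                continue
            color = grid[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            cells = []
            # Breadth-first flood fill over cells sharing this color.
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(cells) > len(best):
                best = cells
    return best


def flatten_component(grid, background=0):
    """Emit the largest component as a one-row grid (assumed meaning of 'flattened')."""
    cells = sorted(largest_component(grid, background))
    return [[grid[y][x] for y, x in cells]]


example = [
    [0, 1, 1],
    [2, 1, 0],
    [2, 0, 3],
]
answer = flatten_component(example)
```

Because the answer is computed deterministically from the input grid, a model's output can be checked by exact comparison, which is the sense in which such tasks are verifiable.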