Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell

TL;DR

The study investigates whether AI models exhibit human-like abstract reasoning across textual and visual modalities on ConceptARC, moving beyond accuracy to analyze the rules models generate. It finds that in text, some models can match or exceed human grid accuracy, but a sizable fraction of correct outputs rely on unintended shortcuts, suggesting surface-pattern reasoning rather than true abstractions. In the visual modality, accuracy drops sharply, though rule analyses indicate that models often capture the intended abstractions but fail to apply them correctly, highlighting modality-dependent gaps. The work advocates for evaluating both outputs and explanatory rules to more faithfully assess multimodal abstract reasoning and to guide future improvements toward human-like abstraction and explainability.

Abstract

OpenAI's o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models' abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models' rules are often based on surface-level "shortcuts" and capture intended abstractions far less often than humans' rules do. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models' output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated: a substantial share of their rules still capture the intended abstractions, but the models are often unable to apply these rules correctly. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate them in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models' abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.

Paper Structure

This paper contains 24 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Each row shows a task from the ConceptARC benchmark. Each task shown consists of three demonstrations of a transformation and one test grid. In this study, the solver is tasked with generating a rule that describes the transformations and applying that rule to the test grid.
  • Figure 2: Results of rule evaluations. For each model in each modality (as well as humans), two bars are given, representing the percentage of correct and incorrect grid outputs over the 480 ConceptARC tasks. Each bar shows the fraction of tasks for which the rule is correct-intended, correct-unintended, and incorrect. The gray areas in the human-result bars represent rules that we could not classify; see the rule-evaluation section for details. The actual percentages corresponding to regions on the bars are given in the appendix on rule-evaluation data.
  • Figure 3: Results of rule evaluations for o3 across all settings. As in Figure 2, two bars showing the percentage of correct and incorrect output grids are included for each setting, with each bar showing the fraction of tasks for which the generated rule is correct-intended, correct-unintended, and incorrect. The actual percentages corresponding to regions on the bars are given in the appendix on rule-evaluation data.
  • Figure 4: Examples of correct-unintended rules. Top: o3, using medium effort and tools, performs shallow inference for a task from the Horizontal vs. Vertical concept group. The model does not recognize the relation between the orientation of the colored shape components and the blue row, but rather focuses on whether a blue ("8") pixel appears in the grid. In this case, the correct-unintended rule works for the given test case, but does not work for other test variants. Middle: o3, using medium effort and tools, on a task from the Complete Shape concept group. The model does not recognize the relation between the colored output shape and the gray prototype and instead overfits to the training examples, producing a correct-unintended rule based on shallow features. Bottom: Claude Sonnet 4 uses a density heuristic to approximate the most overlapped figure on a task from the Top vs. bottom 3D group. While this works for some of the test examples, it does not capture the notion of the bottommost shape in a 3D stack, and there are several possible scenarios in which this approach fails.
  • Figure 5: Two example demonstrations from the concepts with the highest and lowest gaps between human and model performance, CleanUp and Count, respectively. We also show concept-wise output grid accuracy for three reasoning models in the medium-effort, with-tools setting (note that the comparison in the section on per-concept results uses the strongest setting instead).
  • ...and 2 more figures
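
The two-bar breakdown described for Figures 2 and 3 amounts to a cross-tabulation of grid correctness against rule category. The following is a minimal Python sketch of that tally, not the authors' code. The function name `rule_breakdown`, the record format, and the toy data are all hypothetical, and it assumes each segment is reported as a percentage of all tasks, so that the two bars together sum to 100%.

```python
from collections import Counter

# Rule categories named in the figure captions.
RULE_LABELS = ("correct-intended", "correct-unintended", "incorrect")


def rule_breakdown(records):
    """Cross-tabulate grid correctness against rule category.

    `records` is an iterable of (grid_correct, rule_label) pairs, one per
    task. This record format is an assumption made for illustration.
    Returns {grid_correct: {rule_label: percent of all tasks}}.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to tabulate")
    total = len(records)
    counts = Counter(records)  # missing keys count as 0
    return {
        grid_correct: {
            label: 100.0 * counts[(grid_correct, label)] / total
            for label in RULE_LABELS
        }
        for grid_correct in (True, False)
    }


# Toy data (invented, not results from the paper): 8 tasks.
toy_records = (
    [(True, "correct-intended")] * 4
    + [(True, "correct-unintended")] * 2
    + [(False, "correct-intended")] * 1
    + [(False, "incorrect")] * 1
)

breakdown = rule_breakdown(toy_records)
for grid_correct, segments in breakdown.items():
    bar = "correct grid" if grid_correct else "incorrect grid"
    print(bar, segments)
```

In this toy tally, the (incorrect grid, correct-intended) cell corresponds to the case the abstract highlights for the visual modality: a rule that captures the intended abstraction but is not applied correctly. The (correct grid, correct-unintended) cell corresponds to the shortcut cases illustrated in Figure 4.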