Visual serial processing deficits explain divergences in human and VLM reasoning

Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D. Cohen, Taylor W. Webb, Thomas L. Griffiths

TL;DR

This work investigates why Vision-Language Models (VLMs) underperform humans on basic visual reasoning tasks and proposes a deficit in visually grounded serial processing as a unifying explanation. The authors test this hypothesis in three domains (geometric reasoning, perceptual enumeration, and mental rotation) and show that VLM accuracy declines as serial processing demands increase, while human reaction time (RT) rises, revealing a robust human-model gap tied to serial reasoning. They provide causal evidence by augmenting models with serial reasoning capabilities (chain-of-thought (CoT) prompting, training, and tool use), which yields improvements in task-specific settings and highlights the limits of current augmentation strategies. The results motivate architectural directions toward intrinsically visually grounded, region-attached serial processing, via visually grounded reinforcement learning and region-focused attention, to bridge the gap between human and machine visual reasoning.

Abstract

Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing, whether composing concepts, enumerating items, or performing mental transformations, the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
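
The central analysis described in the abstract, relating human reaction time to VLM accuracy across conditions, can be illustrated with a minimal sketch. It assumes one mean human RT and one mean model accuracy per condition; the helper names (`z_score`, `pearson_r`) and every number in it are hypothetical placeholders, not the authors' code or data, and the paper's actual z-scoring procedure (for example, per participant) may differ.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical placeholder values, NOT results from the paper:
# one entry per condition, ordered by increasing serial processing demand.
human_rt_ms = [620.0, 700.0, 810.0, 950.0, 1100.0]
model_accuracy = [0.95, 0.90, 0.70, 0.55, 0.40]

rt_z = z_score(human_rt_ms)
r = pearson_r(rt_z, model_accuracy)
print(f"Pearson r between z-scored human RT and model accuracy: {r:.3f}")
```

A negative `r` corresponds to the pattern the abstract reports: conditions where humans take longer are the ones where models are less accurate. Z-scoring does not change the Pearson coefficient itself; it only puts reaction times on a common scale, which matters when pooling across participants or tasks.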

Paper Structure

This paper contains 41 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Geometric Reasoning Task. a) Example oddball detection trials from a subset of 37 geometric concepts spanning both primitive elements (left) and relational constraints (right), generated with Geoclidean DSL. b) Relationship between z-scored human reaction time and model accuracy; each point is one geometric concept and color denotes program complexity (Minimum Description Length, MDL). c) Correlation between human and model accuracy across concepts. d) Trial-level correlation between human reaction time (RT) and model accuracy. e) Human RT (blue, left axis) and model accuracy (red, right axis) as a function of MDL. Shaded regions are 95% confidence intervals.
  • Figure 2: Numerosity Task. a) Example stimuli from the four experimental conditions: non-overlapping uniformly colored (top-left), non-overlapping uniquely colored (top-right), overlapping uniformly colored (bottom-left), and overlapping uniquely colored (bottom-right). b) Model accuracy as a function of numerosity across the four conditions. c) Mean accuracy for humans and models across each condition, averaged across numerosities. d) Human z-scored reaction times as a function of numerosity across the four conditions. Shaded regions indicate 95% confidence intervals.
  • Figure 3: Mental Rotation Task. a) Example stimuli from the mental rotation task. Participants and models judged whether a rotated letter was the same or a mirror-reversed version of a reference. b) Human error rate (solid line), model error rate (dashed line), and z-scored human reaction time (right axis) plotted against relative rotation angle. Shaded regions denote 95% confidence intervals. c) Relationship between z-scored human reaction time and model accuracy across rotation angles ($\leq 90^{\circ}$). Each point corresponds to a single rotation angle; colors indicate relative angular disparity.
  • Figure 4: Results for Augmented VLMs. a) Accuracy as a function of MDL for humans and different model conditions. b) Error rate as a function of relative rotation angle for humans (green line) and models under different augmentation paradigms. c) Mean accuracy for humans and models across each condition, averaged across numerosities. d) Accuracy on a challenging subset of Geoclidean and rotation tasks, comparing humans, reasoning-augmented models, and GPT-o3 with tool use. e) Accuracy by condition for $n = 8$ items, across humans, reasoning models, and GPT-o3 with tool use. Shaded regions in a) and b) represent 95% confidence intervals.
  • Figure 5: Human RT vs. Model Accuracy by Model. Z-scored human reaction time and model accuracy plotted across geometric concepts, shown separately for each VLM. Each point corresponds to one concept, colored by program complexity (MDL). Black lines show linear regression fits with 95% confidence intervals.
  • ...and 9 more figures