Visual serial processing deficits explain divergences in human and VLM reasoning

Nicholas Budny; Kia Ghods; Declan Campbell; Raja Marjieh; Amogh Joshi; Sreejan Kumar; Jonathan D. Cohen; Taylor W. Webb; Thomas L. Griffiths

TL;DR

This work asks why Vision-Language Models (VLMs) underperform humans on basic visual reasoning tasks and proposes a deficit in visually grounded serial processing as a unifying explanation. The authors test this hypothesis in three domains (geometric concepts, numerical estimation, and mental rotation) and show that VLM accuracy declines as serial processing demands increase, while human reaction time (RT) rises, so the human-model gap tracks serial reasoning load. They provide causal evidence by augmenting models with serial reasoning capabilities (chain-of-thought (CoT) prompting, training, and tool use); these augmentations improve performance in task-specific settings but also expose the limits of current augmentation strategies. The results motivate architectures with intrinsic, visually grounded, region-attached serial processing, via visually grounded reinforcement learning and region-focused attention, to close the gap between human and machine visual reasoning.

Abstract

Why do Vision-Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing (whether composing concepts, enumerating items, or performing mental transformations), the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
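The central analysis described in the abstract relates human reaction time to VLM accuracy across task conditions. The following is a minimal sketch of how such a relationship could be quantified with a Pearson correlation. The function name `pearson_r`, the variable names, and all numeric values are hypothetical placeholders chosen for illustration; they are not data or code from the paper, and the authors' actual statistical procedure may differ.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values, NOT results from the paper: one entry per task condition,
# ordered from low to high assumed serial processing load.
human_rt_seconds = [0.6, 0.9, 1.4, 2.1, 3.0]
vlm_accuracy = [0.95, 0.88, 0.74, 0.61, 0.42]

r = pearson_r(human_rt_seconds, vlm_accuracy)
print(f"r = {r:.3f}")  # a negative r means accuracy falls as human RT rises
```

Under the paper's hypothesis, conditions that take humans longer should be the ones where models are less accurate, which would show up here as a strongly negative coefficient.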
