VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs
Shmuel Berman, Jia Deng
TL;DR
This paper interrogates nonlocal visual reasoning in leading Vision-Language Models by introducing three procedurally generated task suites that separately probe comparative perception, saccadic search, and smooth visual search. Despite high perceptual acuity on traditional benchmarks, the evaluated models show substantial deficits in executing visual algorithms that humans perform effortlessly, with several relying on language priors rather than direct visual evidence. The authors provide a detailed analysis of failure modes, heuristic strategies, and the limits of self-correction, demonstrating that current VLMs struggle to generalize visual reasoning across varied, minimally biased scenarios. The work supplies a reusable evaluation framework and highlights the need for integrating robust visual-algorithmic reasoning into future VLM architectures for more reliable, human-like image understanding.
Abstract
Vision-Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We evaluate VLMs' capacity for nonlocal visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant regions of an image. We isolate three distinct forms of nonlocal vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves following a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test whether VLMs can perform visual algorithms similar to those used by humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
