VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Shmuel Berman, Jia Deng

TL;DR

This paper interrogates nonlocal visual reasoning in leading Vision-Language Models by introducing three procedurally generated task suites that separately probe comparative perception, saccadic search, and smooth visual search. Despite high perceptual acuity on traditional benchmarks, the evaluated models show substantial deficits in executing visual algorithms that humans perform effortlessly, with several relying on language priors rather than direct visual evidence. The authors provide a detailed analysis of failure modes, heuristic strategies, and the limits of self-correction, demonstrating that current VLMs struggle to generalize visual reasoning across varied, minimally biased scenarios. The work supplies a reusable evaluation framework and highlights the need for integrating robust visual-algorithmic reasoning into future VLM architectures for more reliable, human-like image understanding.

Abstract

Vision-Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation of vision-language models' capacity for nonlocal visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant regions of an image. We isolate three distinct forms of nonlocal vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves following a continuous contour. Flagship models (e.g., GPT-5, Gemini 2.5 Pro, Claude Sonnet 4), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test whether VLMs can perform visual algorithms similar to those used by humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

Paper Structure

This paper contains 24 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Object Re-Identification (top): Determine whether the same object that appears in Image 1 also appears in Image 2, up to a transform of the entire object but not individual component shapes. Visual Scavenger Hunt (bottom-left): From the indicated shape, follow the written labels for the specified count and report the final shape’s color. Circuit Connections (bottom-right): From a named port on the central breadboard, trace the wire to its endpoint. The prompts here are abbreviated; full instructions are in the appendix.
  • Figure 2: Accuracy on all variants of Object Re-Identification.
  • Figure 3: F1 for positive and negative classes across all trials of Object Re-Identification. Some models predict identically across the majority of trials (red box). The strong models (orange box) perform poorly on the standard variant but become better at recognizing similar objects when tested on the other two variants.
  • Figure 4: Accuracy on the Visual Scavenger Hunt task. Only GPT-5, o4-mini, and Gemini 2.5 Pro significantly outperform random chance.
  • Figure 5: Example responses for Visual Scavenger Hunt from our qualitative analysis. Most models can locate the first shape but have trouble extending the chain from there. o4-mini and Gemini 2.5 Pro are both high-performing but use different strategies.
  • ...and 6 more figures
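
The Visual Scavenger Hunt procedure described in the Figure 1 caption (start at an indicated shape, follow the written labels for a specified count, report the final shape's color) can be read as a short pointer-chasing loop. The sketch below is illustrative only and is not the authors' generator or evaluation code. It assumes that each label names the next shape by a (shape, color) pair and that this pair is unique within the scene; the `Shape` class, its field names, and `follow_chain` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """One labelled shape in a hypothetical scene (all field names are assumptions)."""
    kind: str        # e.g. "circle"
    color: str       # e.g. "red"
    next_kind: str   # shape named by the label written on this shape
    next_color: str  # color named by the label written on this shape


def follow_chain(scene, start, steps):
    """Follow the written labels for `steps` hops and return the final shape's color."""
    # Assumption: a (kind, color) pair identifies exactly one shape in the scene.
    index = {(s.kind, s.color): s for s in scene}
    current = index[start]
    for _ in range(steps):
        # Each hop is a discrete jump to the shape that the current label names.
        current = index[(current.next_kind, current.next_color)]
    return current.color


scene = [
    Shape("circle", "red", "square", "blue"),
    Shape("square", "blue", "triangle", "green"),
    Shape("triangle", "green", "circle", "red"),
]
print(follow_chain(scene, ("circle", "red"), 2))  # green
```

Given the scene as symbols, each hop is a single lookup. The task asks a model to perform the same lookups from pixels, locating each named shape in the image and reading its label before making the next jump.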