Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li

TL;DR

This work probes why spatial reasoning is hard for vision-language models by analyzing internal attention mechanisms. It reveals a strong bias toward textual priors and sparse use of image tokens, linking spatial errors to the geometry of attention rather than its quantity. The authors introduce two training-free decoding methods—ScalingVis and AdaptVis—that adjust image-token attention via temperature scaling, with AdaptVis leveraging model confidence to decide when to sharpen or broaden focus. Across synthetic and real datasets (WhatsUp and VSR), these methods yield up to 50 absolute-point improvements with minimal overhead, demonstrating practical gains in spatial grounding and offering a pathway toward more reliable geometric understanding in VLMs.

Abstract

Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing the attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, and that this alignment differs between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose AdaptVis, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model is confident, and to smooth and broaden the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
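
The confidence-gated sharpening and smoothing described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the threshold `beta`, the coefficients `alpha_smooth` and `alpha_sharpen`, and the toy logits are all assumptions chosen for illustration. The sketch only shows the general idea of multiplying the attention logits of image tokens by a coefficient picked from the model's confidence.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_image_attention(logits, image_mask, confidence,
                             beta=0.5, alpha_smooth=0.5, alpha_sharpen=2.0):
    """Scale image-token attention logits by a confidence-dependent coefficient.

    beta, alpha_smooth and alpha_sharpen are hypothetical values, not taken
    from the paper. A coefficient above 1 sharpens the distribution over image
    tokens; a coefficient below 1 smooths it.
    """
    logits = np.asarray(logits, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    alpha = alpha_sharpen if confidence >= beta else alpha_smooth
    # Only image-token positions are rescaled; text-token logits are unchanged.
    scaled = np.where(image_mask, alpha * logits, logits)
    return softmax(scaled)

def image_entropy(attn, image_mask):
    """Entropy of the attention renormalized over image tokens only."""
    p = attn[image_mask]
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

# Toy attention logits for one query: four image tokens, then two text tokens.
logits = np.array([2.0, 0.5, 1.0, 0.1, 3.0, 1.5])
image_mask = np.array([True, True, True, True, False, False])

baseline = softmax(logits)
confident = adaptive_image_attention(logits, image_mask, confidence=0.9)
uncertain = adaptive_image_attention(logits, image_mask, confidence=0.2)

print("entropy over image tokens (baseline): ", image_entropy(baseline, image_mask))
print("entropy over image tokens (confident):", image_entropy(confident, image_mask))
print("entropy over image tokens (uncertain):", image_entropy(uncertain, image_mask))
```

Multiplying the image-token logits by a coefficient `alpha` acts like a softmax temperature of `1/alpha` on those positions, so the entropy over image tokens falls in the confident case and rises in the uncertain case. In a real model the intervention would be applied inside the attention layers during decoding, which this toy example does not attempt.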

Paper Structure

This paper contains 45 sections, 6 equations, 28 figures, 10 tables.

Figures (28)

  • Figure 1: The framework of AdaptVis. We adaptively intervene in the temperature of the attention logits of the image tokens. Top: For generations with low confidence, we smooth the attention distribution to broaden the context window for better concentration on the correct objects. Bottom: For generations with high confidence, we trust the attention pattern and sharpen the attention distribution.
  • Figure 2: Left: Choice counts, object counts and data source by subset in WhatsUp (“Syn” = Synthetic, “Real” = Real). Right: Prompts we use in evaluation.
  • Figure 3: A striking imbalance between visual and textual attention: while image tokens take approximately 90% of the sequence length, they receive only about 10% of the model's total attention on WhatsUp. This severe disparity in attention allocation suggests that VLMs fundamentally underutilize visual information.
  • Figure 4: Accuracy of adding image attention in logit space (which corresponds to the multiplication operation in probability space; the x-axis of the figure represents the multiplication coefficient). AdaptVis, on the other hand, utilizes multiplication in logit space.
  • Figure 5: Attention visualization examples from the WhatsUp Dataset. The left two examples are answered correctly, while the right two are incorrect. For correctly answered questions, the attention scores are precisely focused on the core entities mentioned. In contrast, incorrect answers show attention scores distributed to irrelevant image regions. The visualizations use attention from the 17th layer, and the title in each image is an abbreviation of "Where is A in relation to B".
  • ...and 23 more figures
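
The comparison in the Figure 3 caption, between the share of the sequence occupied by image tokens and the share of attention they receive, can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `image_token_shares`, the array layout (heads by key positions, for a single query token), and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def image_token_shares(attn, image_mask):
    """Return (length_share, attention_share) for image tokens.

    attn: array of shape (num_heads, seq_len) holding attention weights from
          one query token to every key position (assumed layout).
    image_mask: boolean array of shape (seq_len,), True at image-token positions.
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    per_key = attn.mean(axis=0)        # average the heads
    per_key = per_key / per_key.sum()  # renormalize to sum to 1
    length_share = image_mask.mean()   # fraction of positions that are image tokens
    attention_share = per_key[image_mask].sum()
    return float(length_share), float(attention_share)

# Toy input built to mimic the imbalance in the caption:
# 18 image tokens and 2 text tokens, with most attention mass on the text tokens.
num_image, num_text = 18, 2
mask = np.array([True] * num_image + [False] * num_text)
row = np.concatenate([
    np.full(num_image, 0.10 / num_image),
    np.full(num_text, 0.90 / num_text),
])
attn = np.stack([row, row])  # two identical heads, for simplicity

length_share, attention_share = image_token_shares(attn, mask)
print(f"image tokens: {length_share:.0%} of sequence, {attention_share:.0%} of attention")
```

With real models, the attention weights would come from the model's own attention outputs and would typically be aggregated over layers and query positions as well. That aggregation choice is left open here.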