Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu
TL;DR
The paper investigates why vision-language models struggle with spatial reasoning despite having access to spatial cues. It identifies a mismatch in embedding norms between vision and text tokens as a key mechanism that suppresses RoPE-based positional sensitivity, a claim supported by token-norm analyses and residual-stream observations. It introduces an interpretability toolkit (PSI, CMB, and a RoPE probe) and a synthetic benchmark, 2DS, to diagnose how spatial information is used and to test interventions. It shows that simple, principled changes (normalizing vision embeddings and incorporating intermediate visual features) restore spatial sensitivity and improve performance on geometry-aware tasks, offering concrete design guidance for multimodal transformers.
Abstract
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs such as LLaVA underutilize spatial cues despite having positional encodings and spatially rich vision encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text token embeddings, which suppresses the LLM's positional embeddings. To expose this mechanism, we develop three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order; (2) the Cross Modality Balance, which reveals attention head allocation patterns; and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools show that vision tokens and system prompts dominate attention. We validate this mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements.
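The norm imbalance described above can be illustrated with a minimal sketch. This section does not specify the authors' measurement or normalization procedure, so everything below is an assumption made for illustration: the toy random embeddings stand in for real vision and text token embeddings, and the hypothetical helpers `mean_norm` and `rescale_to_norm` implement one plausible choice, rescaling each vision token to the mean text-token norm while keeping its direction.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of a single embedding vector."""
    return math.sqrt(sum(x * x for x in vec))


def mean_norm(tokens):
    """Average L2 norm over a list of token embeddings."""
    return sum(l2_norm(t) for t in tokens) / len(tokens)


def rescale_to_norm(tokens, target_norm, eps=1e-12):
    """Rescale every token to `target_norm`, preserving its direction.

    Assumption: per-token rescaling is only one possible normalization;
    the paper's exact intervention may differ.
    """
    rescaled = []
    for t in tokens:
        scale = target_norm / (l2_norm(t) + eps)
        rescaled.append([x * scale for x in t])
    return rescaled


random.seed(0)
dim = 16

# Toy stand-ins (not real model embeddings): text tokens with small
# norms, vision tokens with much larger norms.
text_tokens = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(8)]
vision_tokens = [[random.gauss(0, 5.0) for _ in range(dim)] for _ in range(32)]

text_mean = mean_norm(text_tokens)
vision_mean = mean_norm(vision_tokens)
norm_ratio = vision_mean / text_mean

# Bring vision tokens onto the same scale as text tokens.
normalized_vision = rescale_to_norm(vision_tokens, text_mean)
normalized_mean = mean_norm(normalized_vision)

print(f"mean text norm:            {text_mean:.3f}")
print(f"mean vision norm:          {vision_mean:.3f}")
print(f"vision/text norm ratio:    {norm_ratio:.1f}")
print(f"mean vision norm (scaled): {normalized_mean:.3f}")
```

In a real model, the same comparison would be run on the projected vision tokens and the text token embeddings at the input of the language model, and any rescaling would need to be evaluated against the model's downstream behavior rather than on norms alone.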
