Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu

TL;DR

The paper investigates why vision-language models struggle with spatial reasoning despite access to spatial cues. It identifies an embedding-norm mismatch between vision and text tokens as a key mechanism that suppresses RoPE-based positional sensitivity, supported by token-norm analyses and residual-stream observations. It introduces an interpretability toolkit (PSI, CMB, RoPE probe) and a synthetic 2DS benchmark to diagnose spatial usage and test interventions. It demonstrates that two simple, principled changes (vision embedding normalization and incorporating intermediate visual features) restore spatial sensitivity and improve geometry-aware tasks, offering concrete design guidance for multimodal transformers.

Abstract

Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs, such as LLaVA, underutilize spatial cues despite having positional encodings and spatially rich vision encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text token embeddings, which suppresses the LLM's positional embeddings. To expose this mechanism, we developed three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order, (2) the Cross Modality Balance, which reveals attention head allocation patterns, and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools reveal that vision tokens and system prompts dominate attention. We validated our mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements.
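
The norm imbalance described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the embeddings are synthetic, the helper names (`l2_norm`, `mean_norm`, `rescale_to_norm`) are hypothetical, and rescaling every vision token to the mean text-token norm is only one assumed way to normalize vision embeddings.

```python
import math
import random


def l2_norm(vec):
    """L2 norm of a single token embedding (a list of floats)."""
    return math.sqrt(sum(x * x for x in vec))


def mean_norm(tokens):
    """Average L2 norm over a list of token embeddings."""
    return sum(l2_norm(t) for t in tokens) / len(tokens)


def rescale_to_norm(tokens, target_norm, eps=1e-6):
    """Rescale each token so its L2 norm is (approximately) target_norm.

    Assumption: a per-token rescale; the paper's exact normalization
    scheme is not specified in this section.
    """
    out = []
    for t in tokens:
        scale = target_norm / (l2_norm(t) + eps)
        out.append([x * scale for x in t])
    return out


rng = random.Random(0)
dim = 16

# Synthetic stand-ins, not real model embeddings: text tokens with small
# entries, vision tokens with entries about 100x larger.
text_tokens = [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(8)]
vision_tokens = [[rng.gauss(0.0, 20.0) for _ in range(dim)] for _ in range(32)]

text_mean = mean_norm(text_tokens)
ratio_before = mean_norm(vision_tokens) / text_mean

normalized_vision = rescale_to_norm(vision_tokens, text_mean)
ratio_after = mean_norm(normalized_vision) / text_mean

print(f"vision/text norm ratio before: {ratio_before:.1f}")
print(f"vision/text norm ratio after:  {ratio_after:.3f}")
```

In a real model the same ratio would be computed on the projected vision embeddings and the text embeddings at the LLM input, typically with a tensor library rather than Python lists.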

Paper Structure

This paper contains 36 sections, 12 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Permutation Test: Original (top) vs. randomly permuted vision tokens (bottom). Even though the permutation destroys spatial ordering, the LLM still responds accurately to the prompt "Describe the image," which shows strong robustness to token order and a notable "bag-of-tokens" tendency that disregards spatial relationships. Token embeddings are visualized using cosine similarity relative to a reference token.
  • Figure 2: Performance impact of vision token compression on standard benchmarks (GQA, CV-Bench 2D, and POPE). Only minor accuracy degradation occurs, even under extreme token compression (down to a single token).
  • Figure 3: Distribution of $L_2$ norms for vision and text tokens in COCO validation dataset (log scale). Vision token norms range between $10^1$ and $10^3$, while text token norms range between $3\times10^{-1}$ and $10^0$.
  • Figure 4: Residual stream norms across depth. Layer-wise averages of hidden state norms. Left: text tokens; middle: vision tokens; right: ratio $\frac{\text{vision}}{\text{text}}$. Vision norms are up to an order of magnitude larger than text early on, and the imbalance remains until $\sim$layer 15. +Normalize and +Normalize +Multilayer are our interventions in Section \ref{sec:restoration}, and they balance the vision norm.
  • Figure 5: 2DS examples. Left: two-object layouts; right: three/four-object layouts. Semantics are simple to focus on spatial relations.
  • ...and 10 more figures
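
The permutation test in Figure 1 can be mimicked on toy data. The sketch below is an illustration only, not the paper's Position Sensitivity Index or its evaluation code: `mean_pool`, `position_weighted_pool`, and `permutation_gap` are hypothetical names, and the two readouts stand in for a model that ignores token order and one that uses it.

```python
import random


def mean_pool(tokens):
    """Order-invariant readout: plain average of all tokens."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def position_weighted_pool(tokens):
    """Order-sensitive readout: token i is weighted by its position i + 1."""
    dim = len(tokens[0])
    total = sum(range(1, len(tokens) + 1))
    return [
        sum((i + 1) * t[d] for i, t in enumerate(tokens)) / total
        for d in range(dim)
    ]


def permutation_gap(readout, tokens, rng, trials=20):
    """Mean L2 distance between the readout on the original token order
    and the readout on randomly shuffled copies of the same tokens."""
    base = readout(tokens)
    gaps = []
    for _ in range(trials):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        out = readout(shuffled)
        gaps.append(sum((a - b) ** 2 for a, b in zip(base, out)) ** 0.5)
    return sum(gaps) / trials


rng = random.Random(0)
# Synthetic "vision tokens": 16 tokens of dimension 8.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

gap_bag = permutation_gap(mean_pool, tokens, rng)
gap_positional = permutation_gap(position_weighted_pool, tokens, rng)

print(f"order-invariant readout gap: {gap_bag:.2e}")
print(f"order-sensitive readout gap: {gap_positional:.2e}")
```

A readout whose gap stays near zero under shuffling behaves like a "bag of tokens"; a clearly nonzero gap indicates that token order influences the output.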