Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Yongshuai Hou, Weili Guan, Jun Yu, Min Zhang
TL;DR
This work shows that large vision-language models (LVLMs) exhibit spatial bias in semantic understanding, caused by a mismatch between the attention mechanism of the vision encoder and that of the autoregressive LLM. A systematic probing task shows that predictions shift with the spatial placement of key visual content, and further analysis traces the root cause to cross-modal attention dynamics rather than to perception or vision encoding. The authors propose Adaptive Global Context Injection (AGCI), a lightweight method that injects a global visual context g into each image token v_i via v_i' = v_i + \lambda (1 - w_i) g, where w_i = \cos(v_i, g) is the cosine similarity between the token and the global context, restoring holistic image information during cross-modal reasoning. Empirical results across six multimodal benchmarks and multiple LVLMs show that AGCI improves spatial robustness, reduces hallucinations, and enhances downstream performance without architectural modifications. The approach highlights the importance of global visual context and offers a practical, generalizable fix for spatial bias in LVLMs.
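The injection rule quoted in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: this summary does not say how the global context g is built, so the code assumes mean pooling over the image tokens, and the function names (`cosine`, `mean_pool`, `agci`) and the default `lam=0.1` are hypothetical. It uses only the Python standard library to stay dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(tokens):
    """ASSUMPTION: the global context g is the mean of all image tokens."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def agci(tokens, lam=0.1, g=None):
    """Apply v_i' = v_i + lam * (1 - w_i) * g with w_i = cos(v_i, g).

    tokens: list of image-token vectors (lists of floats).
    lam:    injection strength (hypothetical default, not from the paper).
    g:      global context vector; defaults to the mean-pooled tokens.
    """
    if g is None:
        g = mean_pool(tokens)
    injected = []
    for v in tokens:
        w = cosine(v, g)
        scale = lam * (1.0 - w)  # tokens less aligned with g receive more of it
        injected.append([v_k + scale * g_k for v_k, g_k in zip(v, g)])
    return injected


if __name__ == "__main__":
    image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for original, updated in zip(image_tokens, agci(image_tokens, lam=0.5)):
        print(original, "->", updated)
```

Under this rule a token already aligned with g (w_i close to 1) is left almost unchanged, while a token dissimilar to g receives a larger share of the global context, which is what makes the injection adaptive.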
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder, but rather from a mismatch in attention mechanisms between the vision encoder and the large language model, which disrupts the global information flow. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects shared global visual context into each image token. AGCI works without architectural modifications, mitigating spatial bias by enhancing the semantic accessibility of image tokens while preserving the model's intrinsic capabilities. Extensive experiments demonstrate that AGCI not only enhances the spatial robustness of LVLMs, but also achieves strong performance on various downstream tasks and hallucination benchmarks.
