
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

Abstract

Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline, which focuses on mapping visual information into the textual space. Consequently, VLMs can only reason about visual entities that map to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities: for visual entities they cannot map to textual representations, they fall back on brittle, hallucinated textual descriptions. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are not. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens for them than for unnameable entities. Furthermore, we show that teaching models completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect shortcuts learned during training rather than a fundamental limitation of multimodal architectures.
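
The Logit Lens analysis mentioned above follows the standard recipe of decoding intermediate hidden states through the model's output head. Below is a minimal sketch, assuming Hugging Face-style `hidden_states` (one tensor per layer, obtained with `output_hidden_states=True`) and using `unembed` as a stand-in for the model's `lm_head` (with any final layer norm applied beforehand); the names here are illustrative, not the paper's code.

```python
import torch

def logit_lens(hidden_states, unembed, tokenizer, top_k=5):
    """Decode each layer's hidden state at one probed token position by
    projecting it through the unembedding matrix and reading off the
    top-k vocabulary tokens."""
    per_layer_tokens = []
    for h in hidden_states:            # h: (d_model,) hidden state at the probed position
        logits = unembed(h)            # (vocab_size,) scores in vocabulary space
        top_ids = torch.topk(logits, top_k).indices
        per_layer_tokens.append([tokenizer.decode(int(i)) for i in top_ids])
    return per_layer_tokens            # top-k decoded tokens, one list per layer
```

On this reading of the abstract, a nameable entity should surface its label among the top-k tokens in later layers, while an unnameable one should keep decoding to generic tokens.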

Paper Structure

This paper contains 29 sections, 17 figures, and 8 tables.

Figures (17)

  • Figure 1: Overview of the correspondence task framed as multiple-choice VQA. A reference point labeled "REF" in the first image must be matched to one of four candidate regions (A, B, C, D) in the second image. We evaluate VLMs in three setups: Direct answer, Chain-of-Thought, and Representation Probing.
  • Figure 2: Example of our 2D shape correspondence task. A face correspondence example is presented in Appendix Fig. \ref{fig:face_corr_task}.
  • Figure 3: Logit Lens analysis. Top row: layerwise decoded tokens for a known shape (star) and a known face (Jungkook) in Gemma3-12B, showing the progression from semantically unrelated tokens to exact labels and encyclopedic associations. Bottom row: Mean Jaccard Distance across layers for shapes and faces in Gemma3-12B and Qwen3VL-8B. Known entities (blue) yield consistently higher Mean Jaccard Distance than unknown entities (orange), confirming greater semantic discernibility in the hidden representations (a sketch of this metric follows the figure list).
  • Figure 4: Direct VQA accuracy on unknown shape correspondence after learning arbitrary names. Rep. Probe (pre-finetuning) is shown as a reference.
  • Figure 5: Direct VQA accuracy on shape correspondence after finetuning on squiggles ($n{=}30$). Base Acc. is accuracy before finetuning; FT Acc. is accuracy after finetuning.
  • ...and 12 more figures
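
The Mean Jaccard Distance from Figure 3 can be sketched as follows. This is a hedged reconstruction: it assumes the metric averages, over layers, the Jaccard distance between the top-k token sets decoded (e.g. via the Logit Lens sketch above) for two different entities, so that higher values indicate more distinct, semantically discernible representations; the function names are illustrative.

```python
def jaccard_distance(tokens_a, tokens_b):
    """1 - |A ∩ B| / |A ∪ B| between two sets of decoded tokens (1.0 = disjoint)."""
    a, b = set(tokens_a), set(tokens_b)
    return (1.0 - len(a & b) / len(a | b)) if (a | b) else 0.0

def mean_jaccard_distance(layers_a, layers_b):
    """Average per-layer Jaccard distance between two entities' top-k
    decoded token sets (one set per layer)."""
    dists = [jaccard_distance(ta, tb) for ta, tb in zip(layers_a, layers_b)]
    return sum(dists) / len(dists)
```

Under this reading, two known entities whose layers decode to distinct labels score near 1.0, while two unknown entities that both decode to generic tokens share more of their sets and score lower.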