Hidden in plain sight: VLMs overlook their visual representations

Stephanie Fu, Tyler Bonnen, Devin Guillory, Trevor Darrell

TL;DR

This work asks why vision-language models (VLMs) struggle on vision-centric tasks by directly comparing them to the visual encoders they incorporate. Evaluating depth estimation, pixel-level and semantic correspondence, 3D object awareness, and art-style matching, the authors show a consistent drop from encoder-based performance to VLM performance, even though vision representations are largely preserved across VLM layers. They examine three candidate bottlenecks: degradation of visual representations, sensitivity to the task prompt, and the LLM's use of visual information. The first two account for little of the gap; the critical bottleneck is that the LLM underutilizes visual information and falls back on its language priors. Finetuning the LLM yields the largest gains and reduces those priors. The findings imply that improving VLMs on vision-centric tasks requires stronger integration between vision representations and the LLM, not only better vision encoders. Overall, the study provides a diagnostic framework for understanding and improving how VLMs use their visual backbones in multimodal reasoning.

Abstract

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
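To make the idea of a "direct readout" concrete, here is a minimal sketch of how a multiple-choice correspondence question could be answered from encoder features alone, without a language model. It is an illustration under stated assumptions, not the authors' evaluation code: the feature vectors are toy values, and the names `cosine`, `readout_choice`, and `accuracy` are hypothetical. The assumed setup is that each question provides one reference feature and several candidate features, and the readout picks the candidate with the highest cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def readout_choice(ref_feature, candidate_features):
    """Return the index of the candidate most similar to the reference."""
    sims = [cosine(ref_feature, c) for c in candidate_features]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(predictions, labels):
    """Fraction of questions answered correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy example (hypothetical features, not real encoder outputs).
ref = [1.0, 0.0, 0.2]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.3],   # closest in direction to the reference
    [-1.0, 0.0, 0.0],
    [0.1, 0.1, 1.0],
]
choice = readout_choice(ref, candidates)
```

The same multiple-choice question could then be posed to the VLM in text form, and the two accuracies compared on the same items.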

Paper Structure

This paper contains 36 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Evaluating vision language models (VLMs) alongside their vision encoders reveals a failure to utilize visual information. To assess VLMs' visual abilities, we compare their performance to the accuracy supported by a direct readout of their visual encoders. Using 'vision-centric' tasks (e.g., visual correspondence), we compare typical VQA-style VLM evaluation (center, bottom) with vision-only methods (center, top). Across tasks, performance plummets from the 'Visual' to 'VLM' evaluations, often from near-ceiling to random chance. We study this trend by analyzing vision representation quality, prompt sensitivity, and the LLM's ability to leverage visual information.
  • Figure 2: Comparing standard visual evaluation to VLMs across vision-centric tasks. Shifting from a standard vision evaluation strategy to a VLM evaluation results in a performance drop, often to chance-level accuracies. Additionally, the vision encoders that perform best at a task (often DINOv2) are not the same vision encoders in more performant VLMs.
  • Figure 3: We find the same trends as in Fig. 2 for common open-source VLMs. We also note that these VLMs instruction-tune their vision encoders along with the rest of the VLM, so they are designed to be most performant when used in tandem with their projector and LLM. Nevertheless, we still see higher task performance when probing the vision representations alone than when querying the VLM.
  • Figure 4: VLM choice behavior reflects the biases of their LLMs. Here we visualize the distribution of answers when models are presented with (blue) and without (orange) a valid image. We find that behaviors largely reflect the pattern of choices in the blind baselines. We take this as evidence that VLMs are not simply misusing their visual representations, but that they inherit their blind biases.
  • Figure 5: Visual evaluations for intermediate VLM layers. We probe vision representations throughout the projector (gray region) and LLM (white region) layers, finding that they generally preserve task-relevant information and show no significant degradation.
  • ...and 12 more figures
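The comparison described in the Figure 4 caption, answer distributions with and without a valid image, can be summarized numerically. The sketch below is one possible way to do so and is not taken from the paper: it uses total variation distance as an assumed metric, the answer lists are made-up toy data, and the function names are hypothetical. A distance near 0 would indicate that the model answers almost the same way whether or not it sees the image.

```python
from collections import Counter


def answer_distribution(answers, options):
    """Normalized frequency of each answer option."""
    counts = Counter(answers)
    total = len(answers)
    return {o: counts.get(o, 0) / total for o in options}


def total_variation(p, q):
    """Total variation distance between two distributions over the same options."""
    return 0.5 * sum(abs(p[o] - q[o]) for o in p)


options = ["A", "B", "C", "D"]

# Toy answers (hypothetical): model responses with and without a valid image.
with_image = ["A", "A", "B", "A", "C", "A", "A", "B"]
blind = ["A", "A", "A", "B", "A", "A", "C", "A"]

p_image = answer_distribution(with_image, options)
p_blind = answer_distribution(blind, options)
tv = total_variation(p_image, p_blind)
```

In this toy data both distributions concentrate on option "A", so the distance is small; that is the kind of pattern the caption describes as the VLM inheriting its blind biases.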