Visual Representations inside the Language Model

Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna

TL;DR

This work investigates how visual information is represented and used inside multimodal language models (MLMs) by examining the visual key-value tokens stored in the KV cache of three MLMs. It finds that image value tokens carry sufficient information to support several zero-shot perception tasks and that the flow of this information through the model correlates with downstream perception performance, though MLM finetuning can reduce visual fidelity relative to encoders that have not undergone MLM finetuning. A key discovery is that input-agnostic image keys in later layers can encode artifacts that degrade perception, while text prefixes can steer visual representations to improve performance. The study highlights the need for better control and prompting mechanisms to surface latent visual information, suggests new training directions for both the visual encoder and the language model, and lays groundwork for mechanistic interpretability of visual representations in MLMs.

Abstract

Despite interpretability work analyzing ViT encoders and transformer activations, we don't yet understand why Multimodal Language Models (MLMs) struggle on perception-heavy tasks. We offer an under-studied perspective by examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the flow of visual information through the language model, finding that image value tokens encode sufficient information to perform several perception-heavy tasks zero-shot: segmentation, semantic correspondence, temporal correspondence, and referring expression detection. We find that while the language model does augment the visual information received from the projection of input visual encodings (which we reveal correlates with overall MLM perception capability), it contains less visual information on several tasks than the equivalent visual encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that the visual information corresponding to input-agnostic image key tokens in later layers of language models contains artifacts which reduce perception capability of the overall MLM. Next, we discuss controlling visual information in the language model, showing that adding a text prefix to the image input improves perception capabilities of visual representations. Finally, we reveal that if language models were able to better control their visual information, their perception would significantly improve; e.g., in 33.3% of Art Style questions in the BLINK benchmark, perception information present in the language model is not surfaced to the output! Our findings reveal insights into the role of key-value tokens in multimodal systems, paving the way for deeper mechanistic interpretability of MLMs and suggesting new directions for training their visual encoder and language model components.
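
To make the idea of zero-shot probing concrete, the sketch below shows one common way a correspondence task can be run on per-patch features: for each source patch, pick the target patch with the highest cosine similarity. This is an illustrative assumption, not the paper's stated protocol. The function names and the toy vectors are hypothetical stand-ins for image value tokens read from one layer of the KV cache.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_patch(query_feature, target_features):
    """Return the index of the target patch most similar to the query patch."""
    scores = [cosine(query_feature, t) for t in target_features]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins for image value tokens (one vector per image patch).
# Real value tokens would come from the model's KV cache; these are made up.
source_values = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
target_values = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.3], [0.0, 0.0, 1.0]]

# For every source patch, find its nearest target patch.
matches = [match_patch(q, target_values) for q in source_values]
print(matches)
```

No training is involved in this kind of probe: if the features already encode the relevant visual structure, nearest-neighbor matching alone recovers the correspondence.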

Paper Structure

This paper contains 22 sections, 4 equations, 19 figures, 10 tables.

Figures (19)

  • Figure 1: We study visual representations in the key-value cache within Multimodal Language Models (MLMs), as they are uninfluenced by text (due to the causal nature of cross-modal attention) and directly contribute to the MLM output. Despite MLMs struggling on perception [fu2024blink, tong2024eyes], we find that their intermediate image value tokens encode sufficient information for various zero-shot perception tasks, calling for research to understand and surface the visual information already present within MLMs.
  • Figure 2: Performance of each image value (bottom), and maximum per-layer image value performance (top). On segmentation tasks (left), visual information builds gradually in the first two-thirds of the language model (LM), then drops steeply. On correspondence tasks (right), visual information builds sharply after the first third of the LM, then drops steadily (note the scale of the Y-axis). The LM builds upon the input visual representation (in red).
  • Figure 2: Performance of visual representations on correspondence probing tasks. The highest-performing image value in the LM compares favorably to strong vision encoders.
  • Figure 3: Segmentation performance of image values correlates with MLM perception on downstream tasks.
  • Figure 4: PCA visualization of all image keys in LLaVA-OneVision 7B for two COCO images. The input-agnostic image keys (highlighted) alone are nearly constant across images.
  • ...and 14 more figures
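
The Figure 4 caption notes that input-agnostic image keys are nearly constant across images. A minimal sketch of how such keys might be flagged follows, assuming "input-agnostic" is operationalized as low variance of a key position across different input images. The criterion, the threshold value, and all names here are hypothetical and are not taken from the paper.

```python
def position_variance(keys_per_image, position):
    """Mean per-dimension variance of one key position across images."""
    vectors = [keys[position] for keys in keys_per_image]
    n = len(vectors)
    dim = len(vectors[0])
    total = 0.0
    for d in range(dim):
        column = [v[d] for v in vectors]
        mean = sum(column) / n
        total += sum((x - mean) ** 2 for x in column) / n
    return total / dim


def input_agnostic_positions(keys_per_image, threshold=1e-3):
    """Positions whose key vector barely changes from image to image.

    The threshold is an arbitrary illustrative value.
    """
    num_positions = len(keys_per_image[0])
    return [
        p for p in range(num_positions)
        if position_variance(keys_per_image, p) < threshold
    ]


# Toy data: 3 images, 3 key positions, 2 dimensions each (made-up numbers).
keys_per_image = [
    [[0.50, 0.50], [1.0, 0.0], [0.2, 0.9]],
    [[0.50, 0.51], [0.0, 1.0], [0.8, 0.1]],
    [[0.51, 0.50], [0.5, 0.5], [0.4, 0.4]],
]

agnostic = input_agnostic_positions(keys_per_image)
print(agnostic)
```

In this toy data only the first position stays almost unchanged across the three images, so it is the only one flagged.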