Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu

TL;DR

The paper addresses how language models, trained primarily on text, interpret visual inputs in multimodal settings by identifying a distinct class of visual attention heads. It introduces an entropy-based concentration metric and a head-detection score to quantify visual-head behavior, and shows that these heads concentrate in the early and middle layers across multiple model families and training strategies. Key findings include a correlation between higher attention concentration and improved performance, dynamic head activation that depends on context, and actionable pruning opportunities to improve efficiency. The work advances understanding of multimodal integration in LLMs and offers practical insights for building more efficient, visually capable AI systems.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Attention vs Concentration across Different Models and Datasets: Subplots are arranged with models along the horizontal axis and datasets along the vertical axis, and each point represents an individual attention head. Within each subplot, the x-axis represents the total attention weight assigned to visual tokens, and the y-axis indicates the concentration of attention weights, where higher values signify attention focused on specific areas. The digits at the end of each subplot title give the model's accuracy on that dataset. Attention heads tend to fail under conditions of high content complexity, high weight, or both. Notably, the attention distribution for each model remains consistent across different datasets, especially for the 13B model, which shows low variation across datasets and models. This suggests that the proposed metrics (total attention weight and concentration) are reliable indicators of consistent model behavior across diverse datasets.
  • Figure 2: Example entries in the PointQA Dataset
  • Figure 3: The image heatmap visualizes the total attention weight across image tokens and image region tokens. Attention is concentrated in specific layers, particularly the early and middle layers. Comparing visual and plain generative tasks, we observe that the bounding box does not alter the attention head patterns. However, compared with a plain object prompt, including the object name in the question prompt activates additional attention heads that the visual prompt does not trigger, suggesting that attention heads are activated dynamically depending on the context, whether visual or linguistic. This highlights their ability to adjust their function and behavior in response to changing inputs. A further comparison between versions 1.6 and 1.5 shows improved image attention across all layers in version 1.6, although this pattern is less evident in the 1.6 13B model. Region token attention is omitted for 1.6 because its more complex handling of the input image makes it challenging to track bounding-box token indices. Additionally, comparing the first and second rows of the heatmap shows that the visual prompt does not improve the attention heads' focus on specific regions.
  • Figure 4: The visual heads of models within the same family exhibit a strong correlation, meaning that models of the same type typically share the same set of visual heads. In contrast, the visual heads of models from different families are distinctly different.
  • Figure 5: The attention weights demonstrate inconsistencies when applied to different datasets.
  • ...and 3 more figures
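
The two per-head quantities plotted in Figure 1, total attention weight on visual tokens and concentration, can be illustrated with a minimal sketch. The paper's exact formulas are not reproduced in this summary, so the normalization used here (one minus the entropy divided by its maximum) is an assumption, and the names `visual_attention_stats`, `attn_row`, and `visual_idx` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def visual_attention_stats(
    attn_row: Sequence[float], visual_idx: Sequence[int]
) -> Tuple[float, float]:
    """Return (total_weight, concentration) for one head and one query token.

    attn_row   : softmax attention weights over all input tokens.
    visual_idx : positions of the visual tokens in the input (assumed known).
    """
    visual = [attn_row[i] for i in visual_idx]
    total = sum(visual)
    if total <= 0.0 or len(visual) < 2:
        # Concentration is undefined here; report 0.0 by convention.
        return total, 0.0
    # Renormalize over the visual tokens so the entropy describes only how
    # the visual attention mass is spread, not how large that mass is.
    probs = [w / total for w in visual]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Assumed normalization (not taken from the paper): 1 - H / log(n),
    # giving 0 for a uniform spread and 1 when one token gets all the mass.
    concentration = 1.0 - entropy / math.log(len(visual))
    return total, concentration


# Toy example: tokens 1..3 are visual, and the head looks mostly at token 2.
total, conc = visual_attention_stats([0.1, 0.05, 0.7, 0.05, 0.1], [1, 2, 3])
print(f"total={total:.2f} concentration={conc:.2f}")
```

Under this assumed definition, a head with a high total weight but low concentration spreads its attention across the whole image, while a head that scores high on both attends strongly to a small set of visual tokens.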