Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu
TL;DR
The paper addresses how language models, trained primarily on text, interpret visual inputs in multimodal settings by identifying a distinct class of visual attention heads. It introduces an entropy-based concentration metric and a head-detection score to quantify visual-head behavior, and demonstrates that these heads concentrate in the early and middle layers across multiple model families and training strategies. Key findings include a correlation between higher attention concentration and improved performance, dynamic head activation that depends on context, and actionable pruning opportunities to improve efficiency. The work advances understanding of multimodal integration in LLMs and offers practical insights for building more efficient, visually capable AI systems.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper addresses this question through a systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.
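
To make the idea of attention "concentration on visual tokens" concrete, the following is a minimal sketch and not the paper's actual metric, whose exact definition is not given in this summary. It assumes a single head's attention row for one query token and a boolean mask marking which key positions are visual tokens. The function name `visual_attention_stats` and both returned quantities (the share of attention mass on visual tokens, and one minus the normalized entropy of the attention within them) are illustrative assumptions.

```python
import math


def visual_attention_stats(attn_row, visual_mask):
    """Hypothetical helper, not the paper's definition.

    attn_row:    attention weights of one head for one query token
                 (non-negative, expected to sum to about 1).
    visual_mask: booleans, True where the key position is a visual token.

    Returns (visual_mass, concentration):
      visual_mass   - share of attention placed on visual tokens.
      concentration - 1 minus the normalized entropy of the attention
                      restricted to visual tokens; 0 means spread evenly,
                      1 means focused on a single visual token.
    """
    if len(attn_row) != len(visual_mask):
        raise ValueError("attn_row and visual_mask must have equal length")

    visual = [w for w, is_visual in zip(attn_row, visual_mask) if is_visual]
    visual_mass = sum(visual)
    n = len(visual)

    # Entropy is undefined (or trivial) with fewer than two visual tokens
    # or when no attention lands on them; report zero concentration.
    if n < 2 or visual_mass <= 0:
        return visual_mass, 0.0

    # Renormalize so the weights over visual tokens form a distribution.
    probs = [w / visual_mass for w in visual]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)

    # Divide by log(n), the maximum possible entropy over n tokens.
    concentration = 1.0 - entropy / math.log(n)
    return visual_mass, concentration


# Attention spread evenly over four tokens, two of them visual.
spread = visual_attention_stats([0.25, 0.25, 0.25, 0.25],
                                [True, True, False, False])

# Attention focused on one of three visual tokens.
focused = visual_attention_stats([0.0, 0.9, 0.0, 0.1],
                                 [True, True, True, False])
```

Under these assumptions, a head that places most of its attention on visual tokens and focuses it on a few of them would score high on both quantities, which is the kind of behavior the abstract attributes to visual heads.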
