Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu, Sophia Ananiadou
TL;DR
This paper applies mechanistic interpretability to multimodal LLMs by dissecting Llava's VQA through comparisons with Vicuna's textual QA, revealing that visual embeddings encode color and animal information and that deep-layer attention mirrors textual QA mechanics. The authors also demonstrate that Llava enhances Vicuna via visual instruction tuning and introduce a low-cost interpretability tool to identify influential image regions and investigate visual hallucination. The findings support a unified view of VQA and TQA mechanisms in deep layers, offer a practical method for real-time explanations, and have implications for debugging and improving reliability in vision-language systems. Overall, the work advances understanding of MLLM mechanisms and provides actionable tooling for interpretability and safety in VQA tasks.
Abstract
Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multimodal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms of VQA and textual QA (TQA) on color-answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features are highly interpretable when the visual embeddings are projected into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM, Vicuna, during visual instruction tuning. Based on these findings, we develop an interpretability tool that helps users and researchers identify the visual locations most important for the final prediction, aiding the understanding of visual hallucination. Our method is faster and more effective than existing interpretability approaches. Code: https://github.com/zepingyu0512/llava-mechanism
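As an illustration of finding (b), the following is a minimal sketch of what projecting visual embeddings into the embedding space could look like. It is not the authors' implementation: the function name top_tokens, the four-word vocabulary, the identity unembedding matrix, and the patch values are all hypothetical toy stand-ins, and the scoring rule (a plain dot product with the unembedding matrix) is an assumption.

```python
import numpy as np

def top_tokens(visual_embeds, unembed, vocab, k=2):
    """Return the k highest-scoring vocabulary tokens for each visual embedding.

    visual_embeds: (num_patches, d) array of image-patch embeddings (toy values here).
    unembed:       (vocab_size, d) array with one row per vocabulary token.
    """
    # Score every patch against every token (assumed: plain dot product).
    logits = visual_embeds @ unembed.T            # shape (num_patches, vocab_size)
    # Indices of the k largest scores per patch, highest first.
    order = np.argsort(-logits, axis=1)[:, :k]
    return [[vocab[j] for j in row] for row in order]

# Hypothetical toy data: 4 tokens, 4-dimensional embeddings, 2 image patches.
vocab = ["red", "blue", "dog", "cat"]
unembed = np.eye(4)
patches = np.array([[0.9, 0.1, 0.3, 0.0],
                    [0.0, 0.2, 0.1, 0.8]])

result = top_tokens(patches, unembed, vocab)
print(result)
```

With a real model, the patch embeddings would come from the vision projector and the token matrix from the language model, so each image patch could be read off as its nearest vocabulary words, such as color or animal names.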
