Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering

Zeping Yu, Sophia Ananiadou

TL;DR

This paper applies mechanistic interpretability to multimodal LLMs by dissecting Llava's VQA through comparisons with Vicuna's textual QA, revealing that visual embeddings encode color and animal information and that deep-layer attention mirrors textual QA mechanics. The authors also demonstrate that Llava enhances Vicuna via visual instruction tuning and introduce a low-cost interpretability tool to identify influential image regions and investigate visual hallucination. The findings support a unified view of VQA and TQA mechanisms in deep layers, offer a practical method for real-time explanations, and have implications for debugging and improving reliability in vision-language systems. Overall, the work advances understanding of MLLM mechanisms and provides actionable tooling for interpretability and safety in VQA tasks.

Abstract

Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: https://github.com/zepingyu0512/llava-mechanism
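Finding (b) in the abstract can be illustrated with a toy sketch. One simple way to read "projecting the visual embeddings into the embedding space" is to score a patch embedding against every token embedding and list the closest tokens. The snippet below assumes that reading; the 4-dimensional vectors, the `TOKEN_EMBEDDINGS` table, and the `project_to_vocab` helper are all hypothetical and are not taken from the authors' code.

```python
def project_to_vocab(patch_embedding, token_embeddings, top_k=3):
    """Rank vocabulary tokens by dot product with one visual patch embedding."""
    scores = {
        token: sum(p * e for p, e in zip(patch_embedding, vector))
        for token, vector in token_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy embedding table (hypothetical values, one axis per token for clarity).
TOKEN_EMBEDDINGS = {
    "brown": [1.0, 0.0, 0.0, 0.0],
    "white": [0.0, 1.0, 0.0, 0.0],
    "dog": [0.0, 0.0, 1.0, 0.0],
    "cat": [0.0, 0.0, 0.0, 1.0],
}

# A made-up patch embedding that mostly mixes the 'brown' and 'dog' directions.
patch = [0.9, 0.1, 0.8, 0.05]

top_tokens = project_to_vocab(patch, TOKEN_EMBEDDINGS, top_k=2)
print(top_tokens)
```

In a real model the table would be the language model's token embedding (or unembedding) matrix and the patch vector would come from the projected CLIP features; which of the two matrices the paper uses is not stated in this summary.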

Paper Structure

This paper contains 15 sections, 8 equations, 15 figures.

Figures (15)

  • Figure 1: (a) Overall structure of Llava for VQA. The input of Llava is an image and a question. The image $X_v$ is transformed into image embeddings $H_v$ by the CLIP visual encoder followed by a projection $W$. The question $X_q$ is transformed into question embeddings $H_q$ by the embedding layer. The model generates the answer $X_a$ based on $H_v$ and $H_q$. (b) Mechanism of textual QA in Vicuna. In shallow layers, the color position ('brown') extracts the animal features ('dog'). In the attention heads of deep layers, the value-output matrices extract the color features ('brown') and the query-key matrices compute the similarity score between the last position (encoding the question about the dog) and the color position's features ('dog'). The larger the similarity score, the higher the probability of the final prediction 'brown'. (c) Mechanism of visual QA. The visual embeddings already contain the color features (brown, white) and the animal features (dog, cat). In the attention heads of deep layers, the value-output matrices extract the color features and the query-key matrices compute the similarity between the last position (encoding the question about the dog) and each position (encoding dog/cat).
  • Figure 2: Identifying important image patches related to final predictions.
  • Figure 3: Analysis of the information stored at the color position in Vicuna TQA. (a) Information stored in the color position's value-output vector for the correct color versus a random color. (b) Information stored in the color position's layer input vector for the correct animal versus a random animal. (c) Attention score at the color position when the question asks about the same animal as the textual context versus a different animal.
  • Figure 4: Analysis of the information stored at the top-20 important positions in Llava VQA. (a) Information stored in the top-20 positions' value-output vectors for the correct color versus a random color. (b) Information stored in the top-20 positions' layer input vectors for the correct animal versus a random animal. (c) Summed attention score over the top-20 positions when the question asks about the same animal as the image versus a different animal.
  • Figure 5: Top-10 important heads in Vicuna TQA, Llava TQA, and Llava VQA.
  • ...and 10 more figures
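
The deep-layer mechanism described in the Figure 1 caption can be sketched with a single toy attention head. This is a simplified illustration under stated assumptions, not the authors' implementation: each position is a hand-made 4-dimensional vector laid out as [dog, cat, brown, white], the query-key step is reduced to a dot product over the two animal dimensions, the value-output step is reduced to reading the two color dimensions, and the `scale` factor and the `head_output` helper are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Feature layout (assumed for illustration): [dog, cat, brown, white].
POSITIONS = {
    "patch_dog": [1.0, 0.0, 1.0, 0.0],  # encodes a brown dog
    "patch_cat": [0.0, 1.0, 0.0, 1.0],  # encodes a white cat
}

# Last position: the question asks about the dog.
QUESTION = [1.0, 0.0, 0.0, 0.0]


def head_output(query, positions, scale=4.0):
    """Toy head: query-key scores on animal dims, value-output reads color dims."""
    names = list(positions)
    # Query-key step: similarity between the question and each position's animal features.
    scores = [scale * dot(query[:2], positions[n][:2]) for n in names]
    weights = softmax(scores)
    # Value-output step: attention-weighted sum of each position's color features.
    brown = sum(w * positions[n][2] for w, n in zip(weights, names))
    white = sum(w * positions[n][3] for w, n in zip(weights, names))
    return dict(zip(names, weights)), {"brown": brown, "white": white}


attention, colors = head_output(QUESTION, POSITIONS)
print(attention)
print(colors)
```

Swapping `QUESTION` for `[0.0, 1.0, 0.0, 0.0]` (a question about the cat) shifts the attention to `patch_cat` and the output toward 'white', which is the behavior the caption attributes to the query-key similarity score.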