Understanding Information Storage and Transfer in Multi-modal Large Language Models

Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti

TL;DR

This work investigates how multi-modal LLMs store factual information and transfer it to produce answers. It introduces a constraint-based framework and MultiModalCausalTrace to identify storage sites in early layers, plus attention-contribution analyses to map information flow, validated on LLaVa-7B and multi-modal Phi-2 with the VQA-Constraints dataset. Key findings reveal that storage occurs in early MLP/self-attention layers, transfer uses a small set of visual tokens, and mid-layer attention moves information to the final token; a model-editing method, MultEdit, can correct errors and insert long-tailed knowledge by editing early causal blocks. Together, these contributions support a mechanistic understanding of multi-modal models and a practical way to correct them.

Abstract

Understanding the mechanisms of information storage and transfer in Transformer-based models is important for advancing model understanding. Recent work has studied these mechanisms for Large Language Models (LLMs), revealing insights on how information is stored in a model's parameters and how information flows to and from these parameters in response to specific prompts. However, these studies have not yet been extended to Multi-modal Large Language Models (MLLMs). Given their expanding capabilities and real-world use, we start by studying one aspect of these models -- how MLLMs process information in a factual visual question answering task. We use a constraint-based formulation which views a visual question as having a set of visual or textual constraints that the model's generated answer must satisfy to be correct (e.g. "What movie directed by the director in this photo has won a Golden Globe?"). Under this setting, we contribute i) a method that extends causal information tracing from pure language to the multi-modal setting, and ii) VQA-Constraints, a test-bed of 9.7K visual questions annotated with constraints. We use these tools to study two open-source MLLMs, LLaVa and multi-modal Phi-2. Our key findings show that these MLLMs rely on MLP and self-attention blocks in much earlier layers for information storage, compared to LLMs whose mid-layer MLPs are more important. We also show that a consistent small subset of visual tokens output by the vision encoder is responsible for transferring information from the image to these causal blocks. We validate these mechanisms by introducing MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs by targeting these causal blocks.

Paper Structure (27 sections, 7 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: MLLMs retrieve information from earlier internal layers compared to their LLM counterparts. We find that very early MLP layers [1-4] have high indirect estimation effects on outputs (i.e., they are causal) in LLaVa-7B, whereas the middle MLP layers [4-7] are causal in LLaMA (Vicuna)-7B. For LLaMA, a larger window size (e.g., 5) is also required to find causal sites, compared to a window size of 1 for LLaVa-7B.
  • Figure 2: We introduce MultiModalCausalTrace, a causal tracing method to understand information storage in MLLMs. A clean model is corrupted by replacing the question's constraint with an incorrect one for the given image (e.g. "This place" --> "Paris city" for an image of "Vinson Massif"). The activations of windows of layers are then iteratively copied from the clean to the corrupted model until the corrupted model restores its output probability to match the clean model's.
  • Figure 3: Information to answer a visual question with a single constraint is mainly retrieved from early MLP and self-attention layers in MLLMs. MultiModalCausalTrace obtains high indirect estimation effect values in LLaVa's early MLP and self-attention blocks corresponding to the visual constraint, across all 3 datasets in VQA-Constraints. This suggests these layers are causally important for information storage. The causal traces emerge with a window size of 3 (see results with a window size of 1 in \ref{b_tracing}).
  • Figure 4: Information to answer a visual question with a visual and textual constraint is retrieved from early and middle MLP and self-attention layers in MLLMs. This suggests that meeting multiple constraints requires more parametric memory than meeting a single constraint. We show that MultiModalCausalTrace obtains high indirect estimation effect values in LLaVa's early and middle layers on the OK-VQA dataset in VQA-Constraints (see multi-constraint results from the Movies dataset in \ref{e_multiconstraint}).
  • Figure 5: The late visual tokens are primarily responsible for transferring information from the image to the early causal layers, via the first self-attention layer. We visualize attention contributions (see Eq. (\ref{eq_attn_contrib})) from the visual tokens to the visual constraint token averaged across the three datasets in VQA-Constraints.
  • ...and 13 more figures
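
The tracing procedure described in the Figure 2 caption (corrupt the input, copy windows of clean activations into the corrupted run, and measure how much of the answer probability is restored) can be illustrated with a small sketch. This is not the authors' MultiModalCausalTrace implementation: the toy residual network, the random vectors standing in for the clean and corrupted constraint, and the names `run` and `causal_trace` are all assumptions made for illustration, and the sketch operates on a single vector rather than on token positions in a real MLLM.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM, N_CLASSES = 8, 16, 5

# Hypothetical toy "model": residual tanh blocks followed by a linear readout.
weights = [rng.normal(scale=0.5, size=(DIM, DIM)) for _ in range(N_LAYERS)]
readout = rng.normal(size=(DIM, N_CLASSES))


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def run(x, patch=None):
    """Forward pass; `patch` maps a layer index to a block output to copy in.

    Returns the output probabilities and the list of block outputs.
    """
    patch = patch or {}
    h = x
    block_outputs = []
    for i, w in enumerate(weights):
        block_out = np.tanh(h @ w)
        if i in patch:
            block_out = patch[i]  # overwrite with the clean activation
        block_outputs.append(block_out)
        h = h + block_out  # residual update
    return softmax(h @ readout), block_outputs


def causal_trace(x_clean, x_corrupt, window=3):
    """Indirect effect per layer: restored minus corrupted answer probability."""
    p_clean, clean_outputs = run(x_clean)
    answer = int(np.argmax(p_clean))
    p_corrupt, _ = run(x_corrupt)
    effects = []
    for center in range(N_LAYERS):
        lo = max(0, center - window // 2)
        hi = min(N_LAYERS, center + window // 2 + 1)
        patch = {i: clean_outputs[i] for i in range(lo, hi)}
        p_patched, _ = run(x_corrupt, patch)
        effects.append(p_patched[answer] - p_corrupt[answer])
    return p_clean[answer], p_corrupt[answer], np.array(effects)


x_clean = rng.normal(size=DIM)    # stands in for the correct constraint
x_corrupt = rng.normal(size=DIM)  # stands in for the swapped-in constraint
p_clean, p_corrupt, effects = causal_trace(x_clean, x_corrupt, window=3)
for layer, effect in enumerate(effects):
    print(f"window centered at layer {layer}: indirect effect {effect:+.3f}")
```

In this sketch, layers whose patched windows yield a large positive indirect effect play the role of the "causal" sites in the captions above; the `window` argument mirrors the window sizes (1, 3, 5) mentioned there.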