Cross-modal Information Flow in Multimodal Large Language Models

Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

TL;DR

The paper investigates how vision and language interact inside auto-regressive multimodal large language models during visual question answering. It introduces an attention knockout methodology to trace cross-modal information flow across layers in several LLaVA-based models and finds a two-stage integration: in the lower layers, a broad, global fusion of whole-image features into the question representations; in the middle layers, a targeted fusion of object-specific visual cues into the question tokens. The integrated state then propagates to the last position for the final prediction. The answer probability develops in the middle layers after multimodal integration, with semantic generation giving way to syntactic refinement in the higher layers. Question inputs steer the final decision directly, while visual inputs often act indirectly, through the question representations. These insights advance mechanistic interpretability for multimodal models and suggest practical directions for efficiency and robust design, including potential token-compression strategies in higher layers. The authors also release code and a collected dataset for reproducibility and further research.
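
To make the attention knockout idea concrete, here is a minimal sketch in NumPy. It is not the authors' implementation. It assumes a single attention head with a causal mask and a toy sequence laid out as image positions, then question positions, then the last position; the function names `knockout_mask` and `attention` and the toy sizes are hypothetical. Blocking an edge means setting the pre-softmax score of a (target, source) pair to negative infinity, so the target position receives nothing from that source in this layer.

```python
import numpy as np

def knockout_mask(seq_len, blocked_edges=()):
    """Additive causal mask; each blocked (target, source) edge is set to -inf."""
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf  # causal: no attending to the future
    for target, source in blocked_edges:
        mask[target, source] = -np.inf
    return mask

def attention(q, k, v, mask):
    """Single-head scaled dot-product attention with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy layout (an assumption for illustration): 3 image, 2 question, 1 last position.
image_pos, question_pos, last_pos = [0, 1, 2], [3, 4], 5
seq_len, dim = 6, 4

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, seq_len, dim))

# Block the question positions from attending to the image positions.
blocked = [(t, s) for t in question_pos for s in image_pos]
_, base_weights = attention(q, k, v, knockout_mask(seq_len))
_, ko_weights = attention(q, k, v, knockout_mask(seq_len, blocked))

print(base_weights[question_pos][:, image_pos])  # nonzero before knockout
print(ko_weights[question_pos][:, image_pos])    # all zeros after knockout
```

In a real model the same kind of mask would presumably be applied to every head in the layers under study, with the effect read off the answer probability at the last position.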

Abstract

Recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress on vision-language tasks. While a variety of studies investigate the processing of linguistic information within large language models, little is currently known about the inner workings of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with several models from the LLaVA series, we find that there are two distinct stages in the integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of the (linguistic) question tokens. In the middle layers, it once again transfers visual information, this time about specific objects relevant to the question, to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing. Our code and collected dataset are released here: https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git.

Paper Structure

This paper contains 55 sections, 8 equations, 37 figures, 2 tables.

Figures (37)

  • Figure 1: Illustration of the internal mechanism of MLLMs when solving multimodal tasks. From bottom to top layers, the model first propagates general visual information from the whole image into the linguistic hidden representation; next, selected visual information relevant to answering the question is transferred to the linguistic representation; finally, the integrated multimodal information within the hidden representation of the question flows to the last position, facilitating the final prediction. In addition, answers are initially generated in lowercase, and the first letter is then converted to uppercase.
  • Figure 2: The typical architecture of a multimodal large language model. It consists of an image encoder and a decoder-only large language model in which the multimodal information is integrated. We omit the projection matrix for the visual patch features, as it is not essential to our analysis.
  • Figure 3: The relative changes in prediction probability on LLaVA-1.5-13b with six VQA tasks. Question$\nrightarrow$Last, Image$\nrightarrow$Last and Last$\nrightarrow$Last represent preventing the last position from attending to the question, the image and itself, respectively.
  • Figure 4: The relative changes in prediction probability when blocking attention edges from the question positions to the image positions on LLaVA-1.5-13b with six VQA tasks.
  • Figure 5: The relative changes in prediction probability on LLaVA-1.5-13b with six VQA tasks. Related Image Patches$\nrightarrow$question and Other Image Patches$\nrightarrow$question represent blocking the question positions from attending to the image patches inside the region of interest and to the remaining image patches, respectively.
  • ...and 32 more figures
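
The quantity reported in Figures 3 to 5 can be illustrated with a small Python helper. This is a sketch under an assumption: the listing above does not define "relative change in prediction probability", so the code takes it to be the percentage change of the answer probability after an intervention, relative to the unmodified model. The function name `relative_change` and the example probabilities are hypothetical, not results from the paper.

```python
def relative_change(p_base, p_knockout):
    """Percentage change in answer probability caused by an intervention.

    Assumed definition: (p_knockout - p_base) / p_base * 100.
    """
    if p_base <= 0:
        raise ValueError("p_base must be positive")
    return (p_knockout - p_base) / p_base * 100.0

# Hypothetical probabilities of the correct answer, for illustration only.
p_base = 0.5
p_after_knockout = {"lower layers": 0.25, "middle layers": 0.4, "higher layers": 0.5}

changes = {name: relative_change(p_base, p) for name, p in p_after_knockout.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Under this definition, a strongly negative value means the blocked attention edges carried information the prediction depended on, while a value near zero means they did not matter at that depth.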