Explaining multimodal LLMs via intra-modal token interactions

Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao

TL;DR

The paper tackles the interpretability of Multimodal Large Language Models (MLLMs) by addressing intra-modal interactions that prior cross-modal attribution methods overlook. It introduces Multi-Scale Explanation Aggregation (MSEA), which fuses visual attributions across multiple image scales, and Activation Ranking Correlation (ARC), which uses ranking alignment to suppress noisy activations from preceding textual context. Across state-of-the-art MLLMs (e.g., LLaVA-1.5, Qwen2-VL, InternVL2.5) and the COCO Caption, GranDf, and OpenPSG benchmarks, the approach yields more faithful and fine-grained explanations, with F1-IoU gains of up to about 14.5 percentage points and substantial reductions in false positives. The proposed post-hoc framework relies on intra-modal dynamics, requires no retraining, and improves the transparency and reliability of vision-language reasoning across diverse models and scales.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to mitigate this interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce Multi-Scale Explanation Aggregation (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose Activation Ranking Correlation (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
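
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact aggregation rule or correlation measure, so the sketch assumes a plain average of attribution maps resized to a common grid (standing in for MSEA) and a Jaccard overlap of top-$k$ prediction sets (standing in for ARC). All function names here are hypothetical.

```python
from typing import List, Sequence

Map = List[List[float]]  # a 2D attribution map, row-major


def resize_nearest(m: Map, height: int, width: int) -> Map:
    """Nearest-neighbour resize of a 2D map to (height, width)."""
    src_h, src_w = len(m), len(m[0])
    return [
        [m[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def aggregate_multiscale(maps: Sequence[Map], height: int, width: int) -> Map:
    """Assumed stand-in for MSEA: average maps obtained at different input scales.

    Each map is first resized to a common (height, width) grid.
    """
    resized = [resize_nearest(m, height, width) for m in maps]
    n = len(resized)
    return [
        [sum(m[r][c] for m in resized) / n for c in range(width)]
        for r in range(height)
    ]


def top_k_ids(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, best first (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def ranking_relevance(
    current: Sequence[float], context: Sequence[float], k: int
) -> float:
    """Assumed stand-in for ARC: Jaccard overlap of two top-k sets, in [0, 1].

    `current` and `context` are prediction scores over the same vocabulary;
    k must be at least 1.
    """
    a, b = set(top_k_ids(current, k)), set(top_k_ids(context, k))
    return len(a & b) / len(a | b)
```

Under these assumptions, a context token whose top-$k$ predictions barely overlap with those of the current token receives a relevance near 0, and its contribution to the attribution could be down-weighted accordingly; a token with identical top-$k$ predictions receives a relevance of 1.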

Paper Structure

This paper contains 13 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Motivation of our proposed MSEA (a) and ARC (b).
  • Figure 2: Overview of our proposed framework.
  • Figure 3: Performance sensitivity to the number and range of scaling factors across datasets and model architectures. Subfigures (a) and (b) show the impact of varying the number of scaling factors, while subfigures (c) and (d) illustrate the effect of different ranges of scaling factors.
  • Figure 4: Visualization of attribution maps generated using the Qwen2-VL-2B model.
  • Figure 5: Visualization of attribution maps generated using the LLaVA-1.5-7B model.