Feature-level Interaction Explanations in Multimodal Transformers

Yeji Kim, Housam Khalifa Bashier Babiker, Mi-Young Kim, Randy Goebel

Abstract

Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MM-IMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy gap degrades performance more than masking randomly chosen pairs under the same budget, which supports the claim that the identified interactions are causally relevant.
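To make the Monte Carlo interaction probe more concrete, the following is a minimal sketch of a sampling estimator for the standard pairwise Shapley Interaction Index. It is not the authors' implementation: the function name `monte_carlo_sii`, the toy value function `toy_value`, and the feature names (`t0`, `t1`, `p0`, `p1`) are hypothetical, and the value function stands in for a model score evaluated with only a subset of features left unmasked. The sampler draws a coalition size uniformly and then a uniform subset of that size, which reproduces the SII coalition weights.

```python
import random
from typing import Callable, FrozenSet, Sequence


def monte_carlo_sii(
    value_fn: Callable[[FrozenSet[str]], float],
    features: Sequence[str],
    i: str,
    j: str,
    num_samples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate the pairwise Shapley Interaction Index of features i and j.

    Each sample draws a coalition S from the remaining features and averages
    the discrete second derivative v(S+ij) - v(S+i) - v(S+j) + v(S).
    """
    rng = random.Random(seed)
    others = [f for f in features if f != i and f != j]
    total = 0.0
    for _ in range(num_samples):
        # Uniform size in 0..n-2, then a uniform subset of that size:
        # this matches the SII weight |S|!(n-|S|-2)!/(n-1)!.
        size = rng.randint(0, len(others))
        subset = frozenset(rng.sample(others, size))
        total += (
            value_fn(subset | {i, j})
            - value_fn(subset | {i})
            - value_fn(subset | {j})
            + value_fn(subset)
        )
    return total / num_samples


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical score: t0 and p0 only help together; t1 helps on its own."""
    score = 0.0
    if "t0" in coalition and "p0" in coalition:
        score += 1.0  # synergistic text-token / image-patch pair
    if "t1" in coalition:
        score += 0.5  # purely additive (unique) contribution
    return score


FEATURES = ["t0", "t1", "p0", "p1"]
sii_synergy = monte_carlo_sii(toy_value, FEATURES, "t0", "p0")
sii_additive = monte_carlo_sii(toy_value, FEATURES, "t0", "t1")
print(sii_synergy, sii_additive)
```

In this toy setting the synergistic pair receives a clearly positive score while the pair with no interaction scores zero. With a real model, each call to the value function would be a forward pass on a masked input, so the number of samples trades estimation variance against compute.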

Paper Structure

This paper contains 27 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Input modalities for the qualitative MM-IMDb example: (a) movie poster image and (b) plot summary (Four Minutes, true labels: drama, music).
  • Figure 2: Overview of the compared architectures and our analysis pipeline. (a) Dense Transformer with pooled modality encoders and a single fusion module. (b) Original I$^2$MoE with uniqueness, synergy, and redundancy experts operating on pooled modality vectors. (c) Proposed FL-I$^2$MoE: patch- and token-level encoders feed interaction experts, which output class logits for the main task. In the dashed region, at inference time, we derive expert-wise feature importance maps, propagate them to select cross-modal token-patch pairs, and then compute synergy and redundancy-gap scores to identify synergistic or redundant pairs.
  • Figure 3: Average performance drop when masking the top-$K\%$ features according to different attribution methods (Random, AttnRoll, IG, and Grad$\times$AttnRoll).
  • Figure 4: Alignment between expert-wise importance and Monte Carlo interaction metrics. Top: mean SII (synergy expert). Bottom: mean redundancy gap $R_{\text{red}}$ (redundancy expert). Bars compare importance bins (x-axis) and top-$q\%$ sets.
  • Figure 5: Qualitative comparison on an MM-IMDb example. (a) Dense Transformer with separate image and text saliency maps. (b) FL-I$^2$MoE. Top row: uni-text, uni-image, synergy, and redundancy experts. Bottom row: token-patch pairs ranked by Monte Carlo interaction metrics for high synergy (SII) or redundancy gap.
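
The top-$K\%$ masking evaluation summarized in the Figure 3 caption can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's protocol: the helper names (`top_k_percent_indices`, `mask_features`, `performance_drop`), the choice to round the masked count up, the zero mask value, and the stand-in `toy_model` are all hypothetical. It shows the general idea of ranking features by an attribution score, masking the top fraction, and measuring how much the model score falls.

```python
import math
from typing import Callable, List, Sequence


def top_k_percent_indices(scores: Sequence[float], k_percent: float) -> List[int]:
    """Return indices of the top-K% highest-scoring features (count rounded up)."""
    n_mask = math.ceil(len(scores) * k_percent / 100.0)
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return sorted(ranked[:n_mask])


def mask_features(
    features: Sequence[float], indices: Sequence[int], mask_value: float = 0.0
) -> List[float]:
    """Copy the features and overwrite the selected positions with mask_value."""
    masked = list(features)
    for idx in indices:
        masked[idx] = mask_value
    return masked


def performance_drop(
    model_fn: Callable[[Sequence[float]], float],
    features: Sequence[float],
    scores: Sequence[float],
    k_percent: float,
) -> float:
    """Score on the original input minus score after masking the top-K% features."""
    selected = top_k_percent_indices(scores, k_percent)
    return model_fn(features) - model_fn(mask_features(features, selected))


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a model score: the sum of feature values."""
    return sum(features)


features = [0.1, 2.0, 0.3, 1.5, 0.2]
good_attribution = features                   # ranks truly important features first
poor_attribution = [5.0, 0.0, 4.0, 0.0, 3.0]  # ranks unimportant features first

drop_good = performance_drop(toy_model, features, good_attribution, 40)
drop_poor = performance_drop(toy_model, features, poor_attribution, 40)
print(drop_good, drop_poor)
```

A more faithful attribution produces a larger drop under the same masking budget, which is the comparison the figure reports across attribution methods.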