Feature-level Interaction Explanations in Multimodal Transformers

Yeji Kim; Housam Khalifa Bashier Babiker; Mi-Young Kim; Randy Goebel

Abstract

Multimodal Transformers often produce predictions without clarifying how different modalities jointly support a decision. Most existing multimodal explainable AI (MXAI) methods extend unimodal saliency to multimodal backbones, highlighting important tokens or patches within each modality, but they rarely pinpoint which cross-modal feature pairs provide complementary evidence (synergy) or serve as reliable backups (redundancy). We present Feature-level I2MoE (FL-I2MoE), a structured Mixture-of-Experts layer that operates directly on token/patch sequences from frozen pretrained encoders and explicitly separates unique, synergistic, and redundant evidence at the feature level. We further develop an expert-wise explanation pipeline that combines attribution with top-K% masking to assess faithfulness, and we introduce Monte Carlo interaction probes to quantify pairwise behavior: the Shapley Interaction Index (SII) to score synergistic pairs and a redundancy-gap score to capture substitutable (redundant) pairs. Across three benchmarks (MM-IMDb, ENRICO, and MMHS150K), FL-I2MoE yields more interaction-specific and concentrated importance patterns than a dense Transformer with the same encoders. Finally, pair-level masking shows that removing pairs ranked by SII or redundancy gap degrades performance more than masking randomly chosen pairs under the same budget, which supports the claim that the identified interactions are causally relevant.
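To make the pairwise probe mentioned in the abstract more concrete, the following is a minimal sketch of a Monte Carlo estimate of the standard pairwise Shapley Interaction Index. It is not the authors' implementation: the function name `mc_sii`, the callable `value` (standing in for a model score computed with only the features in a coalition left unmasked), and the toy games below are all assumptions introduced for illustration. The redundancy-gap score is not reproduced here because this page does not define it.

```python
import random
from typing import Callable, FrozenSet, Sequence


def mc_sii(
    value: Callable[[FrozenSet[int]], float],  # hypothetical: score with only these features kept
    features: Sequence[int],
    i: int,
    j: int,
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the pairwise Shapley Interaction Index for (i, j)."""
    rng = random.Random(seed)
    rest = [f for f in features if f != i and f != j]
    total = 0.0
    for _ in range(num_samples):
        # Drawing the coalition size uniformly from 0..n-2 and then a uniform
        # subset of that size reproduces the SII weights |S|!(n-|S|-2)!/(n-1)!.
        size = rng.randint(0, len(rest))
        s = frozenset(rng.sample(rest, size))
        # Second-order discrete derivative of the value function at S.
        total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    return total / num_samples


# Toy value functions (assumptions for illustration only).
def and_game(s: FrozenSet[int]) -> float:
    # Features 0 and 1 are only useful together.
    return 1.0 if {0, 1} <= s else 0.0


def or_game(s: FrozenSet[int]) -> float:
    # Either feature 0 or feature 1 alone is sufficient.
    return 1.0 if (0 in s or 1 in s) else 0.0


def additive_game(s: FrozenSet[int]) -> float:
    # Every feature contributes independently.
    return float(len(s))


feats = [0, 1, 2, 3]
sii_and = mc_sii(and_game, feats, 0, 1)
sii_or = mc_sii(or_game, feats, 0, 1)
sii_add = mc_sii(additive_game, feats, 0, 1)
```

In these toy games the discrete derivative is constant across coalitions, so the estimate is exact: positive for the pair that only helps jointly, negative for the pair whose members can replace each other, and zero for independent contributions. With a real model, `value` would require a forward pass per coalition, so the sample count trades accuracy against compute.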

Paper Structure (27 sections, 8 equations, 5 figures, 2 tables)

Introduction
Related Work
Multimodal XAI and Feature Attribution
Quantifying Multimodal Contributions and Interactions
Interaction-aware Multimodal Architectures
Method
Architecture and Training
Overview of I$^2$MoE
Feature-level Encoders and Fusion Module
Inference-time Explanation Pipeline
Expert-wise Feature Attribution
Monte Carlo Estimation of Feature-level Interactions
Synergy via the Shapley Interaction Index.
Redundancy via a redundancy gap.
Experiments
...and 12 more sections

Figures (5)

Figure 1: Input modalities for the qualitative MM-IMDb example: (a) movie poster image and (b) plot summary (Four Minutes, true labels: drama, music).
Figure 2: Overview of the compared architectures and our analysis pipeline. (a) Dense Transformer with pooled modality encoders and a single fusion module. (b) Original I$^2$MoE with uniqueness, synergy, and redundancy experts operating on pooled modality vectors. (c) Proposed FL-I$^2$MoE: patch- and token-level encoders feed interaction experts, which output class logits for the main task. In the dashed region, at inference time, we derive expert-wise feature importance maps, propagate them to select cross-modal token-patch pairs, and then compute synergy and redundancy-gap scores to identify synergistic or redundant pairs.
Figure 3: Average performance drop when masking the top-$K\%$ features according to different attribution methods (Random, AttnRoll, IG, and Grad$\times$AttnRoll).
Figure 4: Alignment between expert-wise importance and Monte Carlo interaction metrics. Top: mean SII (synergy expert). Bottom: mean redundancy gap $R_{\text{red}}$ (redundancy expert). Bars compare importance bins (x-axis) and top-$q\%$ sets.
Figure 5: Qualitative comparison on an MM-IMDb example. (a) Dense Transformer with separate image and text saliency maps. (b) FL-I$^2$MoE. Top row: uni-text, uni-image, synergy, and redundancy experts. Bottom row: token-patch pairs by Monte Carlo interaction metrics for high synergy (SII) or redundancy gap.
