MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models

Zhanliang Wang, Kai Wang

TL;DR

MultiSHAP is a model-agnostic interpretability framework that uses the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements. It applies to both open- and closed-source models.

Abstract

Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. However, their "black-box" nature poses a major barrier to deployment in high-stakes applications where interpretability and trustworthiness are essential. Explaining cross-modal interactions in these models remains a major challenge. While existing model explanation methods, such as attention maps and Grad-CAM, offer coarse insights into cross-modal relationships, they cannot precisely quantify the synergistic effects between modalities, and they are limited to open-source models with accessible internal weights. Here we introduce MultiSHAP, a model-agnostic interpretability framework that leverages the Shapley Interaction Index to attribute multimodal predictions to pairwise interactions between fine-grained visual and textual elements (such as image patches and text tokens), while being applicable to both open- and closed-source models. Our approach provides: (1) instance-level explanations that reveal synergistic and suppressive cross-modal effects for individual samples - "why the model makes a specific prediction on this input", and (2) dataset-level explanations that uncover generalizable interaction patterns across samples - "how the model integrates information across modalities". Experiments on public multimodal benchmarks confirm that MultiSHAP faithfully captures cross-modal reasoning mechanisms, while real-world case studies demonstrate its practical utility. Our framework is extensible beyond two modalities, offering a general solution for interpreting complex multimodal AI models.
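
For reference, the standard pairwise Shapley Interaction Index that the abstract builds on can be written as follows. This is a sketch under assumptions, not a formula quoted from the paper: the player set $N$ is taken to be the $m$ image patches together with the $n$ text tokens, and $v(S)$ is taken to be the model output when only the elements in $S$ are left unmasked. The paper's exact player set and masking baseline may differ.

```latex
\Phi_{ij}
  = \sum_{S \subseteq N \setminus \{i,j\}}
    \frac{|S|!\,\bigl(|N| - |S| - 2\bigr)!}{\bigl(|N| - 1\bigr)!}
    \Bigl[ v\bigl(S \cup \{i,j\}\bigr) - v\bigl(S \cup \{i\}\bigr)
         - v\bigl(S \cup \{j\}\bigr) + v(S) \Bigr],
\qquad |N| = m + n .
```

The bracketed second difference is positive when patch $i$ and token $j$ contribute more together than separately (synergy) and negative when they contribute less (suppression). The weights sum to one over all coalitions $S$, which is what makes a sampling-based estimate possible.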

Paper Structure

This paper contains 44 sections, 9 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the MultiSHAP workflow. For a sample $k$, the input image is partitioned into $m$ patches and the text query into $n$ tokens. The model $f$ is evaluated on masked patch--token combinations, and MultiSHAP estimates a cross-modal interaction matrix $\boldsymbol{\Phi}^{(k)}\in\mathbb{R}^{m\times n}$, where $\Phi^{(k)}_{ij}$ denotes the Shapley interaction between image patch $i$ and text token $j$. Interactions are approximated via Monte Carlo sampling with $K$ coalitions per sample. The resulting matrix can be visualized as token-specific interaction heatmaps and aggregated cross-modal attribution maps (e.g., averaged over tokens). It can also be quantified using interaction-based interpretability metrics: the synergy ratio $R_k$ summarizes, for each instance, the relative dominance of synergistic (positive) versus suppressive (negative) interactions; at the dataset level, the Mean Synergy Ratio (MSR) measures the average tendency toward synergistic interactions, and the Synergy Dominance Ratio (SDR) reports the fraction of samples in which synergy outweighs suppression. Positive (red) and negative (blue) values indicate synergistic and suppressive cross-modal interactions, respectively.
  • Figure 2: MultiSHAP reveals distinct cross-modal interaction patterns. Each heatmap visualizes patch--token interactions for one sample. Synergistic interactions (positive) highlight evidence that mutually reinforces across modalities, whereas suppressive interactions (negative) indicate conflicting evidence. In (a), synergy concentrates on diagnostically relevant facial regions and yields a correct rare-disease prediction. In (b), interactions emphasize less informative regions, corresponding to an incorrect diagnosis. In (c), suppression helps downweight misleading visual cues and supports a correct VQA answer, while in (d) spurious synergy with irrelevant objects contributes to failure. See Appendix \ref{app:token_heatmaps} for additional examples (Example 3 and Example 6) and Supplementary Information for token-wise analysis.
  • Figure 3: MultiSHAP captures semantic alignment in image--text retrieval. Each panel contrasts the interaction patterns induced by a ground-truth (GT) caption versus a foil caption describing a mismatched object. In (a) and (c), GT captions yield concentrated synergistic interactions on the correct visual evidence (e.g., banana, onions). In (b) and (d), foil captions induce suppressive interactions over the true object regions, indicating semantic mismatch. Together, these examples illustrate how MultiSHAP differentiates aligned versus misaligned image--text pairs through patch--token interactions.
  • Figure 4: MultiSHAP metrics correlate with phenotypic distinctiveness across rare disease cohorts. UMAP visualization of patient image embeddings from three rare disease cohorts in the GestaltMatcher Database. Distinct clustering is observed for Cornelia de Lange syndrome (CdLS), Noonan syndrome, and Angelman syndrome. Dataset-level MultiSHAP statistics (inset) show that Mean Synergy Ratio (MSR) and Synergy Dominance Ratio (SDR) increase with phenotypic distinctiveness: CdLS (most distinctive facial features) exhibits the strongest multimodal synergy (MSR $= 0.61$, SDR $= 0.57$), followed by Noonan syndrome (MSR $= 0.59$, SDR $= 0.56$), while Angelman syndrome (least distinctive facial morphology) shows the weakest synergy (MSR $= 0.54$, SDR $= 0.53$). This correlation suggests that cross-modal interactions contribute more strongly to predictions when facial phenotypes are more informative.
  • Figure S5: Token-level interaction heatmaps for VQA Example 3. Question: "What is on the plates?" Answer: "breakfast" (correct). This successful case demonstrates ideal synergistic patterns where content words create strong positive interactions with semantically relevant food regions, while spatial tokens properly bind objects to locations.
  • ...and 7 more figures
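
The workflow described in the Figure 1 caption (mask patch--token combinations, estimate $\boldsymbol{\Phi}^{(k)}$ by Monte Carlo sampling with $K$ coalitions, then summarize with $R_k$, MSR and SDR) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the boolean "keep" mask interface, the toy scorer, and in particular the metric formulas. The captions do not define $R_k$, so the sketch assumes $R_k$ is the positive interaction mass divided by the total absolute interaction mass, MSR is the mean of $R_k$, and SDR is the fraction of samples with $R_k > 0.5$.

```python
import numpy as np


def interaction_matrix(f, m, n, K=32, seed=0):
    """Monte Carlo estimate of an m x n patch-token Shapley interaction matrix.

    f(keep) is a hypothetical black-box scorer: `keep` is a boolean vector of
    length m + n (patches first, then tokens; True = element left unmasked).
    """
    rng = np.random.default_rng(seed)
    p = m + n
    phi = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            a, b = i, m + j  # player indices of patch i and token j
            others = np.array([q for q in range(p) if q not in (a, b)], dtype=int)
            total = 0.0
            for _ in range(K):
                # A uniform coalition size followed by a uniform subset of that
                # size reproduces the Shapley interaction weights over S.
                size = rng.integers(0, p - 1)  # sizes 0 .. p-2
                keep = np.zeros(p, dtype=bool)
                keep[rng.choice(others, size=size, replace=False)] = True
                v_s = f(keep)        # v(S)
                keep[a] = True
                v_sa = f(keep)       # v(S + patch)
                keep[b] = True
                v_sab = f(keep)      # v(S + patch + token)
                keep[a] = False
                v_sb = f(keep)       # v(S + token)
                total += v_sab - v_sa - v_sb + v_s
            phi[i, j] = total / K
    return phi


def synergy_ratio(phi):
    """Assumed definition of R_k: positive mass / total absolute mass."""
    pos = phi[phi > 0].sum()
    neg = -phi[phi < 0].sum()
    return 0.5 if pos + neg == 0 else pos / (pos + neg)


def dataset_metrics(phis):
    """Assumed definitions: MSR = mean R_k, SDR = share of samples with R_k > 0.5."""
    r = np.array([synergy_ratio(phi) for phi in phis])
    return r.mean(), (r > 0.5).mean()


# Toy scorer with known pairwise interactions W (3 patches, 2 tokens).
W = np.array([[1.0, -0.5], [0.0, 2.0], [-1.0, 0.25]])


def toy_model(keep):
    img, txt = keep[:3].astype(float), keep[3:].astype(float)
    return 0.3 * img.sum() + 0.1 * txt.sum() + img @ W @ txt


phi = interaction_matrix(toy_model, m=3, n=2)
print(np.round(phi, 3))
print(synergy_ratio(phi))
```

For this toy scorer the second difference equals the corresponding entry of `W` for every coalition, so the estimate recovers `W` up to floating-point error regardless of `K`. A real model has no such closed form, and each patch--token pair costs `4 * K` model evaluations in this naive loop, so a practical implementation would likely share coalitions across pairs or batch the masked inputs.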