Measuring Cross-Modal Interactions in Multimodal Models

Laura Wenderoth, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

TL;DR

InterSHAP introduces a model-agnostic cross-modal interaction score based on the Shapley Interaction Index to quantify how multiple data modalities jointly influence predictions. It delivers both global and local explanations and supports any number of modalities, including unlabelled data, by measuring the ratio of cross-modal interactions to overall model behaviour. The method is validated on synthetic high-dimensional XOR data and SUM datasets, as well as real healthcare datasets (multimodal single-cell and MIMIC-III), demonstrating accurate detection of interactions and meaningful modality contributions. An open-source implementation integrates with the SHAP package, enabling reproducible, interpretable analysis to diagnose and improve multimodal models in clinical settings.

Abstract

Integrating AI in healthcare can greatly improve patient care and system efficiency. However, the lack of explainability in AI systems (XAI) hinders their clinical adoption, especially in multimodal settings that use increasingly complex model architectures. Most existing XAI methods focus on unimodal models, which fail to capture cross-modal interactions crucial for understanding the combined impact of multiple data sources. Existing methods for quantifying cross-modal interactions are limited to two modalities, rely on labelled data, and depend on model performance. This is problematic in healthcare, where XAI must handle multiple data sources and provide individualised explanations. This paper introduces InterSHAP, a cross-modal interaction score that addresses the limitations of existing approaches. InterSHAP uses the Shapley interaction index to precisely separate and quantify the contributions of the individual modalities and their interactions without approximations. By integrating an open-source implementation with the SHAP package, we enhance reproducibility and ease of use. We show that InterSHAP accurately measures the presence of cross-modal interactions, can handle multiple modalities, and provides detailed explanations at a local level for individual samples. Furthermore, we apply InterSHAP to multimodal medical datasets and demonstrate its applicability for individualised explanations.
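
For reference, the pairwise Shapley interaction index that the abstract relies on is usually written in the form below. This is the standard Grabisch and Roubens definition. The symbols $M$ (the set of modalities) and $v$ (the model output when only a given subset of modalities is left unmasked) are notation chosen here for illustration, and the paper's own normalisation may differ.

$$
I_{ij}(v) = \sum_{S \subseteq M \setminus \{i,j\}} \frac{|S|!\,(|M|-|S|-2)!}{(|M|-1)!}\,\bigl[\, v(S \cup \{i,j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S) \,\bigr]
$$

With exactly two modalities the sum has a single term ($S = \emptyset$, weight 1), so $I_{12} = v(\{1,2\}) - v(\{1\}) - v(\{2\}) + v(\emptyset)$.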

Paper Structure

This paper contains 44 sections, 11 equations, 3 figures, 13 tables, 2 algorithms.

Figures (3)

  • Figure 1: Overview of InterSHAP. The model (black box) takes three different modalities as input and produces an output $f(x)$. By perturbing the input modalities and observing the resulting changes in the output, the Shapley interaction index (Grabisch and Roubens, 1999) is used to dissect the model's behaviour into modality contributions and cross-modal interactions. InterSHAP is defined as the ratio of interactions to model behaviour.
  • Figure 2: Visualisation of InterSHAP using the SHAP package integration (Lundberg and Lee, 2017). Results are shown for an FCNN with early fusion on the HD-XOR datasets with two modalities. The x-axes show predicted class probabilities, with a baseline of approximately 0.5 because the task is binary classification. M1 denotes modality 1, M2 modality 2, and I the interactions.
  • Figure 3: SHAP visualisations of the interactions and modality contributions computed by InterSHAP for the FCNN early fusion model trained on the multimodal single-cell dataset. (a) Force plot of the model's behaviour on the whole dataset, with the x-axis showing the class probability of the highest-probability class. (b)-(d) Breakdown by predicted class.
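
The ratio described in the Figure 1 caption can be illustrated with a minimal sketch for the two-modality case. This is not the authors' implementation and does not use the SHAP package: the function names `decompose_two_modalities` and `interaction_ratio`, masking a modality by replacing it with a baseline vector, and aggregating over samples by mean absolute value are all assumptions made here for illustration. The three-modality setting shown in Figure 1 would need the full subset sum of the Shapley interaction index.

```python
import numpy as np


def decompose_two_modalities(f, X1, X2, b1, b2):
    """Per-sample main effects and interaction for a two-modality model.

    f maps arrays of shape (n, d1) and (n, d2) to predictions of shape (n,).
    b1 and b2 are baseline vectors that stand in for a masked modality
    (an assumption of this sketch, not necessarily the paper's masking).
    """
    B1 = np.broadcast_to(b1, X1.shape)
    B2 = np.broadcast_to(b2, X2.shape)

    f_both = f(X1, X2)   # both modalities present
    f_only1 = f(X1, B2)  # modality 2 masked
    f_only2 = f(B1, X2)  # modality 1 masked
    f_none = f(B1, B2)   # both masked (baseline prediction)

    m1 = f_only1 - f_none                       # modality 1 alone
    m2 = f_only2 - f_none                       # modality 2 alone
    inter = f_both - f_only1 - f_only2 + f_none  # pairwise interaction term
    return m1, m2, inter


def interaction_ratio(f, X1, X2, b1=None, b2=None):
    """Share of model behaviour attributed to the cross-modal interaction."""
    # Hypothetical default: mask a modality with its dataset mean.
    b1 = X1.mean(axis=0) if b1 is None else np.asarray(b1, dtype=float)
    b2 = X2.mean(axis=0) if b2 is None else np.asarray(b2, dtype=float)

    m1, m2, inter = decompose_two_modalities(f, X1, X2, b1, b2)

    # Aggregation by mean absolute value is a choice of this sketch.
    parts = np.array([np.abs(m1).mean(), np.abs(m2).mean(), np.abs(inter).mean()])
    total = parts.sum()
    return 0.0 if total == 0 else float(parts[2] / total)


if __name__ == "__main__":
    X1 = np.array([[1.0], [2.0]])
    X2 = np.array([[3.0], [4.0]])

    def additive(a, b):
        # No cross-modal interaction by construction.
        return a.sum(axis=1) + b.sum(axis=1)

    def product(a, b):
        # Purely interactive when the baseline is zero.
        return a[:, 0] * b[:, 0]

    print(interaction_ratio(additive, X1, X2))
    print(interaction_ratio(product, X1, X2, b1=[0.0], b2=[0.0]))
```

By construction the three terms sum to the difference between the full prediction and the baseline prediction, so the ratio compares the interaction against everything the model does relative to that baseline. An additive model yields a ratio of zero, and a model that depends only on the product of the two modalities (with a zero baseline) yields a ratio of one.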