Measuring Cross-Modal Interactions in Multimodal Models
Laura Wenderoth, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
TL;DR
InterSHAP is a model-agnostic cross-modal interaction score, based on the Shapley Interaction Index, that quantifies how multiple data modalities jointly influence a model's predictions. It provides both global and local explanations, supports any number of modalities, and works on unlabelled data by measuring the ratio of cross-modal interactions to overall model behaviour. The method is validated on synthetic high-dimensional XOR and SUM datasets and on real healthcare datasets (multimodal single-cell and MIMIC-III), where it detects interactions accurately and yields meaningful modality contributions. An open-source implementation integrates with the SHAP package, enabling reproducible, interpretable analysis for diagnosing and improving multimodal models in clinical settings.
Abstract
Integrating AI into healthcare can greatly improve patient care and system efficiency. However, the lack of explainability in AI systems hinders their clinical adoption, especially in multimodal settings that use increasingly complex model architectures. Most existing explainable AI (XAI) methods focus on unimodal models and therefore fail to capture the cross-modal interactions that are crucial for understanding the combined impact of multiple data sources. Existing methods for quantifying cross-modal interactions are limited to two modalities, rely on labelled data, and depend on model performance. This is problematic in healthcare, where XAI must handle multiple data sources and provide individualised explanations. This paper introduces InterSHAP, a cross-modal interaction score that addresses these limitations. InterSHAP uses the Shapley interaction index to precisely separate and quantify the contributions of the individual modalities and their interactions without approximations. By integrating an open-source implementation with the SHAP package, we enhance reproducibility and ease of use. We show that InterSHAP accurately measures the presence of cross-modal interactions, can handle multiple modalities, and provides detailed explanations at a local level for individual samples. Furthermore, we apply InterSHAP to multimodal medical datasets and demonstrate its applicability for individualised explanations.
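To make the idea of a ratio of cross-modal interactions to overall model behaviour concrete, the following is a minimal sketch for the two-modality case, not the released InterSHAP implementation. It assumes that a modality is "removed" by replacing it with a fixed baseline input, that the model returns a scalar, and that the score is the absolute interaction term divided by the sum of absolute main effects and interaction. The function name `cross_modal_interaction_ratio` and its parameters are hypothetical. With two modalities, the Shapley interaction index between them reduces to `f(x1, x2) - f(x1, b2) - f(b1, x2) + f(b1, b2)`, and no labels are needed.

```python
from typing import Callable


def cross_modal_interaction_ratio(
    model: Callable[[float, float], float],
    x1: float,
    x2: float,
    base1: float,
    base2: float,
) -> float:
    """Share of a single prediction attributed to the interaction of two modalities.

    Hypothetical helper: masking with a baseline is an assumption of this sketch.
    """
    # Model output for every subset of present modalities (absent = baseline).
    v_none = model(base1, base2)
    v_1 = model(x1, base2)
    v_2 = model(base1, x2)
    v_12 = model(x1, x2)

    # Main effect of each modality on its own.
    main_1 = v_1 - v_none
    main_2 = v_2 - v_none
    # Two-player Shapley interaction index: what is left after removing main effects.
    interaction = v_12 - v_1 - v_2 + v_none

    total = abs(main_1) + abs(main_2) + abs(interaction)
    if total == 0.0:
        return 0.0  # the model does not react to either modality
    return abs(interaction) / total


# Toy models standing in for the XOR-like and SUM-like extremes.
product_model = lambda a, b: a * b          # output depends only on the interaction
additive_model = lambda a, b: a + b         # output has no interaction at all

pure_interaction = cross_modal_interaction_ratio(product_model, 2.0, 3.0, 0.0, 0.0)   # 1.0
no_interaction = cross_modal_interaction_ratio(additive_model, 2.0, 3.0, 0.0, 0.0)    # 0.0
```

The score here is local, computed for one sample. Extending the sketch to more than two modalities requires the general Shapley interaction index over all modality subsets, which this example does not cover.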
