What are You Looking at? Modality Contribution in Multimodal Medical Deep Learning

Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke Ruth Gizewski, Rainer Schubert

TL;DR

This work introduces a model- and performance-agnostic occlusion-based metric to quantify the contribution of each modality in multimodal medical deep learning. By masking inputs and measuring how the model output changes, it yields per-modality ($m_i$) and per-patch per-modality ($mp_i^l$) importance, applicable across arbitrary architectures. Applied to three medical tasks—Chest X-Ray with clinical text, BRSET, and Hecktor 22—the method reveals cases of balanced multimodal processing as well as unimodal collapse, and highlights clinically relevant features driving decisions. The results provide a tool to guide architecture choice and dataset design, aiming to improve trustworthiness and integration of multimodal AI in clinical practice, with code publicly available at the cited repository.

Abstract

Purpose: High-dimensional, multimodal data can now be analyzed by very large deep neural networks with little effort. Several fusion methods for combining different modalities have been developed. Given the prevalence of high-dimensional, multimodal patient data in medicine, the development of multimodal models marks a significant advancement. However, how these models process information from the individual sources is still underexplored.

Methods: To this end, we implemented an occlusion-based modality contribution method that is both model- and performance-agnostic. This method quantitatively measures the importance of each modality in the dataset for the model to fulfill its task. We applied our method to three different multimodal medical problems for experimental purposes.

Results: We found that some networks have modality preferences that tend toward unimodal collapse, while some datasets are imbalanced from the ground up. Moreover, we provide fine-grained quantitative and visual attribute importance for each modality.

Conclusion: Our metric offers valuable insights that can support the advancement of multimodal model development and dataset creation. By introducing this method, we contribute to the growing field of interpretability in deep learning for multimodal research. This approach helps to facilitate the integration of multimodal AI into clinical practice. Our code is publicly available at https://github.com/ChristianGappGit/MC_MMD.
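
The occlusion idea described above (mask one modality, measure how much the model output changes, compare across modalities) can be illustrated with a minimal sketch. This is not the authors' implementation, which is available in the linked repository. The function name `modality_contributions`, the zero mask value, the absolute-difference measure, and the normalization to a sum of 1 are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical types: a sample maps a modality name to a flat feature list,
# and a model maps a sample to a list of output scores.
Sample = Dict[str, List[float]]
Model = Callable[[Sample], List[float]]


def modality_contributions(
    model: Model, samples: List[Sample], mask_value: float = 0.0
) -> Dict[str, float]:
    """Occlude each modality in turn and compare the output shift.

    Assumption: contribution = summed absolute output change, normalized
    so that all modalities sum to 1. The paper's exact metric may differ.
    """
    modalities = list(samples[0].keys())
    raw = {name: 0.0 for name in modalities}
    for sample in samples:
        baseline = model(sample)
        for name in modalities:
            # Copy the sample and replace one modality by the mask value.
            occluded = dict(sample)
            occluded[name] = [mask_value] * len(sample[name])
            output = model(occluded)
            raw[name] += sum(abs(b - o) for b, o in zip(baseline, output))
    total = sum(raw.values())
    if total == 0.0:
        # Illustrative choice: no output change at all is reported as uniform.
        return {name: 1.0 / len(modalities) for name in modalities}
    return {name: value / total for name, value in raw.items()}


def toy_model(sample: Sample) -> List[float]:
    # Toy stand-in for a fused network: vision weighs three times as much as text.
    return [3.0 * sum(sample["vision"]) + 1.0 * sum(sample["text"])]


data = [{"vision": [1.0, 1.0], "text": [1.0, 1.0]}]
print(modality_contributions(toy_model, data))
# → {'vision': 0.75, 'text': 0.25}
```

Because the sketch only calls the model as a black box and never looks at labels, it reflects the model- and performance-agnostic character of the approach. Masking a single patch or token at a time instead of a whole modality would give a per-patch variant in the same spirit.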

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Chest X-Ray + Clinical Report. Example item CXR1897_IM-0581-1001. Disease: support devices. Orange words (labels) removed during the preprocessing step.
  • Figure 2: BRSET. Image img01468. Preprocessing for the training routine. The source images were normalized with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. Note that black parts in the transformed image inside the eye are still distinguishable by the model.
  • Figure 3: CXR1897_IM-0581-1001: Correctly predicted disease: support devices. Modality contribution vision : text = 0.24 : 0.76. Model: ViTLLAMA II. From blue to red the contribution (low to high) from a single patch (vision) or word (text) to the task is highlighted. Top, left to right: source image, GradCAM, class specific Occlusion Sensitivity for class support devices (MONAI), Occlusion Sensitivity averaged over all classes (CG, i.e. ours). The red patch in the upper right area in image Occ. sens. (MONAI) has the highest contribution to the class support devices. The same area is colored blue in image Occ. sens. (CG), as this patch has the lowest average contribution to all classes. Bottom: Text. MEAN: The words no and acute have the highest average contribution, catheter has the lowest. MAX: catheter has the highest contribution to one class: support devices.
  • Figure 4: img03501: Correctly predicted disease: drusens. Modality contribution vision : tabular = 0.95 : 0.05. Model: ResNet-MLP. Importance (low to high) is colored from blue to red. Top, left to right: source image, GradCAM, class-specific Occlusion Sensitivity for class drusens (MONAI), Occlusion Sensitivity averaged over all classes (CG, i.e. ours). Bottom: tabular data with the attributes patient age, comorbidities, diabetes time, insulin use, patient sex, exam eye, and diabetes, from left to right. MEAN: The patient's age has the highest contribution on average, patient sex the lowest. MAX: The patient's age is the most significant attribute for one class: drusens.
  • Figure 5: 3D plot: $h$(vision) vs. $h$(text) vs. $m$ for both modalities, vision and text.
  • ...and 3 more figures
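
The per-channel normalization mentioned in the Figure 2 caption can be worked through for a single pixel. The sketch below assumes RGB values already scaled to [0, 1], which the caption does not state. The helper `normalize_pixel` is a hypothetical name, and only the mean and std values come from the caption.

```python
# Mean and std per RGB channel, as given in the Figure 2 caption.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel; assumes inputs in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, MEAN, STD)]


# A black source pixel does not stay at zero after normalization:
# each channel lands on a different negative value.
black = normalize_pixel([0.0, 0.0, 0.0])
print([round(v, 3) for v in black])
```

Under this assumption, a black pixel becomes a distinct negative value in each channel rather than zero, so black regions in the source image still carry a consistent, nonzero signal in the transformed input.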