MDA: An Interpretable and Scalable Multi-Modal Fusion under Missing Modalities and Intrinsic Noise Conditions

Lin Fan; Yafei Ou; Cenyang Zheng; Pengyu Dai; Tamotsu Kamishima; Masayuki Ikebe; Kenji Suzuki; Xun Gong

TL;DR

The observations on the contribution of different modalities indicate that MDA aligns with established clinical diagnostic imaging gold standards and holds promise as a reference for pathologies where these standards are not yet clearly defined.

Abstract

Multi-modal learning has shown exceptional performance in various tasks, especially in medical applications, where it integrates diverse medical information into comprehensive diagnostic evidence. However, several challenges remain in multi-modal learning: (1) heterogeneity between modalities, (2) uncertainty from missing modalities, (3) the influence of intrinsic noise, and (4) interpretability of the fusion result. This paper introduces the Modal-Domain Attention (MDA) model to address these challenges. MDA constructs linear relationships between modalities through continuous attention. Because it adaptively allocates dynamic attention to different modalities, MDA can reduce attention to low-correlation data, missing modalities, or modalities with inherent noise, thereby maintaining SOTA performance across various tasks on multiple public datasets. Furthermore, our observations on the contribution of different modalities indicate that MDA aligns with established clinical diagnostic imaging gold standards and holds promise as a reference for pathologies where these standards are not yet clearly defined. The code and dataset will be made available.
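To make the idea of adaptively down-weighting a modality concrete, here is a minimal sketch of attention-weighted fusion in which a missing modality receives zero weight. It is an illustration only, not the MDA formulation: the softmax normalisation, the function name `fuse_with_modal_attention`, and the scalar per-modality scores are all assumptions, since this summary does not give the paper's continuous attention equations.

```python
import math
from typing import List, Optional, Sequence, Tuple


def fuse_with_modal_attention(
    features: Sequence[Optional[Sequence[float]]],
    scores: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Fuse per-modality feature vectors with softmax attention weights.

    Hypothetical helper for illustration; not taken from the paper.
    A missing modality is passed as None and always gets weight 0.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per modality")
    present = [i for i, f in enumerate(features) if f is not None]
    if not present:
        raise ValueError("at least one modality must be present")

    # Softmax over the present modalities only (max subtracted for stability).
    top = max(scores[i] for i in present)
    exps = {i: math.exp(scores[i] - top) for i in present}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(features))]

    # Weighted sum of the available feature vectors.
    dim = len(features[present[0]])
    fused = [0.0] * dim
    for i in present:
        if len(features[i]) != dim:
            raise ValueError("all present modalities must share one dimension")
        for d in range(dim):
            fused[d] += weights[i] * features[i][d]
    return fused, weights


# Usage: the third modality is missing, so its score is ignored.
image, text = [1.0, 0.0], [0.0, 1.0]
fused, weights = fuse_with_modal_attention([image, text, None], [0.0, 0.0, 5.0])
# weights -> [0.5, 0.5, 0.0]; fused -> [0.5, 0.5]
```

In this sketch the weights double as a simple per-modality contribution readout, which is the kind of quantity an interpretability analysis could inspect. A noisy or weakly correlated modality would be handled by a low score rather than a mask.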

Paper Structure (14 sections, 8 equations, 4 figures, 6 tables)

Introduction
Methodology
Image modalities
Text modalities
Audio modalities
Continuous attention mechanism
Objective function
Training setting
Datasets
Experiments and discussion
The efficacy of MDA in confronting the three key challenges of multi-modal fusion
Comprehensive comparison across different databases
Interpretability analysis
Conclusion

Figures (4)

Figure 1: A unified multi-modal learning strategy involves learning with different multi-modal configurations. (a) Train and test with full modality: different modalities receive equal attention. (b) The model reduces its modal-domain attention when learning with missing modalities or intrinsic noise.
Figure 2: Existing methods vs. MDA for missing modalities. Generation-based methods [cui2022survival] add specific models (Gen-1, Gen-2) for missing modalities, and prompt-based methods [lee2023multimodal] similarly add different prompts (Pr-1, Pr-2, Pr-3); both increase the number of network parameters and the difficulty of training. This work instead builds linear attention relationships between modalities, adaptively adjusting their weights in real time and handling the various missing-modality scenarios more efficiently.
Figure 3: Overview of the proposed network framework. The uni-modal training involves building pre-trained models for multi-disease classification across different modalities, with a dedicated classifier assigned to each disease. The work introduces the weight calculation for each scenario under the proposed modal-domain attention module.
Figure 4: Macroscopic investigation of MDA weights for various diseases.
