
DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency

Wenfang Yao, Kejing Yin, William K. Cheung, Jia Liu, Jing Qin

TL;DR

This work tackles the missing modality issue by disentangling the features shared across modalities from those unique to each modality. It addresses the modal inconsistency issue via a disease-wise attention layer that produces patient- and disease-wise weights for each modality when making the final prediction.

Abstract

The combination of electronic health records (EHR) and medical images is crucial for clinicians in making diagnoses and forecasting prognosis. Strategically fusing these two data modalities has great potential to improve the accuracy of machine learning models in clinical prediction tasks. However, the asynchronous and complementary nature of EHR and medical images presents unique challenges. Missing modalities due to clinical and administrative factors are inevitable in practice, and the significance of each data modality varies depending on the patient and the prediction target, resulting in inconsistent predictions and suboptimal model performance. To address these challenges, we propose DrFuse to achieve effective clinical multi-modal fusion. It tackles the missing modality issue by disentangling the features shared across modalities and those unique within each modality. Furthermore, we address the modal inconsistency issue via a disease-wise attention layer that produces the patient- and disease-wise weighting for each modality to make the final prediction. We validate the proposed method using real-world large-scale datasets, MIMIC-IV and MIMIC-CXR. Experimental results show that the proposed method significantly outperforms the state-of-the-art models. Our implementation is publicly available at https://github.com/dorothy-yao/drfuse.

Paper Structure (24 sections, 16 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: The overview of the proposed model, DrFuse. It consists of two major components. Subfigure (a): A shared representation and a distinct representation are learned from EHR and CXR, where the shared ones are aligned by minimizing the Jensen–Shannon divergence (JSD). A novel logit pooling is proposed to fuse the shared representations. Subfigure (b): The disease-aware attention fusion module captures the patient-specific modal significance for different prediction targets by minimizing a ranking loss.
  • Figure 2: Data flow in the disentangled representation learning module when the CXR modality is missing. The shared representation extracted from EHR will be directly used as $\mathbf{h}_{\text{shared}}$. Inactive components and loss terms are grayed out.
  • Figure 3: t-SNE visualization of distinct and shared features for the test set in the matched subset. DrFuse aligns the distributions of the EHR and CXR shared representations well and disentangles the distinct representations.
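
The Figure 1 caption states that the shared representations are aligned by minimizing the Jensen–Shannon divergence (JSD). The following is a minimal sketch of the JSD computation itself, not the authors' implementation. It assumes each shared representation is turned into a probability distribution with a softmax, which is our assumption and is not stated in the text above. The vectors `ehr_shared` and `cxr_shared` are hypothetical values chosen only for illustration.

```python
import math


def softmax(values):
    """Turn a list of real numbers into a probability distribution."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero.

    Assumes q_i > 0 wherever p_i > 0, which always holds when q is the
    mixture used inside js_divergence below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical shared-representation vectors for one patient (made-up numbers).
ehr_shared = [0.2, 1.5, -0.3, 0.8]
cxr_shared = [0.1, 1.2, -0.1, 1.0]

p = softmax(ehr_shared)  # assumption: softmax maps features to a distribution
q = softmax(cxr_shared)
jsd = js_divergence(p, q)
print(f"JSD between the two shared representations: {jsd:.6f}")
```

The JSD is zero when the two distributions are identical and reaches its maximum of ln 2 when they have disjoint support, so driving it toward zero pulls the two shared representations toward the same distribution. In a training setting this quantity would be computed with a differentiable tensor library rather than plain Python lists.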

Theorems & Definitions (1)

  • Definition 1: Logit Pooling