AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis

Qiuhui Chen, Yi Hong

TL;DR

Alifuse addresses the challenge of integrating imaging and non-imaging clinical data for computer-aided diagnosis by introducing a cross-modal alignment and fusion framework built on transformers. It jointly optimizes image-text contrastive alignment and cross-modal restoration, enabling robust feature fusion even with missing non-imaging data. Evaluated on five public AD datasets totaling 27.8K image volumes, Alifuse achieves state-of-the-art Alzheimer's disease classification and offers interpretable attention-based insights into modality interactions. The approach generalizes to additional modalities and diseases, providing a practical path toward universal multimodal representations for medical diagnosis.

Abstract

Medical data collected for diagnostic decisions are typically multimodal, providing comprehensive information on a subject. While computer-aided diagnosis systems can benefit from multimodal inputs, effectively fusing such data remains a challenging task and a key focus in medical research. In this paper, we propose a transformer-based framework, called Alifuse, for aligning and fusing multimodal medical data. Specifically, we convert medical images and both unstructured and structured clinical records into vision and language tokens, employing intramodal and intermodal attention mechanisms to learn unified representations of all imaging and non-imaging data for classification. Additionally, we integrate restoration modeling with contrastive learning frameworks, jointly learning the high-level semantic alignment between images and texts and the low-level understanding of one modality with the help of another. We apply Alifuse to classify Alzheimer's disease, achieving state-of-the-art performance on five public datasets and outperforming eight baselines.
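As a rough illustration of the "high-level semantic alignment between images and texts" mentioned in the abstract, the following is a minimal sketch of a generic symmetric image-text contrastive (InfoNCE-style) loss. It is not taken from the paper: the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration, and the actual Alifuse objective may differ in its details.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stabilised log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; the i-th image and i-th text are treated as the positive pair.

    The temperature default is an assumed value, not one reported in the source.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # sim[i][j]: cosine similarity between image i and text j, divided by the temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should peak on its diagonal entry
    i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry
    t2i = sum(cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch of two hypothetical image/text embedding pairs
image_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]      # text i resembles image i
mismatched_text = [[0.1, 0.9], [0.9, 0.1]]   # pairs swapped

loss_matched = contrastive_loss(image_batch, matched_text)
loss_mismatched = contrastive_loss(image_batch, mismatched_text)
```

In this toy batch the loss is lower when each text embedding points in the same direction as its paired image embedding, which is the behavior a contrastive alignment objective rewards.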

Paper Structure

This paper contains 21 sections, 5 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: The architecture overview of our proposed model Alifuse, a CAD system designed for medical diagnosis using electronic health records through multimodal alignment and fusion. The cross-modal alignment (CMA) is a key component of Alifuse, which is illustrated in Fig. 2.
  • Figure 2: The architecture of the cross-modal alignment (CMA) module.
  • Figure 3: Visualization of attention maps using transformer interpretability techniques. (a) ADNI test dataset, (b) NACC test dataset, (c) AIBL dataset, and (d) OASIS2 dataset. The image heatmap uses a jet colormap, where red indicates high activation values (close to one) and blue indicates low activation values (close to zero). A blue colormap is employed for the text heatmap, with darker blue representing higher attention levels. (Best viewed in color.)
  • Figure 4: The UMAP visualization of generated image and text embeddings of AD and NC subjects from the ADNI test set. Black lines connect image-text pairs.