Alifuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen, Yi Hong
TL;DR
Alifuse addresses the challenge of integrating imaging and non-imaging clinical data for computer-aided diagnosis by introducing a transformer-based cross-modal alignment and fusion framework. It jointly optimizes image-text contrastive alignment and cross-modal restoration, enabling robust feature fusion even when non-imaging data are missing. Evaluated on five public Alzheimer's disease (AD) datasets totaling 27.8K image volumes, Alifuse achieves state-of-the-art AD classification and offers interpretable attention-based insights into modality interactions. The approach generalizes to additional modalities and diseases, providing a practical path toward universal multimodal representations for medical diagnosis.
Abstract
Medical data collected for diagnostic decisions are typically multimodal, providing comprehensive information on a subject. While computer-aided diagnosis systems can benefit from multimodal inputs, effectively fusing such data remains a challenging task and a key focus in medical research. In this paper, we propose a transformer-based framework, called Alifuse, for aligning and fusing multimodal medical data. Specifically, we convert medical images and both unstructured and structured clinical records into vision and language tokens, employing intramodal and intermodal attention mechanisms to learn unified representations of all imaging and non-imaging data for classification. Additionally, we integrate restoration modeling with contrastive learning frameworks, jointly learning the high-level semantic alignment between images and text and the low-level understanding of one modality with the help of another. We apply Alifuse to classify Alzheimer's disease, achieving state-of-the-art performance on five public datasets and outperforming eight baselines.
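To make the image-text contrastive alignment mentioned above more concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired image and text embeddings. It is not the authors' implementation: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for illustration, and the restoration objective is not covered here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Hypothetical helper: row i of image_embs is assumed to be paired with
    row i of text_embs; every other row in the batch acts as a negative.
    The temperature default is an assumption, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[sum(a * b for a, b in zip(iv, tv)) / temperature for tv in txts]
           for iv in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch of two subjects with 2-D embeddings (made-up values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]   # text i points toward image i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_matched = contrastive_alignment_loss(image_embs, matched_text)
loss_swapped = contrastive_alignment_loss(image_embs, swapped_text)
```

In this toy setup the correctly paired batch yields a lower loss than the swapped one, which is the behavior a contrastive alignment objective is meant to reward. In a real model the embeddings would come from the vision and language encoders rather than hand-written lists.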
