CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification

Guangqian Yang; Kangrui Du; Zhihan Yang; Ye Du; Yongping Zheng; Shujun Wang

TL;DR

CMViM tackles Alzheimer's disease classification from 3D multi-modal data by introducing a Vision Mamba-based masked autoencoder that jointly encodes T1-MRI and PET images. The framework combines intra-modal and inter-modal contrastive learning to improve discriminative power and alignment across modalities, enabling efficient long-range representation learning for high-resolution volumes. Pre-training on the ADNI2 dataset yields a 2.7 percentage point improvement in AUC over state-of-the-art methods, while reducing parameter counts relative to baselines like MultiMAE. The work demonstrates that integrating masked autoencoding with modality-specific contrastive cues enhances 3D multi-modal representations and downstream AD classification performance.

Abstract

Alzheimer's disease (AD) is an incurable neurodegenerative condition leading to cognitive and functional deterioration. Given the lack of a cure, prompt and precise AD diagnosis is vital, yet diagnosis is a complex process that depends on multiple factors and multi-modal data. While multi-modal representation learning has been successfully applied to medical datasets, scant attention has been given to 3D medical images. In this paper, we propose Contrastive Masked Vim Autoencoder (CMViM), the first efficient representation learning method tailored for 3D multi-modal data. Our proposed framework is built on a masked Vim autoencoder to learn a unified multi-modal representation and the long-range dependencies contained in 3D medical images. We also introduce an intra-modal contrastive learning module to enhance the capability of the multi-modal Vim encoder for modeling discriminative features within the same modality, and an inter-modal contrastive learning module to alleviate misaligned representations among modalities. Our framework consists of two main steps: 1) incorporating Vision Mamba (Vim) into the masked autoencoder to reconstruct 3D masked multi-modal data efficiently; 2) aligning the multi-modal representations with contrastive learning mechanisms from both intra-modal and inter-modal aspects. Our framework is pre-trained on the ADNI2 dataset and validated on the downstream task of AD classification. The proposed CMViM yields a 2.7% AUC performance improvement compared with other state-of-the-art methods.
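
To make the inter-modal alignment step more concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss between per-subject MRI and PET embeddings. The abstract does not give the exact loss, so the InfoNCE form, the function names (`info_nce`, `inter_modal_loss`), and the temperature of 0.07 are all assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    every other row of `positives` acts as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def inter_modal_loss(mri_emb, pet_emb, temperature=0.07):
    """Symmetric loss (assumed form): MRI-to-PET and PET-to-MRI directions averaged."""
    return 0.5 * (info_nce(mri_emb, pet_emb, temperature)
                  + info_nce(pet_emb, mri_emb, temperature))

# Toy batch of two subjects with hypothetical 2-D embeddings.
mri = [[1.0, 0.0], [0.0, 1.0]]
pet_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each PET matches its own MRI
pet_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each PET matches the other subject
loss_aligned = inter_modal_loss(mri, pet_aligned)
loss_swapped = inter_modal_loss(mri, pet_swapped)
```

In this toy batch the aligned pairs give a loss near zero, while the swapped pairs are penalised heavily, which is the behaviour an alignment objective of this kind is meant to produce.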

Paper Structure (20 sections, 10 equations, 1 figure, 2 tables)

Introduction
Method
Vision Mamba
Multi-modal Masked Vim autoencoder for 3D Representation Learning
Encoder
Decoder
Multi-modal Contrastive Learning
Intra-modal Contrastive Learning Module
Inter-modal Contrastive Learning Module
Training Procedure and Experimental Details
Pre-training
Finetuning
Experimental Results
Dataset Description
Data Pre-processing and Dataset division
...and 5 more sections

Figures (1)

Figure 1: Overview of our proposed Contrastive Masked Vim Autoencoder (CMViM). For pre-training, we build a multi-modality masked autoencoder based on Vim blocks for 3D medical visual representation learning. To strengthen its ability to capture disease-related features and to relieve the representation misalignment between modalities, we introduce an intra-modal contrastive learning module and an inter-modal contrastive learning module, respectively. The multi-modality Vim encoder pre-trained with CMViM is then finetuned for AD classification.
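
The masking step of the pre-training stage can be sketched as follows: a 3D volume is split into non-overlapping cubic patches, and a random subset is hidden from the encoder and left for the decoder to reconstruct. The volume shape, patch size, mask ratio, and function names here are hypothetical values chosen for illustration; this section does not state the settings the authors used.

```python
import random

def num_patches(volume_shape, patch_size):
    """Count non-overlapping cubic patches in a (depth, height, width) volume."""
    d, h, w = volume_shape
    if d % patch_size or h % patch_size or w % patch_size:
        raise ValueError("each volume dimension must be divisible by patch_size")
    return (d // patch_size) * (h // patch_size) * (w // patch_size)

def random_mask(n_patches, mask_ratio, rng):
    """Return (visible, masked) patch indices for one modality.

    Visible patches would be fed to the encoder; masked patches are the
    reconstruction targets for the decoder.
    """
    n_keep = max(1, int(round(n_patches * (1.0 - mask_ratio))))
    order = list(range(n_patches))
    rng.shuffle(order)
    return sorted(order[:n_keep]), sorted(order[n_keep:])

# Hypothetical settings: a 128^3 volume, 16^3 patches, 75% of patches masked.
rng = random.Random(0)
n = num_patches((128, 128, 128), 16)
mri_visible, mri_masked = random_mask(n, 0.75, rng)
pet_visible, pet_masked = random_mask(n, 0.75, rng)  # masked independently per modality
```

Masking each modality independently is one possible design; sharing a single mask across MRI and PET would be an equally simple variant.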
