Multi-modal Vision Pre-training for Medical Image Analysis

Shaohao Rui, Lingzhi Chen, Zhenyu Tang, Lilong Wang, Mianxin Liu, Shaoting Zhang, Xiaosong Wang

TL;DR

BrainMVP addresses the lack of cross-modality learning in medical self-supervised learning (SSL) by exploiting naturally grouped multi-parametric MRI (mpMRI) data. It introduces three proxy tasks (cross-modal reconstruction, modality-wise data distillation, and modality-aware contrastive learning) together with learnable modality templates that bridge pre-training and downstream tasks. Pre-trained on $16{,}022$ mpMRI scans from $3{,}755$ patients across $8$ modalities, BrainMVP improves Dice score by $0.28\%$ to $14.47\%$ on six segmentation benchmarks and accuracy by $0.65\%$ to $18.07\%$ on four classification tasks relative to state-of-the-art pre-training methods, with strong label efficiency demonstrated at $40\%$ of the labeled data. The distilled templates carry the learned cross-modal structure into downstream tasks, and the framework holds promise for scalable, privacy-conscious clinical deployment in multi-modal MRI analysis.

Abstract

Self-supervised learning has greatly facilitated medical image analysis by reducing the training data requirement for real-world applications. Current paradigms predominantly rely on self-supervision within uni-modal image data, thereby neglecting the inter-modal correlations essential for effective learning of cross-modal image representations. This limitation is particularly significant for naturally grouped multi-modal data, e.g., multi-parametric MRI scans for a patient undergoing various functional imaging protocols in the same study. To bridge this gap, we propose a novel multi-modal image pre-training approach with three proxy tasks, i.e., cross-modal image reconstruction, modality-aware contrastive learning, and modality template distillation, to facilitate the learning of cross-modality representations and correlations using multi-modal brain MRI scans (over 2.4 million images in 16,022 scans of 3,755 patients). To demonstrate the generalizability of our pre-trained model, we conduct extensive experiments on various benchmarks with ten downstream tasks. Our method outperforms state-of-the-art pre-training methods, with a Dice Score improvement of 0.28\%-14.47\% across six segmentation benchmarks and a consistent accuracy boost of 0.65\%-18.07\% in four individual image classification tasks.
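For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of the standard Dice coefficient on binary masks. It is not taken from the paper's code: the function name `dice_score`, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks.

    pred, truth: flat sequences of 0/1 values of equal length
    (a toy stand-in for flattened 3D segmentation masks).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# One of two predicted foreground voxels overlaps the ground truth.
score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
print(f"Dice = {score:.2f}")
```

A reported improvement in Dice Score corresponds to a higher value of this ratio, typically averaged over cases and classes.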

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Naturally grouped multi-modal data, e.g., multi-parametric MRI scans, exist in the real world. (b) We propose three proxy tasks to facilitate the learning of cross-modality representations and correlations. Blue cubes represent one modality in a study, green cubes represent another modality within the same study, and gray cubes with flame symbols represent learnable modality templates. (c) We apply the pre-trained model and distilled modality templates to downstream tasks.
  • Figure 2: Overview of the proposed BrainMVP, comprising (a) a cross-modal reconstruction module that learns a mapping from images masked with another modality back to the original; (b) a modality-wise data distillation module that learns condensed modality templates via gradient backpropagation; and (c) a modality-aware contrastive learning module that introduces study/case-level modality invariance to the learned features.
  • Figure 3: Label efficiency results of the downstream segmentation and classification tasks. We report the mean Dice Score (%) in segmentation and the area under the curve (AUC) in classification.
  • Figure 4: Visualization of distilled modality templates along the pre-training trajectories.
  • Figure 5: Modality-wise data distillation for downstream tasks. A randomly selected subset of modalities in the input multi-modal MRI scans is replaced with the corresponding modality templates. An L2 norm is then used to ensure feature consistency between the two replacement copies. Finally, the task head is replaced with the corresponding modules based on the task type.
  • ...and 1 more figures
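
The cross-modal masking idea in the Figure 2(a) caption (an image masked with another modality, then mapped back to the original) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `cross_modal_mask` and `mse`, the 2D nested-list images standing in for 3D volumes, the square block size, and the replacement ratio are all assumptions chosen for illustration, and the mean squared error is used only as a placeholder reconstruction loss.

```python
import random


def cross_modal_mask(target, other, patch, ratio, seed=0):
    """Replace random blocks of `target` with co-located blocks of `other`.

    target, other: 2D lists of equal shape, standing in for two modalities
                   of the same study (hypothetical toy data).
    patch:         side length of each square block; must divide both sides.
    ratio:         fraction of blocks to replace (an assumed hyperparameter).
    Returns the mixed image and the sorted top-left corners of replaced blocks.
    """
    h, w = len(target), len(target[0])
    if len(other) != h or any(len(row) != w for row in other):
        raise ValueError("both modalities must have the same shape")
    if h % patch or w % patch:
        raise ValueError("patch size must divide the image dimensions")

    # Top-left corners of all non-overlapping blocks.
    blocks = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    rng = random.Random(seed)
    chosen = rng.sample(blocks, round(ratio * len(blocks)))

    mixed = [row[:] for row in target]  # copy so the input stays untouched
    for i, j in chosen:
        for di in range(patch):
            for dj in range(patch):
                mixed[i + di][j + dj] = other[i + di][j + dj]
    return mixed, sorted(chosen)


def mse(a, b):
    """Mean squared error between two equally shaped 2D lists."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Toy "modalities": constant images so the replaced blocks are easy to see.
modality_a = [[1.0] * 4 for _ in range(4)]
modality_b = [[2.0] * 4 for _ in range(4)]
mixed, replaced = cross_modal_mask(modality_a, modality_b, patch=2, ratio=0.5)

# A network would take `mixed` as input and be trained to output `modality_a`;
# here the loss is evaluated on the unreconstructed input only as a reference.
print("replaced blocks:", replaced)
print("loss before reconstruction:", mse(mixed, modality_a))
```

In a real pre-training pipeline the identity comparison at the end would be replaced by an encoder-decoder whose output is compared against the original modality.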