Multi-modal Vision Pre-training for Medical Image Analysis

Shaohao Rui; Lingzhi Chen; Zhenyu Tang; Lilong Wang; Mianxin Liu; Shaoting Zhang; Xiaosong Wang

TL;DR

BrainMVP addresses the lack of cross-modality learning in medical self-supervised learning (SSL) by exploiting naturally grouped multi-parametric MRI (mpMRI) data. It introduces three proxy tasks (cross-modal reconstruction, modality-wise data distillation, and modality-aware contrastive learning), together with learnable modality templates that bridge pre-training and downstream tasks. Pre-trained on $16{,}022$ mpMRI scans from $3{,}755$ patients across $8$ modalities, BrainMVP improves Dice score by $0.28\%$ to $14.47\%$ and accuracy (ACC) by $0.65\%$ to $18.07\%$ across ten downstream tasks, and shows strong label efficiency when only $40\%$ of the labeled data is used. These results on ten public benchmarks suggest that the distilled templates help carry cross-modal structure into downstream tasks, which may support scalable, privacy-conscious clinical deployment in multi-modal MRI analysis.

Abstract

Self-supervised learning has greatly facilitated medical image analysis by reducing the training data requirement for real-world applications. Current paradigms predominantly rely on self-supervision within uni-modal image data, thereby neglecting the inter-modal correlations essential for learning effective cross-modal image representations. This limitation is particularly significant for naturally grouped multi-modal data, e.g., multi-parametric MRI scans acquired for a patient under various functional imaging protocols in the same study. To bridge this gap, we propose a novel multi-modal image pre-training approach with three proxy tasks, i.e., cross-modal image reconstruction, modality-aware contrastive learning, and modality template distillation, which facilitate the learning of cross-modality representations and correlations from multi-modal brain MRI scans (over 2.4 million images in 16,022 scans of 3,755 patients). To demonstrate the generalizability of our pre-trained model, we conduct extensive experiments on various benchmarks with ten downstream tasks. Compared with state-of-the-art pre-training methods, our method achieves superior performance, with Dice score improvements of 0.28\% to 14.47\% across six segmentation benchmarks and consistent accuracy gains of 0.65\% to 18.07\% on four individual image classification tasks.
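
The abstract names cross-modal image reconstruction as a proxy task but does not say how it is implemented. The sketch below shows one plausible reading, offered only as an illustration: a random subset of patches in one modality is overwritten with the co-located patches from another modality of the same study, and a network would then be trained to recover the original volume. The function names (`cross_modal_mask`, `reconstruction_loss`), the patch size, the swap ratio, and the masked mean-squared-error loss are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def cross_modal_mask(source, other, patch=4, ratio=0.5, rng=None):
    """Overwrite random patches of `source` with co-located patches of `other`.

    Hypothetical helper: returns the corrupted volume and a boolean voxel
    mask marking which voxels were taken from `other`.
    """
    if source.shape != other.shape:
        raise ValueError("modalities must be co-registered to the same shape")
    rng = np.random.default_rng() if rng is None else rng

    # Number of whole patches that fit along each axis.
    grid = tuple(s // patch for s in source.shape)
    n_patches = int(np.prod(grid))
    n_swap = int(round(ratio * n_patches))

    # Choose which patches to swap, without replacement.
    patch_mask = np.zeros(n_patches, dtype=bool)
    patch_mask[rng.choice(n_patches, size=n_swap, replace=False)] = True
    patch_mask = patch_mask.reshape(grid)

    # Expand the patch-level mask to voxel resolution.
    expanded = patch_mask
    for axis in range(source.ndim):
        expanded = np.repeat(expanded, patch, axis=axis)
    voxel_mask = np.zeros(source.shape, dtype=bool)
    voxel_mask[tuple(slice(0, g * patch) for g in grid)] = expanded

    corrupted = np.where(voxel_mask, other, source)
    return corrupted, voxel_mask


def reconstruction_loss(prediction, target, voxel_mask):
    """Mean squared error restricted to the swapped voxels (assumed loss)."""
    diff = (prediction - target)[voxel_mask]
    return float(np.mean(diff ** 2)) if diff.size else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t1 = rng.normal(size=(8, 8, 8))  # stand-in for one MRI modality
    t2 = rng.normal(size=(8, 8, 8))  # stand-in for a second, co-registered modality
    corrupted, mask = cross_modal_mask(t1, t2, patch=4, ratio=0.5, rng=rng)
    # An untrained "model" that returns its input gives a non-zero loss.
    print(reconstruction_loss(corrupted, t1, mask))
```

In this reading, the loss is computed only on swapped voxels, so the model cannot lower it by copying the input and must instead learn how one modality maps to another. Whether BrainMVP restricts the loss in this way is not stated in the abstract.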
