3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

Haowen Zhu, Ning Yin, Xiaogen Zhou

TL;DR

This work proposes MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI and significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection.

Abstract

Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.
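
The abstract does not state which objective drives the vision-language alignment stage. As a rough illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss over paired volume and report embeddings; this is a common choice for such alignment, not a confirmed detail of MedMAP. The function names, the temperature value, and the toy embeddings are all hypothetical, and the modality-aware component is deliberately left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(vision_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: average of volume-to-report and report-to-volume terms.

    vision_emb[i] and text_emb[i] are treated as a matched volume-report pair.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    v = [l2_normalize(x) for x in vision_emb]
    t = [l2_normalize(x) for x in text_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed))


if __name__ == "__main__":
    volumes = [[1.0, 0.0], [0.0, 1.0]]
    matched_reports = [[1.0, 0.0], [0.0, 1.0]]
    mismatched_reports = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:   ", symmetric_contrastive_loss(volumes, matched_reports))
    print("mismatched pairs:", symmetric_contrastive_loss(volumes, mismatched_reports))
```

Under this assumption, correctly paired embeddings yield a loss near zero, while swapped pairs yield a large loss, which is the signal that pulls matching volumes and reports together during pre-training.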

Paper Structure (12 sections, 4 equations, 2 figures, 2 tables)

This paper contains 12 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Overview of the MedMAP framework. (a) Pre-training stage: the vision and text encoders are pre-trained on modality-specific MRI volumes and reports. (b) Fine-tuning stage: the text encoder remains frozen, while a projector and the vision pipeline are trained. In addition, the cross-modal semantic aggregation (CSA) module integrates pathological visual and textual tokens through structured cross-modal interactions.
  • Figure 2: Qualitative analysis of (i) CSA module and (ii) model interpretability.