Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Zhijian Yang, Noel DSouza, Istvan Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztian Koos, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaninathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, Erhan Bas

TL;DR

Decipher-MR targets MRI-specific foundation modeling by training a large, diverse 3D vision-language system on 200,000 MRI series from over 22,000 studies, augmented with radiology report supervision and a two-stage pretraining pipeline to align image and text representations. The model adopts a frozen encoder with modular, task-specific decoders, enabling efficient adaptation to classification, retrieval, segmentation, and localization tasks, and shows robust cross-domain performance and rapid convergence. Across extensive experiments, Decipher-MR outperforms MRI- and general-purpose baselines on multiple tasks, demonstrates strong cross-modal retrieval, and provides competitive segmentation and anomaly localization results, highlighting its potential for scalable MRI AI in clinical and research settings. The work emphasizes the importance of data diversity, region-aware supervision, and lightweight decoders for generalizable, efficient MRI analysis, while acknowledging biases and areas for future improvement such as region-level alignment and broader textual diversity.

Abstract

Magnetic Resonance Imaging is a critical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity hinder scalable, generalizable machine learning. Although foundation models have revolutionized language and vision tasks, their application to MRI remains constrained by data scarcity and narrow anatomical focus. We present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust representations for broad applications. To enable efficient use, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent improvements over existing foundation models and task-specific approaches. These results position Decipher-MR as a versatile foundation for MRI-based AI in clinical and research settings.

Paper Structure

This paper contains 27 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of Decipher-MR Dataset, Framework, and Evaluation (a) Distribution of the diverse pretraining dataset across age, sex, imaging sequences, body regions, and scanner manufacturers. The number of MRI series is shown for each body region, sequence type, and manufacturer, while the number of MRI studies is shown across age ranges and sex. (b) Overview of the two-stage pretraining framework of Decipher-MR. (c) Evaluation of Decipher-MR in a frozen encoder setup. Extracted embeddings are either used directly for retrieval tasks or paired with tunable and relatively lightweight decoders for specific tasks. Evaluation covers diverse tasks, including classification, cross-modal retrieval, and image localization.
  • Figure 1: Details of the report processing procedure, including the prompt used for the LLM, along with an example of an original unprocessed report and the final processed and restructured output used for pretraining.
  • Figure 2: Evaluation of Classification Probing Tasks (a) Comparison of foundation models on multiple medical image classification tasks using a simple MLP probe on CLS embeddings. (b) Ablation of GE MR foundation models pretrained on different subsets of data: MRI-only, head and neck only, and T2-weighted only. (c) Performance under low-data regimes, where the MLP decoder is trained with varying proportions of labeled data. Full results across all tasks are in Supplementary Tables \ref{tab:classification-disease}-\ref{tab:classification-imaging}. (d) Bias analysis of the top three foundation models, comparing performance when training and testing within or across sexes. For all figures above, mean AUC was used as the evaluation metric, except for age prediction, which was assessed using MAE (lower is better).
  • Figure 2: Examples of ground truth and text query/candidates in cross-modal retrieval tasks. (a) Head-neck tumor pathology retrieval (Source2 dataset): The ground-truth labels correspond to tumor types or specific anatomical subregions affected, and the text queries/candidates are the conclusion sections of reports describing the tumor. (b) Body region retrieval (Source1 dataset): The ground-truth category corresponds to the anatomical body region. Text queries/candidates are derived from individual organ-specific descriptions in the report. Full-report queries/candidates are formed by concatenating all such descriptions from a single report.
  • Figure 3: Cross-Modal Retrieval Performance Across Two Datasets (a) Performance on Source1 dataset: Body region retrieval (b) Performance on Source2 dataset: Head–neck tumor pathology retrieval. Metrics include mean average precision (mAP) and precision@N—the proportion of queries where all top-N retrieved items match the correct category (i.e., body region or tumor sub-anatomical location).
  • ...and 1 more figures
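
The Figure 3 caption defines precision@N strictly: a query counts as a hit only if all of its top-N retrieved items belong to the correct category. The sketch below illustrates that definition on toy data. It is not the authors' evaluation code; the function name `precision_at_n`, the category labels, and the example rankings are all hypothetical.

```python
from typing import Sequence


def precision_at_n(
    ranked_labels: Sequence[Sequence[str]],
    query_labels: Sequence[str],
    n: int,
) -> float:
    """Fraction of queries whose top-n retrieved items ALL match the query's category.

    ranked_labels[i] holds the category labels of the items retrieved for
    query i, ordered from most to least similar (hypothetical input format).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if len(ranked_labels) != len(query_labels):
        raise ValueError("one ranked list is required per query")
    if not query_labels:
        raise ValueError("at least one query is required")

    hits = 0
    for retrieved, target in zip(ranked_labels, query_labels):
        top = retrieved[:n]
        # A query is a hit only when n items were retrieved and every one matches.
        if len(top) == n and all(label == target for label in top):
            hits += 1
    return hits / len(query_labels)


# Toy example with made-up body-region labels.
example_queries = ["head", "abdomen", "spine"]
example_rankings = [
    ["head", "head", "neck"],
    ["abdomen", "pelvis", "abdomen"],
    ["spine", "spine", "spine"],
]

p_at_1 = precision_at_n(example_rankings, example_queries, 1)
p_at_2 = precision_at_n(example_rankings, example_queries, 2)
p_at_3 = precision_at_n(example_rankings, example_queries, 3)
```

Under this strict definition the score can only stay flat or drop as N grows, since a single mismatch anywhere in the top N disqualifies the query. In the toy data, all three queries pass at N=1, two pass at N=2, and only one passes at N=3.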