Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging

Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang

TL;DR

Triad addresses the MRI-specific gap in vision foundation models: it is a 3D MRI foundation model pre-trained on Triad-131K to learn robust representations across brain, breast, and prostate imaging. It uses autoencoder-based pretraining with organ-independent imaging descriptions to align visual features with semantic text signals, and it is evaluated on segmentation, classification, and registration in both within-domain and out-of-domain settings. Across 25 downstream datasets, Triad-based pretraining yields consistent gains over scratch baselines and CT-based pretraining in MRI tasks, though cross-modality transfer can be variable. The work demonstrates the value of modality-aligned, large-scale MRI pretraining and points to future extensions into vision-language modeling and broader multi-modality integration for enhanced clinical utility.

Abstract

Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various clinical tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. This pre-training dataset, Triad-131K, is currently the largest 3D MRI pre-training dataset. We evaluate Triad on three tasks, namely organ/tumor segmentation, organ/cancer classification, and medical image registration, in two settings (within-domain and out-of-domain) using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 2.51% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can improve performance when the data modalities and organs of upstream and downstream tasks are consistent.

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of Triad training and evaluation. a. Triad pre-training strategy. Triad implements the reconstruction task based on autoencoders and uses L1 loss for optimization. Imaging descriptions are embedded into vector space to form a distribution, which serves as a supervisory signal to constrain the distribution of visual modalities using the log-ratio loss (Kim et al., 2019). The two losses are optimized simultaneously in a multi-task manner. Triad is then evaluated on within-domain 3D MRI segmentation, classification, and registration tasks (tasks b, c, and d) and on out-of-domain, unseen 3D CT/MRI segmentation, classification, and registration tasks (tasks e, f, and g).
  • Figure 2: An overview of the Triad-131K pre-training dataset. a. The name and scale distribution of each dataset in Triad-131K. b. Comparison of the parameter scale and data scale used by Triad and existing foundation models; Triad surpasses the existing models on both scales. c. Examples of the visual volumetric modality and the textual modality in Triad-131K. d. The dataset scale distribution of three organs: brain, breast, and prostate.
  • Figure 3: Study on within-domain 3D tumor segmentation. a. Image segmentation with encoder-decoder architecture by loading the weights of Triad. b. We compare the performance of Scratch, VoCo-SSL and Triad on 5 within-domain datasets based on 3 architectures: Swin-B/L/H. c. We select the nnUNet and Swin-Transformer-Base architectures, along with 3 different weight-loading strategies, and analyze their cross-effects on performance across 5 within-domain datasets. d. Consistent with the setting in subfig. c., the radar chart of each category shows the overall advantage of Triad in tumor segmentation.
  • Figure 4: Study on out-of-domain organ/tumor segmentation. a. We select the nnUNet and Swin-Transformer-B architectures, along with three different weight loading strategies, and analyze their cross-effects on performance across six MSD CT datasets. b. Consistent with the setting of subfig. a., the radar chart shows the performance comparison of each category in MM-WHS-MRI, ATLAS-MRI, and MSD-Liver. c. Consistent with the setting of subfig. a., the radar chart shows the performance comparison of each category in Abdomen 1K, KiPA22, and MSD-Pancreas.
  • Figure 5: Study on organ/cancer classification. a. We use an encoder loaded with Triad weights and a two-layer linear classifier for classification tasks. b. Confusion matrices of the 5 datasets when using Swin-B-Triad as the encoder. The meaning of each category number is shown in Table 2. c. We select the 3D UNet and Swin-Transformer-Base architectures, along with 3 different weight loading strategies, and analyze their cross-effects on performance across 5 CT/MRI datasets. d. Consistent with the setting of subfig. c., we plot the ROC curve of each scheme on 4 datasets.
  • ...and 3 more figures
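
The pre-training objective described in the Figure 1 caption (an L1 reconstruction loss optimized jointly with a log-ratio loss) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the log-ratio loss has the triplet form used in deep metric learning with continuous labels, with distances between text embeddings of the imaging descriptions playing the role of label distances. The function names, the `eps` stabilizer, and the `weight` balancing term are hypothetical and do not come from the paper.

```python
import numpy as np


def l1_reconstruction_loss(recon, target):
    """Mean absolute error between a reconstructed volume and its target."""
    return float(np.mean(np.abs(recon - target)))


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x, shape (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def log_ratio_loss(visual_emb, text_emb, eps=1e-6):
    """Assumed triplet form: for anchor a and distinct samples i, j, penalize
    the squared gap between the log distance ratio in visual space and the
    log distance ratio in text space. eps (hypothetical) avoids log(0)."""
    n = visual_emb.shape[0]
    log_dv = np.log(pairwise_sq_dists(visual_emb) + eps)
    log_dt = np.log(pairwise_sq_dists(text_emb) + eps)
    total, count = 0.0, 0
    for a in range(n):
        for i in range(n):
            for j in range(n):
                if len({a, i, j}) < 3:
                    continue  # skip triplets with repeated indices
                visual_ratio = log_dv[a, i] - log_dv[a, j]
                text_ratio = log_dt[a, i] - log_dt[a, j]
                total += (visual_ratio - text_ratio) ** 2
                count += 1
    return total / count if count else 0.0


def pretraining_loss(recon, target, visual_emb, text_emb, weight=1.0):
    """Multi-task sum of both terms; weight is a hypothetical balancing factor."""
    return l1_reconstruction_loss(recon, target) + weight * log_ratio_loss(
        visual_emb, text_emb
    )
```

Under these assumptions, the log-ratio term is zero when the distance ratios among visual embeddings match those among the text embeddings, so the text distribution acts as a soft constraint on the visual one rather than forcing the two embeddings to coincide. The explicit triple loop is kept for readability; a practical version would vectorize it and sample triplets per batch.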