Free Lunch in Medical Image Foundation Model Pre-training via Randomized Synthesis and Disentanglement

Yuhan Wei, Yuting He, Linshan Wu, Fuxiang Huang, Junlin Hou, Hao Chen

TL;DR

RaSD pre-trains medical image foundation models entirely on synthetic data that is generated on the fly from randomized Gaussian structures and appearance variations. Through prototype disentangling learning, RaSD encourages region-wise semantic decoupling and cohesive regional features, enabling robust transfer across 6 modalities, 48 datasets, and 56 downstream tasks. Across 3D CT/MRI, 2D X-ray, ultrasound, fundus, and pathology domains, RaSD matches or surpasses real-data pre-trained baselines on many tasks, while requiring no data storage, preserving privacy, and supporting scalable online training. The results suggest that synthetic data alone can support scalable, generalizable medical AI foundation models, with broader implications for privacy-conscious clinical deployment and rapid model expansion.

Abstract

Medical image foundation models (MIFMs) have demonstrated remarkable potential for a wide range of clinical tasks, yet their development is constrained by the scarcity, heterogeneity, and high cost of large-scale annotated datasets. Here, we propose RaSD (Randomized Synthesis and Disentanglement), a scalable framework for pre-training MIFMs entirely on synthetic data. By modeling anatomical structures and appearance variations with randomized Gaussian distributions, RaSD exposes models to sufficient multi-scale structural and appearance perturbations, forcing them to rely on invariant and task-relevant anatomical cues rather than dataset-specific textures, thereby enabling robust and transferable representation learning. We pre-trained RaSD on 1.2 million 3D volumes and 9.6 million 2D images, and extensively evaluated the resulting models across 6 imaging modalities, 48 datasets, and 56 downstream tasks. Across all evaluated downstream tasks, RaSD consistently outperforms training-from-scratch models, achieves the best performance on 17 tasks, and remains comparable to models pre-trained on large real datasets on most of the others. These results demonstrate the capacity of synthetic data alone to drive robust representation learning. Our findings establish a paradigm shift in medical AI, demonstrating that synthetic data can serve as a "free lunch" for scalable, privacy-preserving, and clinically generalizable foundation models.
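
The abstract describes synthesizing image and label pairs from randomized Gaussian distributions in a streaming fashion. The following is a minimal 2D sketch of that general idea, not the authors' generator: the function names (`synth_pair`, `stream`), the choice of isotropic Gaussian bumps for structure, per-region Gaussian intensity noise for appearance, and all numeric ranges are assumptions made purely for illustration.

```python
import numpy as np

def synth_pair(shape=(64, 64), n_regions=4, rng=None):
    """Return one synthetic (image, mask) pair; hypothetical illustration only."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Structure (assumed): each region is an isotropic Gaussian bump with a
    # random center and width; a pixel takes the label of the strongest bump.
    centers = rng.uniform([0, 0], [h, w], size=(n_regions, 2))
    sigmas = rng.uniform(0.1, 0.4, size=n_regions) * min(h, w)
    d2 = ((yy[None] - centers[:, 0, None, None]) ** 2
          + (xx[None] - centers[:, 1, None, None]) ** 2)
    bumps = np.exp(-d2 / (2.0 * sigmas[:, None, None] ** 2))
    mask = bumps.argmax(axis=0)
    # Appearance (assumed): each region draws its own intensity mean and
    # noise level, so texture carries no fixed meaning across samples.
    means = rng.uniform(0.0, 1.0, size=n_regions)
    stds = rng.uniform(0.01, 0.1, size=n_regions)
    image = means[mask] + stds[mask] * rng.standard_normal(shape)
    return image.astype(np.float32), mask.astype(np.int64)

def stream(batch_size, steps, seed=0):
    """Yield freshly generated batches; nothing is written to disk."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        pairs = [synth_pair(rng=rng) for _ in range(batch_size)]
        images, masks = zip(*pairs)
        yield np.stack(images), np.stack(masks)
```

Because each batch is drawn from the random generator when it is requested, a training loop that consumes `stream(...)` never needs a stored dataset, which is the storage and privacy property the abstract emphasizes. A 3D variant would add a third coordinate axis in the same way.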

Paper Structure (39 sections, 6 equations, 13 figures, 21 tables)

This paper contains 39 sections, 6 equations, 13 figures, 21 tables.

Figures (13)

  • Figure 1: Overview of the proposed RaSD framework. a) Comparison between the conventional real data paradigm and our RaSD paradigm. Real data-based pre-training of MIFMs suffers from costly and restricted data acquisition, labor-intensive manual annotation, and large storage demands, leading to limited discriminative capacity. In contrast, RaSD synthesizes diverse image–label pairs from randomized distributions in a streaming manner, requiring no pre-generated data storage. b) Pre-training data scale of existing foundation models compared with RaSD. Unlike prior FMs that depend on large-scale real-image datasets, RaSD achieves large-scale pre-training solely from synthetic data at zero real-data cost. c) Core components of RaSD. Randomized images and masks are generated on-the-fly, and the model is pre-trained to disentangle structural features and learn discriminative representations. d) Real-data adaptation. RaSD-pretrained FMs can be efficiently fine-tuned across diverse modalities (e.g., MRI, CT, X-ray, mammography, pathology and fundus), enabling efficient adaptation to task-specific models. e) Benchmarking results. RaSD-pretrained models (2D/3D) achieve competitive or superior performance compared with state-of-the-art foundation models across 48 datasets and 56 downstream tasks, demonstrating strong transferability and robustness.
  • Figure 2: Our evaluations across diverse 3D radiology downstream tasks (CT and MR) demonstrate the strong transferability of RaSD. For CT tasks (a–n), RaSD covers multiple organs and anatomical structures across different body regions (e.g., abdomen, chest, and vertebrae), delivering competitive performance despite being trained solely on synthetic data, highlighting its potential as a universal MIFM pre-training strategy. For MR tasks (o–t), spanning multiple organs and anatomical regions (e.g., brain, heart and knee), RaSD likewise achieves stable gains. RaSD achieves consistent effectiveness across both CT and MR modalities, validating its ability to capture transferable cross-modal knowledge from synthetic data.
  • Figure 3: Our evaluations across 16 X-ray datasets and 4 ultrasound datasets demonstrate the broad applicability of RaSD. For chest and skeletal X-rays (a–f), RaSD achieves competitive results on classification (a–c) and segmentation (d–f), confirming its strong transferability across major diagnostic applications. For mammography (g–p), spanning classification (g–l), detection (m, n), and segmentation (o, p), RaSD consistently delivers strong generalization despite being trained solely on synthetic data. For ultrasound tasks (q–t), RaSD surpasses the from-scratch model and is comparable to other methods. These results highlight the robustness of RaSD in capturing transferable representations across diverse X-ray sub-modalities and task types. "Composition", "BI-RADS", and "Pathology" are the three standard criteria in breast cancer diagnosis [fowler2013breast].
  • Figure 4: RaSD demonstrates robust generalization to both 2D fundus and pathology tasks. Evaluations span fundus (a–g) and pathology (h–p). For fundus segmentation tasks (a–e), spanning vessel, optic disc, and ridge structures across five benchmark datasets, RaSD achieves competitive Dice performance compared with SOTA FMs. For fundus classification tasks (f–g), covering retinal disease identification, RaSD likewise delivers stable AUC improvements over the scratch baseline and matches or surpasses real-data pre-trained models. Across nine pathology segmentation datasets (h–p), RaSD consistently matches or surpasses the performance of specialized models, achieving SOTA results on key datasets such as CoCaHis and Janowczyk. Collectively, these results across the fundus and pathology modalities validate RaSD's capacity to learn broadly transferable visual representations from synthetic data.
  • Figure 5: t-SNE visualization of learned representations on the AMOS-MR, VerSe20, and CC-CCII datasets, showing that RaSD produces more semantically coherent clusters.
  • ...and 8 more figures