Foundation Models in Medical Imaging: A Review and Outlook
Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman, Edwin D. de Jong, Hugo Horlings, Clárisa I. Sanchez, Cees G. M. Snoek, Lodewyk Wessels, Ritse Mann, Eric Marcus, Jonas Teuwen
TL;DR
This review surveys vision-based foundation models (FMs) in medical imaging, focusing on pathology, radiology, and ophthalmology. It clarifies the core components of the FM pipeline: backbone architectures, self-supervised learning (SSL), and downstream adaptation. It also documents advances from tile-level to slide-level and multimodal FMs, including vision-language models and integrations of the Segment Anything Model (SAM). The authors highlight the pivotal role of in-domain, large-scale SSL and discuss data access, 3D modalities, robustness, and governance as practical barriers to clinical deployment. The work underscores data curation and domain-specific adaptations as key levers for performance gains, while calling for standardized benchmarks and responsible regulation to enable safe, scalable clinical adoption.
Abstract
Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
