Foundation Models in Medical Imaging: A Review and Outlook

Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman, Edwin D. de Jong, Hugo Horlings, Clárisa I. Sanchez, Cees G. M. Snoek, Lodewyk Wessels, Ritse Mann, Eric Marcus, Jonas Teuwen

TL;DR

This review surveys vision-based foundation models (FMs) in medical imaging, focusing on pathology, radiology, and ophthalmology. It clarifies the core components of the FM pipeline: backbone architectures, self-supervised learning (SSL), and downstream adaptation. It also documents advances from tile-level to slide-level and multimodal FMs, including vision-language models and integrations of the Segment Anything Model (SAM). The authors highlight the pivotal role of in-domain, large-scale SSL and discuss practical barriers to clinical deployment, such as data access, 3D modalities, robustness, and governance. The work identifies data curation and domain-specific adaptation as key levers for performance gains, and calls for standardized benchmarks and responsible regulation to enable safe, scalable clinical adoption.

Abstract

Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.

Paper Structure

This paper contains 67 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of core technical concepts behind foundation models for medical image analysis. Dotted lines represent optional components. The backbone architecture consists of vision encoders processing imaging data, while auxiliary encoders incorporate complementary modalities in multimodal settings, such as clinical notes, patient records, or molecular data. Self-supervised learning methods, such as contrastive, generative, and distillation approaches, are used to pretrain the model and learn rich representations from the data. In multimodal scenarios, representations from different encoders can be fused before adaptation to downstream tasks. Adaptation strategies range from lightweight, zero/few-shot methods to heavier approaches such as full fine-tuning.
  • Figure 2: Two simplified examples of how different architectures, SSL techniques, and adaptation methods can be combined in foundation models for medical image analysis. Top row: a vision encoder backbone is pre-trained using masked autoencoding, in which it learns to reconstruct missing parts of the training images and, in doing so, learns to encode images in a meaningful way. After pre-training, a small labeled training dataset is encoded by the vision foundation model (VFM), and these embeddings serve as input for training a small tumor classification head. Bottom row: both a vision encoder and a text encoder are trained. In the pre-training phase, the vision-language foundation model (VLFM) learns to group similar images and text through a contrastive SSL objective. After pre-training, the model is prompted in a zero-shot way: it is given a new image with a corresponding question and must predict the answer without any additional fine-tuning.
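
The bottom row of Figure 2 can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric contrastive (InfoNCE) loss, which is one common formulation and not necessarily the objective used by any specific model covered in the review. The two-dimensional embeddings, the temperature of 0.07, the function names, and the "tumor"/"normal" prompts are all hypothetical stand-ins for the outputs of real vision and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(image_embs, text_embs, temperature=0.07):
    """Image-by-text matrix of cosine similarities divided by the temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def zero_shot_predict(image_emb, prompt_embs, labels):
    """Return the label whose prompt embedding is most similar to the image."""
    scores = cosine_logits([image_emb], prompt_embs)[0]
    best = max(range(len(labels)), key=lambda j: scores[j])
    return labels[best]


# Toy embeddings (hypothetical): row k of each list forms a matched pair.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
text_embs = [[0.9, 0.2], [0.2, 0.9]]

aligned = symmetric_contrastive_loss(image_embs, text_embs)
mismatched = symmetric_contrastive_loss(image_embs, text_embs[::-1])
print("aligned loss lower than mismatched:", aligned < mismatched)

# Zero-shot use: no further training, only a similarity lookup against prompts.
print(zero_shot_predict([0.2, 1.0], text_embs, ["tumor", "normal"]))
```

With these toy vectors, the correctly paired batch yields a lower loss than the mismatched one, and the query image is assigned to the prompt whose embedding it is closest to. In an actual VLFM the embeddings would come from trained encoders, and the temperature is often a learned parameter rather than a fixed constant.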