Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision

Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorgpour, Amirhossein Kazerouni, Islem Rekik, Dorit Merhof

TL;DR

This survey maps the rise of foundation models in medical imaging, organizing them into textually prompted and visually prompted families and detailing subtypes by training strategy (contrastive, generative, hybrid, conversational). It covers key models (e.g., MedCLIP, BiomedCLIP, MedSAM, MedBLIP, LLaVA-Med, Med-PaLM M) and their applications across modalities and organs, while analyzing hardware, data, interpretability, and safety challenges. The authors offer a structured taxonomy, analyze practical constraints, and outline open problems and future directions, emphasizing open-source multimodal development, benchmarking, and frequency-aware representations. Collectively, the work provides a roadmap for researchers and clinicians to leverage multimodal, promptable AI for privacy-preserving, data-efficient medical imaging analysis and decision support.

Abstract

Foundation models, large-scale pre-trained deep-learning models that can be adapted to a wide range of downstream tasks, have recently gained significant interest, and many deep-learning problems are undergoing a paradigm shift with their rise. Trained on large-scale datasets to bridge the gap between different modalities, foundation models facilitate contextual reasoning, generalization, and prompt capabilities at test time. The predictions of these models can be adjusted for new tasks by augmenting the model input with task-specific hints called prompts, without requiring extensive labeled data and retraining. Capitalizing on the advances in computer vision, medical imaging has also shown a growing interest in these models. To assist researchers in navigating this direction, this survey provides a comprehensive overview of foundation models in the domain of medical imaging. Specifically, we begin with an exposition of the fundamental concepts forming the basis of foundation models. Subsequently, we offer a methodical taxonomy of foundation models within the medical domain, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models. Furthermore, we emphasize the practical use cases of some selected approaches and then discuss the opportunities, applications, and future directions of these large-scale pre-trained models for analyzing medical images. In the same vein, we address the prevailing challenges and research pathways associated with foundation models in medical imaging. These encompass interpretability, data management, computational requirements, and the nuanced issue of contextual comprehension.

Paper Structure (34 sections, 5 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: The diagram (a) displays the distribution of published papers categorized by their algorithm, (b) categorizes them by their imaging modalities, and (c) classifies them by the type of organ concerned. It is worth noting that the total number of papers included in the analysis is 40.
  • Figure 2: Visual illustration of how our extensive classification categorizes existing works into textually and visually prompted models, distinct from traditional vision models.
  • Figure 3: The suggested taxonomy for foundational models used in medical imaging research consists of six distinct groups: I) VPM-Generalist, II) TPM-Hybrid, III) TPM-Contrastive, IV) TPM-Generative, V) VPM-Adaptations, and VI) TPM-Conversational. To maintain conciseness, we assign ascending prefix numbers to the paper names in each category and cite each study accordingly as follows: 1. shi2023generalist, 2. moor2023foundation, 3. zhang2023biomedgpt, 4. tu2023generalist, 5. wu2023generalist, 6. zhou2023foundation, 7. chen2023medblip, 8. bioengineering10030380, 9. tiu2022expert, 10. wang2022medclip, 11. bannur2023learning, 12. liu2023clip, 13. zhang2023large, 14. chen2023towards, 15. lu2023visual, 16. chen2023knowledge, 17. huang2023visual, 18. Xu2023ELIXRTA, 19. huang2023enhancing, 20. zhang2023text, 21. yan2022clinical, 22. singhal2023towards, 23. moor2023med, 24. ma2023segment, 25. deng2023samu, 26. cheng2023sammed2d, 27. nguyen2023lvm, 28. hu2023efficiently, 29. wu2023medical, 30. zhang2023customized, 31. vorontsov2023virchow, 32. wu2023pmcllama, 33. wang2023clinicalgpt, 34. li2023llava, 35. thawkar2023xraygpt, 36. liu2023radiologyllama2, 37. wang2023chatcad, 38. Liu2023DeIDGPTZM, 39. yunxiang2023chatdoctor, 40. shu2023medalpaca
  • Figure 4: Schematic of MI-Zero lu2023visual. A gigapixel WSI is transformed into a set of patches (instances), with each patch being embedded into an aligned visual-language latent space. The similarity scores between the embeddings of patches and the embeddings of prompts are then combined using a permutation-invariant operation, such as top-K max-pooling, to generate the classification prediction at the WSI level.
  • Figure 5: The SAM-Med2D pipeline cheng2023sammed2d involves freezing the image encoder and introducing learnable adapter layers within each Transformer block to assimilate domain-specific expertise in the medical domain. The prompt encoder is fine-tuned using point, Bbox, and mask information, with the mask decoder's parameters being updated through interactive training.
  • ...and 3 more figures
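
The patch-to-slide aggregation described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the MI-Zero implementation: the function names (`cosine_similarity`, `topk_mean_pool`, `classify_slide`), the use of cosine similarity, and the choice to average the K highest patch scores are all assumptions made for illustration. The embeddings are toy vectors standing in for the outputs of aligned image and text encoders.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topk_mean_pool(scores: Sequence[float], k: int) -> float:
    """Average the k largest scores.

    Sorting discards patch order, so the result is permutation-invariant.
    With k=1 this reduces to plain max-pooling.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, min(k, len(scores)))
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def classify_slide(
    patch_embeddings: Sequence[Vector],
    prompt_embeddings: Sequence[Vector],
    k: int = 2,
) -> Tuple[int, List[float]]:
    """Return (predicted class index, per-class slide-level scores).

    Each prompt embedding is assumed to represent one class.
    """
    slide_scores = []
    for prompt in prompt_embeddings:
        patch_scores = [cosine_similarity(p, prompt) for p in patch_embeddings]
        slide_scores.append(topk_mean_pool(patch_scores, k))
    predicted = max(range(len(slide_scores)), key=slide_scores.__getitem__)
    return predicted, slide_scores


# Toy example: three patches and two class prompts in a 2-D latent space.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
predicted_class, class_scores = classify_slide(patches, prompts, k=2)
```

In this toy case two of the three patches align with the first prompt, so the first class receives the higher slide-level score. Shuffling the patch list leaves the scores unchanged, which is the permutation-invariance property the caption refers to.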