Vision Language Models in Medicine

Beria Chingnabe Kalpelbe; Angel Gabriel Adaambiik; Wei Peng

TL;DR

Medical Vision-Language Models (Med-VLMs) integrate visual and textual data to enhance diagnostics, reporting, and decision support in healthcare. The review maps the field from foundational multi-modal architectures and state-of-the-art general-purpose VLMs (e.g., BLIP-2, LLaVA, MiniGPT-4) to domain-specific medical models (MedViLL, MedCLIP, RadFM, VividMed) and evaluation datasets and benchmarks (CheXpert, PMC-VQA, CT-RATE). It highlights current challenges (data scarcity, limited generalization, interpretability, privacy, and workflow integration) and outlines future directions including scalable datasets, cross-modal generalization, and privacy-preserving training. The work emphasizes the clinical importance of rigorous benchmarking, human-centered evaluation, and regulatory considerations for ethical and effective adoption in healthcare.

Abstract

With the advent of Vision-Language Models (VLMs), medical artificial intelligence (AI) has experienced significant technological progress and paradigm shifts. This survey provides an extensive review of recent advancements in Medical Vision-Language Models (Med-VLMs), which integrate visual and textual data to enhance healthcare outcomes. We discuss the foundational technology behind Med-VLMs, illustrating how general models are adapted for complex medical tasks, and examine their applications in healthcare. The transformative impact of Med-VLMs on clinical practice, education, and patient care is highlighted, alongside challenges such as data scarcity, narrow task generalization, interpretability issues, and ethical concerns like fairness, accountability, and privacy. These limitations are exacerbated by uneven dataset distribution, computational demands, and regulatory hurdles. Rigorous evaluation methods and robust regulatory frameworks are essential for safe integration into healthcare workflows. Future directions include leveraging large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability. Innovations like federated learning, lightweight architectures, and Electronic Health Record (EHR) integration are explored as pathways to democratize access and improve clinical relevance. This review aims to provide a comprehensive understanding of Med-VLMs' strengths and limitations, fostering their ethical and balanced adoption in healthcare.
