Vision Language Models in Medicine

Beria Chingnabe Kalpelbe, Angel Gabriel Adaambiik, Wei Peng

TL;DR

Medical Vision-Language Models (Med-VLMs) integrate visual and textual data to enhance diagnostics, reporting, and decision support in healthcare. The review maps the field from foundational multi-modal architectures to state-of-the-art general-purpose models (e.g., BLIP-2, LLaVA, MiniGPT-4), and surveys domain-specific medical models (MedViLL, MedCLIP, RadFM, VividMed) and evaluation benchmarks (CheXpert, PMC-VQA, CT-RATE). It highlights current challenges (data scarcity, limited generalization, interpretability, privacy, and workflow integration) and outlines future directions including scalable datasets, cross-modal generalization, and privacy-preserving training. The work emphasizes the clinical significance of rigorous benchmarking, human-centered evaluation, and regulatory considerations to ensure ethical and effective adoption in healthcare.

Abstract

With the advent of Vision-Language Models (VLMs), medical artificial intelligence (AI) has experienced significant technological progress and paradigm shifts. This survey provides an extensive review of recent advancements in Medical Vision-Language Models (Med-VLMs), which integrate visual and textual data to enhance healthcare outcomes. We discuss the foundational technology behind Med-VLMs, illustrating how general models are adapted for complex medical tasks, and examine their applications in healthcare. The transformative impact of Med-VLMs on clinical practice, education, and patient care is highlighted, alongside challenges such as data scarcity, narrow task generalization, interpretability issues, and ethical concerns like fairness, accountability, and privacy. These limitations are exacerbated by uneven dataset distribution, computational demands, and regulatory hurdles. Rigorous evaluation methods and robust regulatory frameworks are essential for safe integration into healthcare workflows. Future directions include leveraging large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability. Innovations like federated learning, lightweight architectures, and Electronic Health Record (EHR) integration are explored as pathways to democratize access and improve clinical relevance. This review aims to provide a comprehensive understanding of Med-VLMs' strengths and limitations, fostering their ethical and balanced adoption in healthcare.

Paper Structure

This paper contains 63 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Comprehensive Framework for Medical Vision-Language Models (VLMs). (a) Training involves processing diverse inputs such as images, texts, metadata, and historical data, followed by pre-training. (b) Benchmarking is conducted on a variety of medical datasets including GMAI-MMBench, OmniMedVQA, RadBench, and others. (c) Advanced training strategies are employed, such as vision-text alignment, knowledge distillation, masked language modeling, contrastive learning, and parameter-efficient tuning. (d) Evaluation strategies encompass automated metrics like BLEU, ROUGE, BERTScore, and clinical-specific tools like CheXpert Labeler and RadGraph, alongside human evaluation. (e) Integration of VLMs into the medical workflow leverages contextual data to provide actionable insights and improve clinical decision-making.
  • Figure 2: Architecture of VisualBERT [li2019visualbertsimpleperformantbaseline]. This model integrates visual and textual inputs using a transformer-based architecture. Text tokens (e.g., "No focal consolidation, effusion or pneumothorax") and visual features extracted from the corresponding image are combined, along with positional and segment embeddings. The model is trained with dual objectives: masked language modeling (Objective 1) and visual-text alignment (Objective 2). This allows VisualBERT to effectively learn contextual representations that align both modalities for downstream tasks.
  • Figure 3: Architecture of MedViLL [Johnson2019]. The model combines visual and language embeddings to enable joint representation learning for medical applications. (A) Visual embeddings are generated using random pixel sampling and positional encodings from medical images (e.g., X-rays). (B) Language embeddings incorporate tokens with segment and positional encodings from corresponding reports. Both embeddings are processed in (C) a joint embedding space through a bidirectional auto-regressive self-attention mechanism within a transformer. The model supports two primary tasks: image-report matching and masked language modeling, enabling robust multimodal understanding for clinical applications.
  • Figure 4: Summary of the approach for CLIP [radford2021learningtransferablevisualmodels]: Contrastive pre-training aligns image and text embeddings (1), enabling the creation of dataset classifiers from textual labels (2), and facilitating zero-shot predictions by matching image embeddings with textual prompts (3).
  • Figure 5: Architecture of VividMed [luo2024vividmedvisionlanguagemodel]: Combines a ViT encoder for image embedding and a localization decoder for binary set prediction and spatial region identification, while leveraging a Large Language Model to generate medical descriptions based on multimodal inputs, including mask and box queries.
  • ...and 4 more figures
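
The CLIP-style pipeline summarized in the Figure 4 caption (contrastive alignment of image and text embeddings, then zero-shot prediction by matching image embeddings against textual prompts) can be illustrated with a minimal sketch. This is not the implementation from the paper or from CLIP itself: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration, and real systems would obtain the embeddings from trained image and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image embedding and every text embedding."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stabilized)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair.

    The temperature is an illustrative constant (hypothetical choice here).
    """
    sims = similarity_matrix(image_embs, text_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    # Image-to-text: each image should pick out its own report among all texts.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each text should pick out its own image among all images.
    cols = [list(c) for c in zip(*logits)]
    t2i = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def zero_shot_predict(image_embs, prompt_embs):
    """For each image, return the index of the most similar textual prompt."""
    sims = similarity_matrix(image_embs, prompt_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]

# Toy embeddings standing in for encoder outputs (hypothetical values).
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reports = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mismatched = [reports[1], reports[2], reports[0]]

aligned_loss = contrastive_loss(images, reports)
mismatched_loss = contrastive_loss(images, mismatched)
predictions = zero_shot_predict(images, mismatched)
```

In this toy setup the loss is lower when each image is paired with its own report than when the pairs are shuffled, which is the signal contrastive pre-training optimizes. Zero-shot prediction reuses the same similarity computation, with class prompts in place of reports.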