On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study

Minh-Hao Van, Prateek Verma, Xintao Wu

TL;DR

The paper empirically assesses the zero-shot and few-shot capabilities of visual language models (VLMs) on medical image classification across three datasets: brain MRI, ALL-IDB2 microscopy, and chest X-ray. It compares five VLMs (BiomedCLIP, OpenCLIP, OpenFlamingo, LLaVA, ChatGPT-4) against CNN baselines, without retraining the VLMs, to quantify emergent multimodal reasoning in clinical contexts. CNNs still achieve the highest accuracies, but certain VLMs show strong zero-shot and few-shot performance, with BiomedCLIP, ChatGPT-4, and OpenFlamingo excelling on specific datasets; the study also highlights the critical impact of prompt design and demonstration strategies. The results underscore both the promise and the current limitations of VLMs in medical image classification, including safety, privacy, and data quality concerns, and point to future directions such as segmentation tasks and more domain-specific training.

Abstract

Recently, large language models (LLMs) have taken the spotlight in natural language processing. Further, integrating LLMs with vision enables users to explore emergent abilities with multimodal data. Visual language models (VLMs), such as LLaVA, Flamingo, or CLIP, have demonstrated impressive performance on various visio-linguistic tasks. Consequently, large models have many potential applications in the biomedical imaging field. However, little related work shows the ability of large models to diagnose diseases. In this work, we study the zero-shot and few-shot robustness of VLMs on medical imaging analysis tasks. Our comprehensive experiments demonstrate the effectiveness of VLMs in analyzing biomedical images such as brain MRIs, microscopic images of blood cells, and chest X-rays.

Paper Structure (9 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: An example of BiomedCLIP in predicting brain tumor type from MRIs. Green highlighted text indicates correct prediction by the method.
  • Figure 2: An example of OpenFlamingo in predicting brain tumor type from MRIs. Green highlighted text indicates correct prediction by the method.
  • Figure 3: An example of LLaVA in predicting brain tumor type from MRIs.
  • Figure 4: An example of ChatGPT-4 in predicting COVID-19 from scans of a chest X-ray.