Delving into Out-of-Distribution Detection with Medical Vision-Language Models

Lie Ju, Sijin Zhou, Yukun Zhou, Huimin Lu, Zhuoting Zhu, Pearse A. Keane, Zongyuan Ge

TL;DR

This paper tackles out-of-distribution (OOD) detection in medical vision-language models (VLMs), where variability in medical imaging can lead to overconfident or erroneous predictions. It systematically evaluates state-of-the-art CLIP-like OOD detection methods across both general-purpose and domain-specific medical VLMs, using a new cross-modality benchmark that covers semantic and covariate shifts. The authors introduce a hierarchical prompt-based approach, along with a few-shot fine-tuning variant, to improve OOD separability and robustness. The findings show that while domain-specific VLMs perform best on in-distribution (ID) data, their OOD detection benefits significantly from hierarchical prompts and limited fine-tuning, offering a practical path toward more trustworthy medical AI systems in real-world settings.

Abstract

Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general-purpose and domain-specific models. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments validate the effectiveness of our approach. The code is available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: (a) Problem illustration. (i) ID data: classes whose textual descriptions are seen by CLIP-like models (e.g., diabetic retinopathy). (ii) Covariate-shifted OOD data, which remain semantically related to the ID classes but exhibit covariate shifts, such as low image quality or differences in imaging devices (e.g., ultrawide-field fundus imaging). (iii) & (iv) OOD data that are semantically unrelated to the ID classes. (b) A simple baseline experiment demonstrates that advanced OOD detection techniques (e.g., MCM; Ming et al., 2022) tend to fail in covariate-shifted OOD scenarios.
  • Figure 2: The fine-tuning pipeline for OOD detection with proposed hierarchical prompts.
  • Figure 3: Comparison results across various general-purpose models (GPMs) and domain-specific models (DSMs).
  • Figure 4: (a) Visualization of the mixed OOD score distribution. (b) Few-shot OOD detection results with different numbers of ID training samples for fine-tuning.
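
Figure 1(b) uses MCM as its baseline OOD detector. For readers unfamiliar with it, the following is a minimal sketch of an MCM-style score, assuming the image and class-prompt embeddings have already been extracted from a CLIP-like model. The function name `mcm_score`, the toy embeddings, and the default temperature are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np


def mcm_score(image_feat, text_feats, temperature=1.0):
    """MCM-style score: the maximum softmax probability over the
    temperature-scaled cosine similarities between one image embedding
    and the K class-prompt embeddings. Lower values suggest OOD.

    image_feat: array of shape (D,)
    text_feats: array of shape (K, D), one row per ID class prompt
    """
    # L2-normalise so that dot products are cosine similarities.
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = txt @ img  # shape (K,)

    # Softmax over classes, shifted by the max for numerical stability.
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs.max())


# Toy embeddings (hypothetical): three orthogonal class prompts in 3-D.
text_feats = np.eye(3)
aligned_image = np.array([1.0, 0.0, 0.0])    # matches the first prompt
ambiguous_image = np.array([1.0, 1.0, 1.0])  # equally similar to all prompts

print(mcm_score(aligned_image, text_feats))
print(mcm_score(ambiguous_image, text_feats))
```

The image aligned with one prompt receives a higher score than the image that is equally similar to every prompt, whose score falls to the floor of 1/K. In practice a threshold on this score separates ID from OOD inputs. A covariate-shifted image that is still semantically close to an ID class can keep a high score, which is consistent with the failure mode the caption describes.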