How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?

Sicheng Wang; Che Liu; Rossella Arcucci

TL;DR

This work systematically assesses how diverse textual prompts influence zero-shot medical vision-language tasks across three mainstream MedVLP models (BioViL, MedKLIP, KAD) and three chest X-ray datasets. By designing six prompt styles with interpretability ratings and evaluating on seen and unseen diseases, the study reveals substantial robustness gaps, with model performance fluctuating across prompt styles and, in some cases, improving with more informative prompts for unseen diseases. MedKLIP shows the strongest sensitivity to prompt style but also benefits from highly interpretable prompts for unseen classes; BioViL remains relatively stable but generally weaker, while KAD is powerful yet highly prompt-sensitive. Based on these findings, the authors propose a practical pretraining recipe emphasizing domain knowledge integration, informative textual pretraining, and exposure to diverse prompt styles to enhance robustness in future MedVLP systems. The results underscore the need for robustness to diverse prompts to ensure reliable clinical deployment of medical vision-language models.

Abstract

Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models' performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.
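
To make the notion of prompt sensitivity concrete, the sketch below shows one way it could be quantified for a single disease class: compute the AUC obtained with each prompt style, then report the macro average and the spread across styles. This is an illustrative assumption, not the authors' evaluation code; the function names (`auc`, `prompt_sensitivity`), the style labels, and the toy scores are all hypothetical.

```python
from statistics import mean


def auc(labels, scores):
    """Binary ROC AUC via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Fraction of (positive, negative) pairs ranked correctly.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prompt_sensitivity(labels, scores_by_style):
    """Per-style AUC, macro average over styles, and max-min spread."""
    per_style = {style: auc(labels, s) for style, s in scores_by_style.items()}
    values = list(per_style.values())
    return {
        "per_style": per_style,
        "macro_avg": mean(values),
        "spread": max(values) - min(values),
    }


# Hypothetical toy data: 4 images, ground truth for one disease class,
# and the model's similarity scores under two prompt styles.
labels = [1, 1, 0, 0]
scores_by_style = {
    "original": [0.9, 0.8, 0.2, 0.1],
    "style_a": [0.9, 0.3, 0.4, 0.1],
}
report = prompt_sensitivity(labels, scores_by_style)
print(report)
```

A model that is robust in the sense described above would show a small spread, with the macro average close to the AUC of the original prompt.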

Paper Structure (20 sections, 6 figures, 14 tables)

Introduction
Related Work
General Vision Language Pre-training
Medical Zero-shot Classification Task
Prompt Engineering for Zero-shot Task
Methods
Overview
Preliminary
Design of Diverse Prompts
Prompt Interpretability Rating
Experimental Setting
Datasets
Implementation
Results and Analysis
Performance on Seen Classes
...and 5 more sections

Figures (6)

Figure 1: Comparison of original prompt and six style prompts' zero-shot image classification performance on seen disease classes of BioViL, MedKLIP and KAD. The X-axis shows the AUC performance of the original prompt, and the Y-axis shows the macro average of AUC performance of six style prompts. The dashed line shows the ideal scenario, where the model shows consistent performance on seen classes regardless of the prompt style.
Figure 2: Framework of Three Mainstream MedVLP Models. BioViL: Phase 1 conducts Masked Language Modelling (MLM) on a diverse corpus, including PubMed abstracts, MIMIC-III clinical notes, and MIMIC-CXR radiology reports. Phase 2 involves textual contrastive learning between the Findings section and the Impression section of MIMIC-CXR reports. Phase 3 projects encoded image and text representations into a global space, then applies contrastive learning between them. MedKLIP: Pre-training involves extracting entity, position, and existence triplets from MIMIC-CXR reports. The model then translates simple entities into detailed descriptions and feeds these triplets into the fusion module together with encoded X-ray images. Lastly, it applies contrastive learning between image and text representations, and supervised learning based on the prediction results. KAD: Phase 1 pre-trains the knowledge-enhanced text encoder by applying contrastive learning between definition and concept pairs extracted from the Unified Medical Language System (UMLS) knowledge graph. Phase 2 applies combined contrastive learning between encoded entities and images and supervised learning on the disease query network by randomly selecting encoded entity and image pairs.
Figure 3: Pipeline of diverse prompt generation and interpretability score rating.
Figure 4: Heatmap demonstrating the performance of different models on seen disease classes with all prompt styles. The best performing prompt style of each disease class is highlighted with thick cell border and italic font.
Figure 5: Bar charts demonstrating the performance difference of non-baseline prompt styles with baseline prompt style of different models on seen disease classes.
...and 1 more figures
