Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

Danfeng Guo; Demetri Terzopoulos

TL;DR

This work addresses the hallucination and minority-pathology problems of medical large vision-language models (MLVLMs) in visual question answering. It introduces two prompt-based strategies: (1) prompting with a detailed explanation of the queried pathology and (2) adding the judgment of a cheap weak learner to the prompt as a reference. On MIMIC-CXR-JPG and CheXpert, the approach raises diagnostic F1 by up to 0.27, with notable reductions in false positives when using weak-learner prompts. The strategies also extend to general-domain LVLMs, where, under POPE evaluation, they suppress false negatives and improve recall by approximately 0.07. Together they offer a practical, low-cost way to reduce hallucinations and bias-driven errors in VQA systems.
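The metrics named above are tied to error types in a simple way: recall rises as false negatives fall, precision rises as false positives fall, and F1 is the harmonic mean of the two. A minimal sketch of these standard definitions follows; the function name and the counts are invented for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard binary-classification metrics from confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts: turning 7 false negatives into true positives
# raises recall while the number of false positives stays fixed.
baseline = precision_recall_f1(tp=60, fp=20, fn=40)
improved = precision_recall_f1(tp=67, fp=20, fn=33)
```

With these made-up counts, only recall-related errors change between the two calls, so any gain in F1 comes from the recovered false negatives.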

Abstract

Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although they demonstrate satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and CheXpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, they effectively suppress the false negative predictions of existing LVLMs and improve Recall by approximately 0.07.
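The two strategies can be pictured as plain string assembly before the question reaches the MLVLM. The sketch below is only an illustration under stated assumptions: the explanation text, the prompt wording, the names `PATHOLOGY_EXPLANATIONS` and `build_prompt`, and the choice to pass the weak learner's judgment as a boolean are all hypothetical and are not taken from the paper.

```python
from typing import Optional

# Hypothetical explanation text, written for illustration only.
PATHOLOGY_EXPLANATIONS = {
    "pleural effusion": (
        "Pleural effusion is an abnormal build-up of fluid "
        "between the lung and the chest wall."
    ),
}


def build_prompt(pathology: str, weak_learner_positive: Optional[bool] = None) -> str:
    """Assemble a VQA prompt for one queried pathology.

    Strategy 1: prepend an explanation of the pathology, if one is available.
    Strategy 2: if a weak learner's judgment is supplied, state it in text
    as a reference for the model (assumed wording).
    """
    parts = []
    explanation = PATHOLOGY_EXPLANATIONS.get(pathology)
    if explanation:
        parts.append(explanation)
    if weak_learner_positive is not None:
        verdict = "present" if weak_learner_positive else "absent"
        parts.append(
            f"A reference classifier judges that {pathology} is {verdict} "
            "in this image. Treat this as a hint, not as ground truth."
        )
    parts.append(f"Is there {pathology} in this image? Answer yes or no.")
    return " ".join(parts)


explanation_only = build_prompt("pleural effusion")
with_weak_learner = build_prompt("pleural effusion", weak_learner_positive=False)
```

Keeping the weak learner's output as text in the prompt means the MLVLM itself is not retrained; only the small classifier needs fine-tuning, which is what makes the approach cheap.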

Paper Structure (28 sections, 1 equation, 3 figures, 11 tables)

Introduction
Related Work
LVLMs and VQA
Hallucination in LVLM VQA
Causes of LVLM VQA Hallucination
Mitigation of LVLM VQA Hallucination
Assessment of LVLM Hallucination
VQA in MLVLMs
Medical Image Classification via VLMs
Methodology
Prompting With Detailed Explanations
Prompting With Detailed Explanations and Weak Learners
Empirical Study
Datasets
Implementation Details
...and 13 more sections

Figures (3)

Figure 1: The structure of common LVLMs.
Figure 2: An example of including pathology explanations when prompting an MLVLM for medical VQA.
Figure 3: An example of prompting an MLVLM for medical VQA using both pathology explanations and reference predictions from a weak learner.
