Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

TL;DR

This paper addresses hallucinations in radiology vision-language models by applying discrete semantic entropy (DSE) to detect uncertainty in generated answers. It generates 15 responses per image–question pair from GPT‑4o and GPT‑4.1, clusters semantically equivalent outputs, and computes $DSE(x) = -\sum_{C_i} P(C_i|x) \log_{10} P(C_i|x)$ to decide whether to answer or reject a prompt. DSE-based filtering significantly improves accuracy on the remaining questions (e.g., up to 76.3% for GPT‑4o at $DSE \le 0.3$ and 63.8% for GPT‑4.1) across two datasets, with most gains surviving a Bonferroni-corrected significance threshold. The approach is model-agnostic and suitable for black-box VLM deployments, offering a practical uncertainty signal to bolster safety and trust in clinical AI, though it does not guarantee truth and requires broader prospective validation.
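As a minimal sketch of the formula above, the snippet below estimates $P(C_i|x)$ as the fraction of sampled responses that fall into cluster $C_i$ and then applies the answer-or-reject rule. The function names and the integer cluster labels are assumptions made for illustration, not the authors' code. The sum is written as $\sum p \log_{10}(1/p)$, which is algebraically identical to $-\sum p \log_{10} p$.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def discrete_semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Return DSE = -sum_i P(C_i|x) * log10 P(C_i|x).

    P(C_i|x) is estimated as the relative frequency of cluster C_i among
    the sampled responses (one cluster label per response).
    """
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("at least one response is required")
    counts = Counter(cluster_ids)
    # p * log10(1/p) equals -p * log10(p); this form avoids returning -0.0.
    return sum((c / n) * math.log10(n / c) for c in counts.values())


def should_answer(cluster_ids: Sequence[Hashable], threshold: float = 0.3) -> bool:
    """Answer only when the entropy does not exceed the chosen threshold."""
    return discrete_semantic_entropy(cluster_ids) <= threshold


# 15 responses that all mean the same thing: entropy is zero, so answer.
unanimous = [0] * 15
# 15 responses split evenly over three meanings: entropy is log10(3), so reject.
three_way_split = [0] * 5 + [1] * 5 + [2] * 5

print(discrete_semantic_entropy(unanimous), should_answer(unanimous))
print(discrete_semantic_entropy(three_way_split), should_answer(three_way_split))
```

With 15 samples, the entropy ranges from 0 (one cluster) to $\log_{10} 15 \approx 1.18$ (every response in its own cluster).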

Abstract

This study aimed to determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
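The grouping of meaning-equivalent responses can be illustrated with a short sketch. It assumes a greedy procedure in which each new answer is compared against one representative per existing cluster and joins the first cluster where entailment holds in both directions; the abstract does not spell out the exact procedure, so this is one plausible reading. The `entails` callable is hypothetical: in the study the check is performed by a language model, whereas `toy_entails` below is only a string-matching stand-in that keeps the example runnable.

```python
from typing import Callable, List


def cluster_by_bidirectional_entailment(
    answers: List[str],
    entails: Callable[[str, str], bool],
) -> List[int]:
    """Assign a cluster label to each answer.

    Two answers share a cluster when each entails the other. Each new
    answer is compared with the first member (representative) of every
    existing cluster; if none matches, it starts a new cluster.
    """
    representatives: List[str] = []
    labels: List[int] = []
    for answer in answers:
        for idx, rep in enumerate(representatives):
            if entails(rep, answer) and entails(answer, rep):
                labels.append(idx)
                break
        else:
            # No existing cluster is equivalent in both directions.
            representatives.append(answer)
            labels.append(len(representatives) - 1)
    return labels


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder entailment check (assumption): normalized string equality.

    A real deployment would replace this with a model-based judgment.
    """
    def normalize(text: str) -> str:
        return text.strip().lower().rstrip(".")

    return normalize(premise) == normalize(hypothesis)


sampled_answers = ["MRI", "mri.", "CT", "Mri", "ct"]
labels = cluster_by_bidirectional_entailment(sampled_answers, toy_entails)
print(labels)
```

The resulting labels are exactly the input needed to compute relative cluster frequencies, and from them the DSE.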

Paper Structure

This paper contains 4 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Schematic for discrete semantic entropy (DSE) implementation within clinical radiology workflows. A question asked by a clinician is posed 15 times to the vision-language model (VLM). The same model then clusters the answers based on semantic similarity. Based on the results of clustering, the DSE is calculated, and the question is answered or rejected by the VLM depending on the DSE threshold.
  • Figure 2: Examples where GPT-4o provides correct answers (left) and incorrect answers (right) on the VQA-Med 2019 dataset. If a discrete semantic entropy (DSE) threshold of 0.3 had been applied, the middle left question (DSE = 0.5) and the top and bottom questions on the right would have been filtered out. The middle question on the right illustrates a case where DSE fails to identify a hallucinated response, as the model confidently provided an incorrect answer ("Noncontrast MRI").
  • Figure 3: Sankey diagrams depicting the distribution of accepted and rejected answers across vision–language models (GPT-4o, GPT-4.1) and discrete semantic entropy (DSE) thresholds ($\leq$ 0.6 and $\leq$ 0.3). Each flow originates from question subcategories within the VQA-Med 2019 and RadDataset cohorts and terminates in one of four outcome groups: accepted true, accepted false, rejected true, or rejected false. The proportion of true versus false answers varies substantially across subcategories. Tightening the DSE threshold from 0.6 to 0.3 increases the number of rejected questions and correspondingly improves the accuracy of retained answers by filtering semantically inconsistent responses.
  • Figure 4: Relative accuracy improvement versus reduction in the number of answered questions for GPT-4o (left) and GPT-4.1 (right) after applying discrete semantic entropy (DSE) thresholds of 0.6 and 0.3. Each point represents a subcategory within the evaluated datasets (e.g., organ, abnormality, modality, imaging plane). Stricter thresholds (DSE $\leq$ 0.3) generally yield higher accuracy improvements at the cost of answering fewer questions, illustrating the trade-off between reliability and coverage across subcategories.
  • Figure 5: Accuracy–coverage trade-off under discrete semantic entropy (DSE) filtering. Curves show the absolute accuracy gain (y-axis) versus the fraction of rejected questions (x-axis) as the DSE threshold (numeric labels next to points) is varied. For each setting, questions with DSE above the threshold are discarded. Lower thresholds (stricter filtering) reject more questions but yield larger accuracy improvements. A threshold of 1.2 corresponds to no filtering.
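The trade-off summarized in the Figure 5 caption can be sketched as a simple threshold sweep. The records below are synthetic values invented for illustration, not data from the paper, and the sketch simplifies the evaluation by using a single correctness flag per question. Note also that with 15 samples the largest possible DSE is $\log_{10} 15 \approx 1.18$, which is one way to see why a threshold of 1.2 rejects nothing.

```python
import math
from typing import List, Sequence, Tuple


def accuracy_coverage(
    records: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, rejected fraction, accuracy gain).

    Each record is (dse, is_correct). Questions with DSE above the
    threshold are discarded; the gain is measured against the accuracy
    over all questions.
    """
    total = len(records)
    baseline = sum(correct for _, correct in records) / total
    rows = []
    for threshold in thresholds:
        kept = [correct for dse, correct in records if dse <= threshold]
        rejected_fraction = 1 - len(kept) / total
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        rows.append((threshold, rejected_fraction, accuracy - baseline))
    return rows


# Synthetic (DSE, correct?) pairs; illustration only.
synthetic_records = [
    (0.0, True), (0.0, True), (0.1, True), (0.25, False),
    (0.4, True), (0.5, False), (0.8, False), (1.1, False),
]

rows = accuracy_coverage(synthetic_records, thresholds=[1.2, 0.6, 0.3])
for threshold, rejected, gain in rows:
    print(f"DSE <= {threshold}: rejected {rejected:.0%}, accuracy gain {gain:+.3f}")

# The maximum attainable DSE for 15 responses stays below 1.2.
print(math.log10(15))
```

On this toy data, tightening the threshold rejects more questions while raising accuracy on the retained ones, which is the qualitative pattern the caption describes.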