Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation

Chenyu Wang, Weichao Zhou, Shantanu Ghosh, Kayhan Batmanghelich, Wenchao Li

TL;DR

This work tackles hallucinations in radiology report generation by introducing a plug-and-play semantic-consistency-based uncertainty quantification (SCUQ) framework that requires no model modification. It provides both report-level and sentence-level uncertainty estimates by sampling multiple outputs, parsing semantic content with RadGraph, and evaluating consistency via a GREEN-based factuality metric. The method improves factual accuracy by abstaining from high-uncertainty reports and flags high-uncertainty sentences for radiologist review. On MIMIC-CXR with the RaDialog model, abstaining from 20% of reports yields a 10% gain in factuality, and sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9% success rate. Open-source and model-agnostic, SCUQ aligns well with factuality across models and metrics, offering practical guidance for clinically reliable radiology report generation and potential integration into generation workflows.

Abstract

Radiology report generation (RRG) has shown great potential in assisting radiologists by automating the labor-intensive task of report writing. While recent advancements have improved the quality and coherence of generated reports, ensuring their factual correctness remains a critical challenge. Although generative medical Vision Large Language Models (VLLMs) have been proposed to address this issue, these models are prone to hallucinations and can produce inaccurate diagnostic information. To address these concerns, we introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Unlike existing approaches, our method does not require modifications to the underlying model or access to its inner state, such as output token logits, thus serving as a plug-and-play module that can be seamlessly integrated with state-of-the-art models. Extensive experiments demonstrate the efficacy of our method in detecting hallucinations and enhancing the factual accuracy of automatically generated radiology reports. By abstaining from high-uncertainty reports, our approach improves factuality scores by 10%, achieved by rejecting 20% of reports using the RaDialog model on the MIMIC-CXR dataset. Furthermore, sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9% success rate. Our implementation is open-source and available at https://github.com/BU-DEPEND-Lab/SCUQ-RRG.
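
The abstention step described in the abstract can be illustrated with a minimal sketch. It assumes each report already has an uncertainty score and a factuality score; the function name, the `reject_fraction` parameter, and the toy numbers are hypothetical and are not taken from the authors' released code.

```python
from typing import Sequence


def mean_factuality_after_abstention(
    uncertainties: Sequence[float],
    factuality: Sequence[float],
    reject_fraction: float = 0.2,
) -> float:
    """Reject the most uncertain reports and average factuality over the rest.

    Hypothetical helper for illustration only.
    """
    if len(uncertainties) != len(factuality):
        raise ValueError("one uncertainty and one factuality score per report")
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")
    n = len(uncertainties)
    if n == 0:
        raise ValueError("at least one report is required")

    # Number of reports to abstain from (rounded down).
    n_reject = int(n * reject_fraction)

    # Sort report indices from least to most uncertain and keep the first part.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    kept = order[: n - n_reject]

    return sum(factuality[i] for i in kept) / len(kept)


if __name__ == "__main__":
    # Toy scores, not results from the paper.
    u = [0.1, 0.9, 0.2, 0.3, 0.4]
    f = [0.8, 0.2, 0.7, 0.9, 0.6]
    print("no abstention:", mean_factuality_after_abstention(u, f, 0.0))
    print("reject 20%:   ", mean_factuality_after_abstention(u, f, 0.2))
```

If uncertainty tracks factual errors, the average over the kept reports rises as the rejection fraction grows; a random rejection baseline should leave it roughly unchanged.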

Paper Structure

This paper contains 29 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Pipeline of proposed Uncertainty Quantification Framework. Given an X-ray image $x_{i}$, the LLM generates an original report $\hat{r}_i$ and sampled reports $\{\tilde{r}_i^t\}_{t=1}^T$. These reports are first processed by a semantic parser $g$, which extracts entity-label pairs for each sentence in $\hat{r}_i$. The uncertainty quantification module evaluates semantic consistency at both the report and sentence levels, providing a comprehensive, layered view of uncertainty for the generated report.
  • Figure 2: Effect of Report Abstention on Factuality Score across UQ for the RaDialog model. The percentages in boxes represent the improvement (only the top two are visualized) in factuality score after abstention, relative to the initial performance without abstention.
  • Figure 3: Effect of VRO-GREEN Guided Abstention on Prior References and Substrings for the RaDialog model. Solid lines represent VRO-GREEN Guided Abstention, with dashed red lines as the baseline performing random abstention.
  • Figure 4: Two separate analyses of report- and sentence-level UQ in radiology report generation using MIMIC-CXR data. (a) The report-level UQ study assigns an uncertainty score to the entire report. (b) The sentence-level UQ study ranks individual sentences by uncertainty, with red (1.0) indicating high uncertainty, orange (0.75) indicating moderate uncertainty, and green (0.47) indicating low uncertainty. This color-coded ranking helps inform radiologists on which sentences may require closer attention.
  • Figure 5: Effect of Report Abstention on Factuality Score across UQ for the CheXpertPlus_mimiccxr model. The percentages in boxes represent the improvement in factuality score after abstention, relative to the initial performance without abstention. We assume API-only access to the model, so only lexical similarity is compared in the figure.
  • ...and 3 more figures
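
The pipeline in the Figure 1 caption (original report, sampled reports, semantic parsing into entity-label pairs, consistency scoring) can be sketched roughly as follows. This is a toy illustration under loud assumptions: the entity-label pairs are written by hand rather than produced by RadGraph, and a plain Jaccard overlap stands in for the GREEN-based consistency metric. All function names are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

# An (entity, label) pair, e.g. ("effusion", "present"). Hand-written here;
# in the paper these would come from the semantic parser g.
Pair = Tuple[str, str]
Pairs = FrozenSet[Pair]


def jaccard(a: Pairs, b: Pairs) -> float:
    """Set overlap used as a stand-in similarity (assumption, not GREEN)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def report_uncertainty(original: Pairs, samples: Sequence[Pairs]) -> float:
    """One minus the mean similarity between the original and sampled reports."""
    if not samples:
        raise ValueError("at least one sampled report is required")
    mean_sim = sum(jaccard(original, s) for s in samples) / len(samples)
    return 1.0 - mean_sim


def sentence_uncertainty(sentence: Pairs, samples: Sequence[Pairs]) -> float:
    """One minus the mean fraction of a sentence's pairs found in each sample."""
    if not sentence or not samples:
        raise ValueError("sentence pairs and samples must be non-empty")
    support = sum(len(sentence & s) / len(sentence) for s in samples)
    return 1.0 - support / len(samples)


if __name__ == "__main__":
    original = frozenset({("effusion", "present"), ("pneumothorax", "absent")})
    samples = [
        frozenset({("effusion", "present"), ("pneumothorax", "absent")}),
        frozenset({("effusion", "present")}),
        frozenset({("effusion", "absent"), ("pneumothorax", "absent")}),
    ]
    print("report-level uncertainty:", report_uncertainty(original, samples))
    for pair in sorted(original):
        s = frozenset({pair})
        print(pair, "sentence-level uncertainty:", sentence_uncertainty(s, samples))
```

Because the scores depend only on generated text, a sketch like this needs no access to model internals such as token logits, which matches the plug-and-play setting the paper describes.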