Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports

Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil

TL;DR

This work tackles Chest X-Ray Visual Question Answering by introducing the RG-AG pipeline, which generates radiology reports and uses them to ground the subsequent answer generator. The method combines visual anatomical tokens, a longitudinal projection module, and a Transformer-based language model (~68M parameters) to handle both single-image and image-difference questions. Grounding the answer with predicted reports yields state-of-the-art results on the Medical-Diff-VQA dataset, with the strongest gains for difference questions and open-ended queries, and is shown to benefit from high-quality expert reports. The findings highlight the value of incorporating radiology-style evidence in VQA while preserving the necessity of visual information, and point to future work in broader clinical data grounding and scalable single-model approaches.

Abstract

We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
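The two-step control flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `rg` and `ag` are hypothetical callables standing in for the trained Report Generator and Answer Generator, the prompt wording is invented, and the names `Report`, `generate_report`, and `answer_question` do not come from the paper. The sketch follows the figure captions in generating the finding and impression sections independently and in passing the question together with the predicted report to the answer step.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical stand-ins: an "image" is whatever the visual encoder accepts,
# and a generator maps (images, text prompt) to generated text.
Image = object
TextGenerator = Callable[[Sequence[Image], str], str]


@dataclass
class Report:
    finding: str
    impression: str

    def as_text(self) -> str:
        # Assumed serialisation of the predicted report for the AG input.
        return f"Findings: {self.finding} Impression: {self.impression}"


def generate_report(rg: TextGenerator, images: Sequence[Image], indication: str) -> Report:
    # Step i) Report Generation: each section is requested independently
    # with its own instruction (the prompt wording here is an assumption).
    finding = rg(images, f"Indication: {indication} Generate the finding section.")
    impression = rg(images, f"Indication: {indication} Generate the impression section.")
    return Report(finding=finding, impression=impression)


def answer_question(
    rg: TextGenerator,
    ag: TextGenerator,
    question: str,
    current: Image,
    prior: Optional[Image] = None,
    indication: str = "",
) -> str:
    # Single-image questions use one CXR; image-difference questions use
    # two CXRs of the same patient taken at different time points.
    images = [current] if prior is None else [prior, current]
    report = generate_report(rg, images, indication)
    # Step ii) Answer Generation: the question is concatenated with the
    # predicted report, and the images are still provided to the model.
    return ag(images, f"{question} {report.as_text()}")
```

Passing `prior=None` exercises the single-image path, while supplying both images exercises the image-difference path; the same two callables serve both question types, mirroring the unified method described in the abstract.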

Paper Structure

This paper contains 23 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the Report Generator–Answer Generator (RG-AG) pipeline: (1) the Report Generator first produces a radiology report based on the given Chest X-ray (or a pair of images in the case of a follow-up study), along with the instruction and the indication field. The report consists of the 'finding' and 'impression' sections, which are generated independently based on the specific instructions received by the RG module. (2) The Answer Generator then utilises this predicted report as additional contextual information to generate a more accurate and interpretable response to the input question. Red tags denote clinical input data, green tags indicate the prompts provided to each module, and yellow tags represent the output generated by each module.
  • Figure 2: The model architecture of the Report Generator and Answer Generator. On the left, the Visual Anatomical Token Extractor extracts the visual tokens from CXRs; this component is trained independently of the Vision-Language Model. On the right, the Vision-Language Model generates the radiology report or the answer. The diagram shows how visual inputs (i.e., anatomical tokens $A_{tok}$) are aligned, concatenated and projected into a joint representation via the Longitudinal Projection Module. This representation is then combined with tokenised text inputs, which the language model processes to generate the target text. For the Report Generator, the input text is an instruction requesting the generation of a specific report section (finding or impression), with the target text being that section. For the Answer Generator, the input text is the concatenation of the question and the predicted report (finding + impression), and the target text is the answer.
  • Figure 3: We compare the accuracy of our proposed RG-AG model with the baseline AG model (which does not include the predicted CXR radiology report as input) for each question type, except for the difference questions. We highlight the difference in accuracy ($\Delta$) for each question type.
  • Figure 4: We compare the quality of our predicted answers without the predicted CXR radiology report (AG model) and with it (our RG-AG model). For each question (Q), we highlight the correct parts of the answer (A) in green and the errors in red. Similarly, in the predicted radiology reports (R), segments containing correct information relevant to the question are shown in green.
  • Figure 5: We present borderline and failure cases of our RG-AG model, with explanatory comments in the rightmost column to describe the associated errors. For each question (Q), we highlight the correct parts of the answer (A) in green and the errors in red. Similarly, in the predicted radiology reports (R), segments containing correct information relevant to the question are shown in green, and segments inconsistent with the ground truth answer are shown in red.
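The per-question-type comparison described in the Figure 3 caption (accuracy of RG-AG versus the AG baseline, with the difference $\Delta$ for each type) can be sketched as follows. This is a minimal sketch under assumptions: accuracy is taken here as case-insensitive exact match, which may differ from the paper's scoring, and the function names, question-type labels, and toy records are placeholders invented for illustration, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute accuracy per question type.

    records: iterable of (question_type, predicted_answer, reference_answer).
    Assumption: an answer counts as correct on case-insensitive exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, reference in records:
        total[qtype] += 1
        correct[qtype] += int(predicted.strip().lower() == reference.strip().lower())
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def accuracy_delta(baseline, grounded):
    # Delta = grounded (RG-AG) accuracy minus baseline (AG) accuracy,
    # reported only for question types present in both.
    return {q: grounded[q] - baseline[q] for q in baseline if q in grounded}


# Toy records, made up purely to exercise the functions.
ag_records = [
    ("presence", "yes", "yes"),
    ("presence", "no", "yes"),
    ("location", "left lung", "right lung"),
]
rg_ag_records = [
    ("presence", "yes", "yes"),
    ("presence", "Yes ", "yes"),
    ("location", "right lung", "right lung"),
]

delta = accuracy_delta(accuracy_by_type(ag_records), accuracy_by_type(rg_ag_records))
```

A positive entry in `delta` means that grounding with the predicted report helped for that question type under this scoring assumption.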