Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports
Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil
TL;DR
This work tackles Chest X-Ray Visual Question Answering by introducing the RG-AG pipeline, which generates radiology reports and uses them to ground the subsequent answer generator. The method combines visual anatomical tokens, a longitudinal projection module, and a Transformer-based language model (~$68$M parameters) to handle both single-image and image-difference questions. Grounding the answer with predicted reports yields state-of-the-art results on the Medical-Diff-VQA dataset, with the strongest gains for difference questions and open-ended queries, and is shown to benefit from high-quality expert reports. The findings highlight the value of incorporating radiology-style evidence in VQA while preserving the necessity of visual information, and point to future work in broader clinical data grounding and scalable single-model approaches.
Abstract
We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling it to detect and describe temporal changes. Taking inspiration from Chain-of-Thought reasoning, we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that providing predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
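The two-step flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `VQASample`, `answer_question`, `toy_report_generator`, and `toy_answer_generator` are hypothetical, the two models are replaced by string-returning stubs, and the ordering of current and prior images is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Image = List[float]  # stand-in for pixel data (assumption for this sketch)


@dataclass
class VQASample:
    """One question with its image(s); hypothetical container."""
    question: str
    current_image: Image
    # Only set for image-difference questions (earlier scan of the same patient).
    prior_image: Optional[Image] = None


# Hypothetical interfaces for the two steps.
ReportGenerator = Callable[[List[Image]], str]
AnswerGenerator = Callable[[List[Image], str, str], str]


def answer_question(
    sample: VQASample,
    report_generator: ReportGenerator,
    answer_generator: AnswerGenerator,
) -> str:
    """Run step i) report generation, then step ii) answer generation."""
    images = [sample.current_image]
    if sample.prior_image is not None:
        images.append(sample.prior_image)

    # Step i) RG: predict a report for the same image(s).
    report = report_generator(images)

    # Step ii) AG: the answer generator still receives the images,
    # with the predicted report supplied as additional evidence.
    return answer_generator(images, sample.question, report)


# Toy stubs so the sketch runs end to end; real models would replace these.
def toy_report_generator(images: List[Image]) -> str:
    return f"report over {len(images)} image(s)"


def toy_answer_generator(images: List[Image], question: str, report: str) -> str:
    return f"answer to '{question}' grounded on [{report}]"
```

The same function serves both question types: a single-image sample leaves `prior_image` unset, while an image-difference sample supplies both scans. Passing the images to the answer step as well as the report reflects the point that the report supplements, rather than replaces, the visual input.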
