Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays

Yeongjae Cho; Taehee Kim; Heejun Shin; Sungzoon Cho; Dongmyung Shin

TL;DR

The paper tackles difference visual question answering (diff-VQA) in longitudinal chest X-rays, where a model must reason about changes between two images of the same patient taken at different times. It proposes PLURAL, a Transformer-based vision-language model augmented with an input branch for a past image. PLURAL is pretrained first on natural image-text data and then on longitudinal chest X-ray data, and is finally finetuned on diff-VQA data. In experiments on MIMIC-CXR and MIMIC-Diff-VQA, PLURAL achieves state-of-the-art results for diff-VQA and also improves conventional VQA on single images, and ablations support the value of temporal inputs and report sections. The work indicates that pretraining on domain-specific longitudinal data, combined with architectural modifications for temporal reasoning, improves performance on a task that mirrors how radiologists compare serial images.

Abstract

Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images. This task is particularly important in reading chest X-ray images because, in clinical practice, radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in severity. However, previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance the model's performance using a pretrained vision-language model (VLM). Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed step by step: it is first pretrained on natural images and texts and then trained on longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, along with question-answer sets and radiologists' reports that describe the changes in lung abnormalities and diseases over time. Our experimental results show that the PLURAL model outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving the model's performance.
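
The staged setup described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Stage` record, the `build_input` and `inputs_for` helpers, the marker tokens `<cur>`, `<past>`, and `<txt>`, and the choice to fuse the two images by simple concatenation are all assumptions made for illustration. Only the three stage names and the fact that the past-image branch is used from the second stage onward come from the paper's outline and figure captions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical record describing one training stage."""
    name: str
    data: str
    uses_past_image: bool


# Stage names follow the paper outline; the record layout is an assumption.
STAGES = [
    Stage("stage 1: pretraining", "natural images and texts", False),
    Stage("stage 2: pretraining", "longitudinal chest X-ray data", True),
    Stage("stage 3: finetuning", "difference VQA data", True),
]


def build_input(current: List[str], instruction: List[str],
                past: Optional[List[str]] = None) -> List[str]:
    """Assemble one input sequence for the model.

    Plain concatenation with marker tokens is an assumption; the source
    only states that a past-image input branch is added.
    """
    seq = ["<cur>"] + current
    if past is not None:
        seq += ["<past>"] + past
    return seq + ["<txt>"] + instruction


def inputs_for(stage: Stage, current: List[str], instruction: List[str],
               past: Optional[List[str]] = None) -> List[str]:
    # Stage 1 has no past-image branch, so a supplied past image is ignored.
    return build_input(current, instruction,
                       past if stage.uses_past_image else None)


if __name__ == "__main__":
    for stage in STAGES:
        seq = inputs_for(stage, ["c0", "c1"], ["what", "changed", "?"],
                         past=["p0", "p1"])
        print(stage.name, "|", stage.data, "|", seq)
```

In this sketch the same sequence builder serves all three stages, so the only difference between single-image and longitudinal training is whether the past-image segment is present.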

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 4 tables.

Introduction
Method
Vision-Language Model Architecture
Stage 1: pretraining using natural images and texts
Stage 2: pretraining using longitudinal chest X-ray data
Stage 3: finetuning using difference VQA data
Longitudinal Chest X-ray Datasets
MIMIC-CXR
MIMIC-Diff-VQA
Experiments
Difference Visual Question Answering
Ablation Study
Non-difference Visual Question Answering
Conclusion
Error Analysis and Radiologist Evaluation
...and 2 more sections

Figures (5)

Figure 1: The training process of the PLURAL model. In the first stage, we adopted a Transformer-based network that takes a single image and an instruction as input (see blue and green boxes). In the second and third stages, we added a new input branch for a past image (see red boxes) to utilize longitudinal chest X-ray data, including chest X-ray reports and difference VQA data.
Figure 2: Proposed VLM architecture used in the second and third training stages of PLURAL. We modified a basic Transformer architecture by adding a new input branch for a past image. This enables the model to take two longitudinal chest X-ray images as input at the same time.
Figure 3: Selected test cases to qualitatively compare the outputs of EKAID and PLURAL in diff-VQA. We found that PLURAL was better at capturing longitudinal changes in multiple lung findings (a), (b) and changes in the severity level of abnormalities, such as worsening of pleural effusion in the lower right part of the lung (c).
Figure 4: Radiologist evaluation of error cases: (a) GT and PLURAL with equal correctness scores; (b) Lower PLURAL score due to reversed severity order.
Figure 5: Selected test cases to qualitatively compare the outputs of EKAID and PLURAL in non-difference VQA. Non-difference VQA contains various categories of QA sets such as type (a), level (b), and location (c).
