A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

Zhihong Chen; Maya Varma; Justin Xu; Magdalini Paschali; Dave Van Veen; Andrew Johnston; Alaa Youssef; Louis Blankemeier; Christian Bluethgen; Stephan Altmayer; Jeya Maria Jose Valanarasu; Mohamed Siddig Eltayeb Muneer; Eduardo Pontes Reis; Joseph Paul Cohen; Cameron Olsen; Tanishq Mathew Abraham; Emily B. Tsai; Christopher F. Beaulieu; Jenia Jitsev; Sergios Gatidis; Jean-Benoit Delbrouck; Akshay S. Chaudhari; Curtis P. Langlotz

TL;DR

Radiologists face a high-volume chest X-ray (CXR) interpretation burden. The paper introduces CheXagent, a vision-language foundation model (FM) trained on CheXinstruct, a large-scale multi-task CXR dataset, and evaluates it on the CheXbench benchmark. CheXagent performs strongly across image perception, image-text reasoning, and text generation tasks, outperforming several baselines and some proprietary models. In a clinical reader study with eight radiologists, residents saved 36% of their time when using CheXagent-drafted reports, with no loss of report quality, which underscores the practical potential of integrating FMs into routine radiology workflows. The work establishes CheXinstruct and CheXbench as resources for advancing robust, generalizable CXR AI systems and points toward future autonomous clinical copilots.

Abstract

Over 1.4 billion chest X-rays (CXRs) are performed annually due to their cost-effectiveness as an initial diagnostic test. This scale of radiological studies provides a significant opportunity to streamline CXR interpretation and documentation. While foundation models are a promising solution, the lack of publicly available large-scale datasets and benchmarks inhibits their iterative development and real-world evaluation. To overcome these challenges, we constructed a large-scale dataset (CheXinstruct), which we utilized to train a vision-language foundation model (CheXagent). We systematically demonstrated competitive performance across eight distinct task types on our novel evaluation benchmark (CheXbench). Beyond technical validation, we assessed the real-world utility of CheXagent in directly drafting radiology reports. Our clinical assessment with eight radiologists revealed a 36% time saving for residents using CheXagent-drafted reports, while attending radiologists showed no significant time difference editing resident-drafted or CheXagent-drafted reports. The CheXagent-drafted reports improved the writing efficiency of both radiology residents and attending radiologists in 81% and 61% of cases, respectively, without loss of quality. Overall, we demonstrate that CheXagent can effectively perform a variety of CXR interpretation tasks and holds potential to assist radiologists in routine clinical workflows.

Paper Structure (6 sections, 8 figures, 1 table)

Task Collection.
Source Dataset Collection.
CheXinstruct Compilation.
Image Perception.
Image-Text Reasoning.
Text Generation.

Figures (8)

Figure 1: Curation of CheXinstruct. a, Identification of CXR interpretation tasks. We defined 35 tasks that users are likely to perform with CXR FMs. b, Source dataset collection. To create training data samples for each of our defined tasks, we collected 32 public datasets. c, Data engineering. We performed both manual quality control and automated data engineering to preprocess collected source data. d, CheXinstruct compilation. We used the preprocessed datasets to generate training samples for each of our 35 defined tasks. e, Overview of CheXinstruct with data statistics.
Figure 1: Qualitative analysis of three cases from the reader study. Blue text represents accurate findings in CheXagent-drafted reports, red text represents false predictions in CheXagent-drafted reports, and green text represents findings missed by CheXagent. a, An example case where a radiologist found the CheXagent-drafted report to improve both interpretation and writing efficiencies. Here, CheXagent identified all four devices in the CXR study, enabling the radiologist to efficiently generate the final report. b, An example case where a radiologist found the CheXagent-drafted report to improve writing efficiency. Here, CheXagent accurately predicts the majority of the findings, and the radiologist reorganized and edited the report in his preferred style. c, An example case where a radiologist found the CheXagent-drafted report to not improve efficiency. Here, CheXagent missed a finding (left pleural effusion) in the CXR study.
Figure 2: Training and evaluating CheXagent. a, To develop CheXagent, we first trained a language model on clinical text. b, We then trained an image encoder to learn useful visual representations of imaging findings by leveraging paired text. c, This procedure enabled the visual encoder to capture semantic meaning with respect to key findings within its latent representation space. d, Finally, we jointly trained the image encoder and language model on data triplets from CheXinstruct, providing CheXagent with the capability to respond to user instructions. e, We constructed eight evaluation tasks to assess image perception, reasoning, and text generation capabilities.
Figure 2: Technical evaluation on more FMs. We compared CheXagent with BLIP-2 [li2023blip2], InstructBLIP [dai2023instructblip], MedFlamingo [moor2023med], and XrayGPT [thawkar2023xraygpt]. a, Performance of FMs on view classification. Bar graphs show mean accuracy with 95% confidence intervals. b, Performance of FMs on disease identification with three subtasks. Bar graphs show mean accuracy with 95% confidence intervals. c, Performance of FMs on visual question answering. The bar graph shows mean accuracy with 95% confidence intervals. d, Performance of FMs on fine-grained reasoning. Bar graphs show mean accuracy with 95% confidence intervals.
Figure 3: Technical evaluation on image perception tasks. a, Performance of FMs on view classification. Bar graphs show mean accuracy with 95% confidence intervals. Confusion matrices compare predictions of CheXagent and GPT-4V. b, Performance of FMs on disease identification with three subtasks. Bar graphs show mean accuracy with 95% confidence intervals. Evaluations on OpenI, which was unseen during CheXagent training, assess generalization capabilities. c, Performance of FMs on temporal classification. The bar graph shows mean accuracy with 95% confidence intervals. We provide one example of a prediction generated by CheXagent on the temporal classification task.
...and 3 more figures
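
Several captions above report mean accuracy with 95% confidence intervals. The excerpt does not say how those intervals were computed, so the following is only a minimal sketch of one common approach, a percentile bootstrap over per-example correctness. The function name `bootstrap_accuracy_ci`, its parameters, and the sample data are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` is a list of 0/1 flags, one per evaluated example.
    This is an illustrative assumption, not the paper's stated method.
    """
    if not correct:
        raise ValueError("correct must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    point = sum(correct) / n

    # Resample the examples with replacement and recompute accuracy each time.
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()

    # Read the interval bounds off the sorted bootstrap distribution.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_idx], stats[hi_idx]


# Hypothetical outcomes (1 = correct prediction); not data from the paper.
outcomes = [1] * 80 + [0] * 20
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

A percentile bootstrap makes no distributional assumption about the accuracy estimate, which is why it is a frequent choice for benchmark bar charts; a normal-approximation or Wilson interval would be an equally plausible reading of the captions.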
