
A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Abstract

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that link visual evidence and radiographic findings to those predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding, and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and improving the efficiency of both report writing and CXR interpretation. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability, and clinical utility in AI-assisted CXR interpretation.
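
Several of the figure captions below quantify visual grounding with intersection-over-union (IoU) and mean IoU (mIoU). As a minimal, self-contained sketch of that metric, assuming boxes in (x1, y1, x2, y2) format and a one-to-one pairing of predictions with annotations; these are illustrative conventions, not the paper's exact evaluation protocol:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """mIoU over matched (prediction, ground-truth) pairs."""
    scores = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: a predicted box vs. an expert annotation.
print(iou((10, 20, 110, 120), (30, 40, 130, 140)))  # ~0.47
```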


Figures (15)

  • Figure 1: Training data. a, Construction of the CheXinstruct-v2 dataset from 30 public datasets, covering 36 CXR interpretation tasks and 10.2 million instruction samples. b, Generation of the CheXReason dataset, comprising over 4.5 million LLM-generated reasoning traces. c, Illustration of a training data example. d, Overview of the training data.
  • Figure 2: Training and inference workflow. a, Initial instruction tuning. A pre-trained VLM undergoes instruction tuning using the CheXinstruct-v2 and CheXReason datasets to establish foundational CXR interpretation and reasoning capabilities. b, Reasoning enhancement via reinforcement learning. The model's reasoning logic is further refined using Group Relative Policy Optimization (GRPO), guided by task-specific reward functions (a toy sketch of the group-relative advantage follows this figure list). c, Multi-task zero-shot inference. CheXOne is evaluated across 17 subtasks within four categories. Performance is quantified using domain-specific metrics: accuracy for VQA, 1/RadCliQ for report generation, IoU for visual grounding, and specialized scores for factuality ($S_f$) and self-consistency ($S_{sc}$).
  • Figure 3: Technical evaluation of VQA across eight radiological skills, where bar graphs show mean accuracy with 95% confidence intervals. a, Performance of presence assessment on the ReXVQA dataset. b, Performance of anatomical localization on the ReXVQA dataset. c, Performance of negation detection on the ReXVQA dataset. d, Performance of differential diagnosis on the ReXVQA dataset. e, Performance of geometric reasoning on the ReXVQA dataset. f, Performance of view classification on the MIMIC-CXR dataset. g, Performance of temporal classification on the Chest ImaGenome dataset. h, Performance of long-tail disease identification on the MIMIC-CXR Long-tail dataset. These diseases were excluded from explicit training, serving as an out-of-distribution (OOD) task to evaluate model generalization.
  • Figure 4: Technical evaluation on report generation. a, Findings generation performance on the public ReXRank benchmark, evaluated over ReXGradient, MIMIC-CXR, CheXpert Plus, and IU Xray datasets. Notably, the IU Xray dataset was not included in the training data and therefore serves to assess generalization to an unseen data distribution. b, Progression generation performance evaluated on the MIMIC-CXR dataset, where models are asked to generate the Findings section with comparison to a previous study.
  • Figure 5: Technical evaluation of visual grounding tasks. Performance is quantified using mean intersection-over-union (mIoU) and mean average precision (mAP) with $95\%$ confidence intervals (CIs). Qualitative examples compare CheXOne's predicted bounding boxes with expert-annotated ground truth. a, Phrase Grounding. Evaluation conducted on the MS-CXR dataset. b, Abnormality Grounding. Evaluation conducted on the VinDr-CXR dataset.
  • ...and 10 more figures
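
Figure 2b states that reasoning is refined with Group Relative Policy Optimization (GRPO) under task-specific rewards. The sketch below illustrates only the group-relative advantage normalization that gives GRPO its name, assuming a scalar reward per sampled response; the paper's actual reward functions, rollout sizes, and policy-update details are not reproduced here.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize per-response rewards within one group of rollouts.

    GRPO scores each sampled response relative to the group mean and
    standard deviation, rather than against a learned value baseline.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four rollouts for one CXR question, each scored
# by a task-specific reward (e.g., answer correctness for VQA).
rewards = [1.0, 0.0, 1.0, 0.5]
print(group_relative_advantages(rewards))
```

Because advantages are computed within each group of rollouts for the same prompt, GRPO avoids the separately trained value network (critic) that PPO-style methods require, which is one reason it is a common choice for reward-based fine-tuning of language models.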