Reasoning Visual Language Model for Chest X-Ray Analysis

Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu

TL;DR

This work addresses the opacity of vision-language models in medical imaging by introducing a reasoning-first approach for chest X-ray analysis. It combines radiologist-style supervised fine-tuning with GRPO reinforcement learning guided by verifiable, set-level rewards over chest X-ray abnormalities to produce explicit, auditable chain-of-thought reasoning alongside structured impressions. The method yields high-quality reasoning traces that improve clinician trust and efficiency in reader studies while delivering competitive multi-label performance on out-of-distribution data. By releasing NV-Reason-CXR-3B and training code, the authors promote trustworthy, explainable AI that supports safer human–AI collaboration in radiology.

Abstract

Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
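
To make the idea of a verifiable, set-level reward concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact reward formula, so the Jaccard overlap used here is an assumption, as are the helper names `parse_findings`, `set_reward`, and `group_relative_advantages` and the comma-separated label format. The last function illustrates the group-relative normalization commonly associated with GRPO (reward minus group mean, divided by group standard deviation); the use of the population standard deviation and the `eps` value are also assumptions.

```python
import re
from statistics import mean, pstdev


def parse_findings(text):
    """Hypothetical parser: split a findings string on commas, semicolons,
    or newlines and normalize each label to lowercase."""
    return {t.strip().lower() for t in re.split(r"[,;\n]", text) if t.strip()}


def set_reward(predicted, reference):
    """Assumed set-level reward: Jaccard overlap between the predicted and
    reference abnormality sets. Two empty sets (agreement on a normal study)
    score 1.0."""
    if not predicted and not reference:
        return 1.0
    return len(predicted & reference) / len(predicted | reference)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = parse_findings("Cardiomegaly, Pleural Effusion")
    # Three hypothetical sampled impressions for the same image.
    samples = [
        "cardiomegaly, pleural effusion",
        "cardiomegaly",
        "pneumothorax",
    ]
    rewards = [set_reward(parse_findings(s), reference) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward is computed from the final label set alone, it can be checked automatically against reference labels, which is what makes it verifiable; a different overlap measure (for example set-level F1) would fit the same interface.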

Paper Structure

This paper contains 44 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Results (mean/std) of the Accuracy and Reasoning Quality survey. Expert radiologists were asked to write a report given the AI's full reasoning output. Likert scale: 1 - Strongly Disagree, 2 - Disagree, 3 - Neither Agree Nor Disagree, 4 - Agree, 5 - Strongly Agree. The AI-assisted results received high average Likert scores in the Accuracy and Reasoning Quality evaluation.
  • Figure 2: Average time spent interpreting a chest X-ray and submitting a report. When the AI's reasoning text and a pre-populated structured report were provided, we observed substantial time savings, especially for abnormal cases.
  • Figure 3: The Time & Efficiency survey (mean/std) was evaluated under two AI-assisted scenarios: (i) Full Reasoning (complete rationale plus findings) and (ii) Labels-only (findings list without explanatory text). Likert scale: 1 - Strongly Disagree, 2 - Disagree, 3 - Neither Agree Nor Disagree, 4 - Agree, 5 - Strongly Agree. In nearly all items, the Full Reasoning condition achieved high Likert scores, indicating substantial time savings when writing chest X-ray reports. By contrast, the Labels-only condition scored low, with negligible benefit for time savings.
  • Figure 4: The Trust & Confidence survey (mean/std) was evaluated under two AI-assisted scenarios: (i) Full Reasoning (complete rationale plus findings) and (ii) Labels-only (findings list without explanatory text). Likert scale: 1 - Strongly Disagree, 2 - Disagree, 3 - Neither Agree Nor Disagree, 4 - Agree, 5 - Strongly Agree. In nearly all items, the Full Reasoning condition achieved high Likert scores, indicating a high level of trust in, and perceived benefit from, the AI-assisted results.