Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning

Jinquan Guan, Qi Chen, Lizhou Liang, Yuhang Liu, Vu Minh Hieu Phan, Minh-Son To, Jian Chen, Yutong Xie

TL;DR

This work introduces CXRTrek, a large-scale multi-stage benchmark that models radiologist-like clinical reasoning for chest X-ray interpretation, and CXRTrekNet, a vision-language large model tailored to follow eight sequential diagnostic stages. The dataset (428,966 CXRs with over 11 million QA pairs) enables fine-grained supervision and stage-wise evaluation, while the model integrates frozen encoders with an autoregressive, context-aware LLM and parameter-efficient fine-tuning. Across the CXRTrek benchmark and five external datasets, CXRTrekNet achieves superior performance and demonstrates strong generalization across classification, detection, VQA, and report generation tasks. The work emphasizes the importance of explicit clinical reasoning flow for interpretability and reliability, and it provides a path toward more trustworthy AI assistants in radiology.

Abstract

Artificial intelligence (AI)-based chest X-ray (CXR) interpretation assistants have demonstrated significant progress and are increasingly being applied in clinical settings. However, contemporary medical AI models often adhere to a simplistic input-to-output paradigm, directly processing an image and an instruction to generate a result, where the instructions may be integral to the model's architecture. This approach overlooks the modeling of the inherent diagnostic reasoning in chest X-ray interpretation. Such reasoning is typically sequential: each interpretive stage considers the images, the current task, and the contextual information from previous stages. This oversight leads to several shortcomings, including misalignment with clinical scenarios, contextless reasoning, and untraceable errors. To fill this gap, we construct CXRTrek, a new multi-stage visual question answering (VQA) dataset for CXR interpretation. It is the first dataset designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings. CXRTrek covers 8 sequential diagnostic stages, comprising 428,966 samples and over 11 million question-answer (Q&A) pairs, with an average of 26.29 Q&A pairs per sample. Building on the CXRTrek dataset, we propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the VLLM framework. CXRTrekNet effectively models the dependencies between diagnostic stages and captures reasoning patterns within the radiological context. Trained on our dataset, the model consistently outperforms existing medical VLLMs on the CXRTrek benchmarks and demonstrates superior generalization across multiple tasks on five diverse external datasets. The dataset and model can be found in our repository (https://github.com/guanjinquan/CXRTrek).

Paper Structure

This paper contains 62 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The clinical reasoning flow in CXRTrek, expressed as a multi-stage Q&A sequence. Each Q&A pair represents a task for a specific stage of clinical CXR interpretation, and each stage may involve multiple Q&A pairs to cover its designated responsibilities thoroughly. The stages progress from image validation to report summarization, simulating the sequential reasoning of radiologists.
  • Figure 2: Frequency analysis of word-pairs and task distributions in the training set of CXRTrek. (a) Instruction word-pairs generated using GPT-4 (Achiam et al., 2023) highlight dominant terms in questions (e.g., "identify"). (b) Response word-pairs show common diagnostic observations (e.g., "present", "opacity"). (c) Distribution of the number of Q&A pairs across 8 stages. (d)-(f) Distribution of Q&A pairs per sample, question/answer length statistics, and proportions of the four response formats in CXRTrek. Normalized frequency is the item count divided by the total count across all items.
  • Figure 3: Overview of CXRTrekNet. It takes as input one or more chest X-ray images and a sequence of clinically guided questions. At each stage, it generates an answer by encoding the images through a vision encoder and the Q&A history through a text encoder. These features are fused by a fine-tuned LLM to generate the current answer. The output is then appended to the context to inform subsequent stages, enabling progressive multi-stage reasoning that mimics radiologist workflows.
  • Figure 4: Two radar charts comparing stage-wise performance. Stage scores average corresponding metrics across Open-Ended, Close-Ended, Choice, and Detection questions.
  • Figure A: Distribution of training samples, illustrating the proportional contribution of each dataset to the training mixture.
  • ...and 4 more figures
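
The stage-wise loop described in the Figure 3 caption (answer the current question from the images plus the Q&A history, then append the answer to the context) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: `run_stages`, `ReasoningTrace`, `AnswerFn`, and `echo_model` are hypothetical names, and the stub model stands in for the vision encoder, text encoder, and fine-tuned LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature for a model call:
# (images, current question, Q&A history so far) -> answer text.
AnswerFn = Callable[[Sequence[str], str, Sequence[Tuple[str, str]]], str]


@dataclass
class ReasoningTrace:
    """Accumulated (question, answer) pairs across all stages."""
    history: List[Tuple[str, str]] = field(default_factory=list)


def run_stages(
    images: Sequence[str],
    staged_questions: Sequence[Sequence[str]],
    answer_fn: AnswerFn,
) -> ReasoningTrace:
    """Answer questions stage by stage, feeding earlier Q&A pairs forward."""
    trace = ReasoningTrace()
    for questions in staged_questions:      # one entry per diagnostic stage
        for question in questions:          # a stage may hold several questions
            # The model sees a snapshot of everything answered so far.
            answer = answer_fn(images, question, tuple(trace.history))
            # Appending the new pair makes it context for later questions.
            trace.history.append((question, answer))
    return trace


def echo_model(images, question, history):
    """Stub model (assumption): reports how much context it received."""
    return f"answer#{len(history)} to '{question}' using {len(images)} image(s)"


if __name__ == "__main__":
    demo = run_stages(
        images=["cxr_frontal.png"],  # placeholder path, not a real file
        staged_questions=[["Is the image valid?"], ["Question A", "Question B"]],
        answer_fn=echo_model,
    )
    for q, a in demo.history:
        print(q, "->", a)
```

Keeping the full Q&A history in one trace also makes errors traceable to the stage where they first appear, which is the motivation the abstract gives for modeling the reasoning flow explicitly.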