Toward Clinically Explainable AI for Medical Diagnosis: A Foundation Model with Human-Compatible Reasoning via Reinforcement Learning

Qika Lin, Yifan Zhu, Bin Pu, Ling Huang, Haoran Luo, Jingying Ma, Feng Wu, Kai He, Jiaxing Xu, Zhen Peng, Tianzhe Zhao, Fangzhi Xu, Jian Zhang, Zhonghong Ou, Erik Cambria, Swapnil Mishra, Mengling Feng

TL;DR

DeepMedix-R1 is a foundation model for chest X-ray interpretation that generates accurate diagnoses together with a transparent, step-by-step reasoning process grounded in specific visual evidence; a clinical expert review supports its interpretability and clinical utility.

Abstract

The clinical adoption of artificial intelligence (AI) in medical diagnostics is critically hampered by its black-box nature, which prevents clinicians from verifying the rationale behind automated decisions. To overcome this fundamental barrier, we introduce DeepMedix-R1, a foundation model (FM) for chest X-ray (CXR) interpretation that generates not only accurate diagnoses but also a transparent, step-by-step reasoning process grounded in specific visual evidence. Our methodology employs a sequential training strategy, beginning with instruction fine-tuning, followed by a cold-start phase to elicit reasoning capabilities. Critically, we then implement reinforcement learning with grounded rewards to meticulously refine the model, aligning both its diagnostic outputs and its reasoning pathways with clinical plausibility. Quantitative assessments show that DeepMedix-R1 substantially outperforms advanced FMs, achieving improvements in report generation and visual question answering tasks. We also introduce Report Arena, a novel LLM-based benchmark that ranks DeepMedix-R1 first among competing models for output quality. Most significantly, a formal review by clinical experts reveals a profound preference for DeepMedix-R1's generated reasoning over the broadly adopted Qwen2.5-VL-7B model, confirming its superior interpretability and clinical utility.

Paper Structure

This paper contains 24 sections, 7 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: The illustration of a step-by-step reasoning process grounded in specific visual regions for diagnostic decision-making.
  • Figure 2: Overall experimental results. (a), (b), and (c) compare models on report generation, VQA, and the external CXR14 dataset, respectively, showing that DeepMedix-R1 outperforms strong open-source FMs in both general and medical domains.
  • Figure 3: Overall results for report generation tasks, representing the weighted average performance across all datasets.
  • Figure 4: Detailed CheXbert-F1 scores on the top-10 supported observations for MIMIC-CXR findings and OPEN-I findings, respectively. "EC" is short for "Enlarged Cardiomediastinum".
  • Figure 5: The results of Report Arena. (a) and (b) present the numerical matrices for pairwise model comparisons and corresponding win rates in our Report Arena evaluation framework, respectively. (c) comparatively visualizes each model's performance across both conventional automated metrics and Report Arena rankings, with circle sizes proportionally representing the number of model parameters.
  • ...and 8 more figures
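
The Figure 5 caption mentions a matrix of pairwise model comparisons and the win rates derived from it. As a rough illustration of how such a matrix can be turned into win rates and a ranking, here is a minimal sketch. The model names, the counts, and the function `win_rates` are hypothetical, and ties are ignored; the paper's actual Report Arena protocol may differ.

```python
from typing import Dict, List


def win_rates(wins: List[List[int]], names: List[str]) -> Dict[str, float]:
    """Compute each model's overall win rate from a pairwise win-count matrix.

    wins[i][j] is the number of pairwise comparisons in which model i was
    judged better than model j (an assumed layout, not taken from the paper).
    """
    rates: Dict[str, float] = {}
    n = len(names)
    for i, name in enumerate(names):
        won = sum(wins[i][j] for j in range(n) if j != i)
        lost = sum(wins[j][i] for j in range(n) if j != i)
        total = won + lost
        # Guard against a model that took part in no comparisons.
        rates[name] = won / total if total else 0.0
    return rates


# Hypothetical counts for three placeholder models.
names = ["model_a", "model_b", "model_c"]
wins = [
    [0, 7, 9],  # model_a beat model_b 7 times and model_c 9 times
    [3, 0, 6],
    [1, 4, 0],
]

rates = win_rates(wins, names)
ranking = sorted(names, key=rates.get, reverse=True)
print(rates)
print(ranking)
```

A plain win rate is the simplest aggregate; arena-style benchmarks sometimes use rating systems such as Elo instead, which this sketch does not attempt.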