Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

Zilin Lu, Ruifeng Yuan, Weiwei Cao, Wanxing Chang, Zhongyu Wei, Sinuo Wang, Yong Xia, Ling Zhang, Jianpeng Zhang

TL;DR

This paper revisits RL in terms of data efficiency and optimization effectiveness for R2G tasks, and introduces Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal.

Abstract

Radiologists highly desire fully automated AI for radiology report generation (R2G), yet existing approaches fall short in clinical utility. Reinforcement learning (RL) holds potential to address these shortcomings, but its adoption in this task remains underexplored. In this paper, we revisit RL in terms of data efficiency and optimization effectiveness for R2G tasks. First, we explore the impact of data quantity and quality on the performance of RL in medical contexts, revealing that data quality plays a more critical role than quantity. Building on this finding, we propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. Second, we observe that the majority of tokens in radiology reports are template-like and diagnostically uninformative, whereas clinically critical tokens are infrequent and therefore risk being overlooked during optimization. To tackle this, we introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal. Unlike standard RL approaches that treat all tokens equally, DiTPO explicitly models the varying importance of different tokens through rule- or gradient-based mechanisms to prioritize clinically relevant content. Extensive experiments on the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets demonstrate that our framework achieves state-of-the-art (SOTA) performance while requiring substantially fewer training samples in RL. Notably, on MIMIC-CXR, our framework attains an F1 score of 0.516 using only 20% of the RL training samples.
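
To make the reward signal concrete, the following is a minimal sketch of a diagnostic F1 score computed over sets of positive findings. It is an illustration under stated assumptions, not the authors' implementation: the function name `diagnostic_f1`, the label strings, and the convention of returning 1.0 when both reports contain no findings are all hypothetical, and the step that extracts labels from report text (typically an automatic labeler) is assumed to have already run.

```python
from typing import Iterable, Set


def diagnostic_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 overlap between two sets of positive diagnostic labels (hypothetical helper)."""
    pred: Set[str] = set(predicted)
    ref: Set[str] = set(reference)
    if not pred and not ref:
        # Assumption: two finding-free reports count as perfect agreement.
        return 1.0
    true_pos = len(pred & ref)
    if true_pos == 0:
        # No shared findings: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; they are not taken from the paper's data.
reference_labels = {"cardiomegaly", "pleural effusion"}
generated_labels = {"cardiomegaly", "edema"}
reward = diagnostic_f1(generated_labels, reference_labels)
```

In this example one of the two generated findings is correct and one of the two reference findings is recovered, so precision and recall are both 0.5 and the reward is 0.5. A reward of this kind depends only on the diagnostic content of a report, not on how closely its wording matches the reference.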

Paper Structure

This paper contains 17 sections, 12 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: (a) Diagnostic F1 scores of GRPO and DEER with different proportions of RL training data. The results indicate that with only about 20% of the RL training data, the models' performance is comparable to that achieved with 100% of the data. (b) Comparison between uniform GRPO and our DiTPO. GRPO assigns equal advantages to all tokens regardless of their clinical importance, while DiTPO assigns significantly higher advantages to diagnostically critical tokens.
  • Figure 2: The DEER framework consists of three stages: (1) SFT for cold‑start initialization to provide foundational R2G capabilities; (2) Data Selection via DDSampling to retain high‑quality diverse samples; and (3) DiTPO, where rule‑based (TF‑IDF) and gradient‑based token weighting are integrated to produce diagnosis‑aware token‑level advantages for policy optimization.
  • Figure 3: Diagnostic token weighting case study. Sentences describing the same clinical finding in both the reference and generated reports are highlighted with matching colors. The generated report uses red color intensity to visualize token-level weights.
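
The contrast drawn in Figures 1(b) and 2 between uniform and token-weighted advantages can be sketched as follows. This is a rough illustration of the rule-based idea only, under several assumptions: a smoothed IDF score stands in for the TF-IDF weighting named in the caption, the weights are rescaled to have mean 1, and the gradient-based weighting is not shown. The function names, the toy corpus, and the reward values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def idf_weights(tokens: Sequence[str], corpus: Sequence[Sequence[str]]) -> List[float]:
    """Smoothed IDF per token, rescaled to mean 1 (the rescaling is an assumption)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    raw = [math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in tokens]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def token_advantages(advantage: float, weights: Sequence[float]) -> List[float]:
    """Spread one sequence-level advantage over tokens in proportion to their weights."""
    return [advantage * w for w in weights]


# Toy corpus of tokenized reports; "no" is template-like, "pleural" is rare.
corpus = [
    ["no", "acute", "findings"],
    ["no", "pleural", "effusion"],
    ["no", "acute", "cardiomegaly"],
]
report = ["no", "pleural", "effusion"]

advantages = group_relative_advantages([0.8, 0.2, 0.5])  # illustrative rewards
weights = idf_weights(report, corpus)

uniform = [advantages[0]] * len(report)             # every token gets the same value
weighted = token_advantages(advantages[0], weights)  # rare tokens get a larger share
```

Because the weights average to 1, the total advantage assigned to the report is unchanged; only its distribution across tokens shifts toward the rarer terms such as "pleural" and "effusion" and away from the template-like "no".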