EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation

Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min

TL;DR

EditGRPO tackles the clinical fidelity gap in radiology report generation with a mixed-policy reinforcement learning framework that supplements on-policy rollouts with sentence-level post-rollout edits derived from gold-standard reports. It builds on GRPO by using an unnormalized advantage and a RaTE-NER-based editing rule that inserts minimal, similarity-guided corrections while keeping the edited rollouts close to the current policy. Training follows a two-stage regime: SFT for domain adaptation, then RL with the composite reward $R = \mathrm{RadGraph\text{-}F1} + \mathrm{CheXbert\text{-}Micro\text{-}F1\text{-}14} + \mathrm{RaTE}$. The method is evaluated on four chest X-ray datasets with multi-view and longitudinal data: MIMIC-CXR and ReXGradient in-domain, and IU-XRay and PadChest-GR out-of-domain. EditGRPO improves clinical metrics by 3.4% on average across the four datasets and by 5.9% on average on the out-of-domain datasets. These results indicate that sentence-level, similarity-based edits can stabilize RL training and yield more clinically useful radiology reports, offering a practical path to robust medical multimodal generation.
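The composite reward above can be sketched in a few lines. This is a minimal illustration, assuming the three metric values have already been computed by their respective scorers and each lies in [0, 1], so that the sum ranges from 0 to 3. The class and function names are hypothetical and not taken from the paper; the scorers themselves are not called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalScores:
    """Pre-computed metric values for one generated report (assumed in [0, 1])."""
    radgraph_f1: float
    chexbert_micro_f1_14: float
    rate: float


def composite_reward(scores: ClinicalScores) -> float:
    """Unweighted sum R = RadGraph-F1 + CheXbert-Micro-F1-14 + RaTE."""
    for name, value in vars(scores).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return scores.radgraph_f1 + scores.chexbert_micro_f1_14 + scores.rate


# Hypothetical metric values for a single rollout.
reward = composite_reward(ClinicalScores(0.5, 0.25, 0.25))
```

Because the terms are summed without weights, a report that does well on only one metric can earn at most a third of the maximum reward.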

Abstract

Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize report generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting detailed sentence-level corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and the sample-efficiency issues typically encountered in RL. Applied to Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
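To make the mixed-policy idea concrete, the sketch below computes group-relative advantages for one prompt whose group contains both on-policy rollouts and an edited report. It reads "unnormalized advantage" as reward minus the group mean with no division by the group's standard deviation (the Dr. GRPO-style variant); that reading, the shared baseline for edited and unedited rollouts, and the reward values are all assumptions made for illustration, not details confirmed by the text above.

```python
from statistics import fmean
from typing import List, Sequence


def unnormalized_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-relative advantage: reward minus group mean, no std division."""
    if not rewards:
        raise ValueError("need at least one rollout reward")
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rewards for one prompt: three on-policy rollouts followed by
# one post-rollout-edited report (assumed here to share the same baseline).
on_policy = [1.0, 1.5, 2.0]
edited = [2.5]
advantages = unnormalized_advantages(on_policy + edited)
```

In this toy group the edited report earns the largest positive advantage, which is how an off-policy correction could steer the update even when no on-policy rollout scores well.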

Paper Structure

This paper contains 30 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Diagram of the post-rollout-edit technique used in the proposed EditGRPO algorithm. For each rollout, the generated response $o$ is edited at the sentence level based on the gold-standard (reference) report $\mathbf{y}$. Incorrect, or false positive (FP), sentences $x$ are replaced: for example, if the reference contains "cardiomegaly" but the generated report states "the heart is within normal limits," the incorrect sentence is replaced. Missing findings, or false negatives (FN), can also be added from the reference report.
  • Figure 2: Performance (%) of different training strategies on two small-scale datasets: IU-XRay and PadChest-GR.
  • Figure 3: Influence of reward design on the IU-XRay dataset under the SFT + Dr.GRPO setting. RG denotes the RadGraph reward, and IF denotes the inverse-frequency reward, which assigns higher scores when a rare condition is hit, according to the label distribution across the 14 CheXpert classes in the training data.
  • Figure 4: Reward gains (RadGraph + RaTE + CheXbert-Micro-F1-14; the maximum is 3) over training steps on MIMIC-CXR.
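The sentence-level editing described in the Figure 1 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's editing rule: a toy token-overlap (Jaccard) similarity stands in for the RaTE-NER-based matching, and the names `post_rollout_edit`, `match_threshold`, and `max_edits` are hypothetical. A reference sentence whose closest generated sentence is similar but not identical replaces it (a correction); a reference sentence with no close match is appended (a missing finding).

```python
import re
from typing import List, Set


def split_sentences(report: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]


def _tokens(sentence: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def token_jaccard(a: str, b: str) -> float:
    """Toy lexical similarity; a stand-in for RaTE-NER-based matching."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def post_rollout_edit(rollout: str, reference: str,
                      match_threshold: float = 0.5, max_edits: int = 1) -> str:
    """Apply up to `max_edits` sentence-level edits toward the reference."""
    edited = split_sentences(rollout)
    edits = 0
    for ref in split_sentences(reference):
        if edits >= max_edits:
            break
        scores = [token_jaccard(ref, s) for s in edited]
        best = max(range(len(edited)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= match_threshold:
            if edited[best] != ref:      # similar but wrong: replace it
                edited[best] = ref
                edits += 1
        else:                            # no close match: add missing finding
            edited.append(ref)
            edits += 1
    return " ".join(edited)


rollout = "The lungs are clear. No pleural effusion is seen."
reference = ("The lungs are clear. Small left pleural effusion is seen. "
             "Mild cardiomegaly.")
one_edit = post_rollout_edit(rollout, reference, max_edits=1)
two_edits = post_rollout_edit(rollout, reference, max_edits=2)
```

Capping the number of edits keeps the edited report close to what the policy actually generated. A purely lexical measure like this one would miss the cardiomegaly example in the caption, where the contradicting sentences share no words, so it only stands in for entity-aware matching.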