EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation
Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min
TL;DR
EditGRPO tackles the clinical fidelity gap in radiology report generation with a mixed-policy reinforcement learning framework that supplements on-policy rollouts with sentence-level post-rollout edits derived from gold-standard reports. It builds on GRPO by using an unnormalized advantage and a RaTE-NER–based editing rule that inserts minimal, similarity-guided corrections while keeping the edited rollouts close to the current policy. Training follows a two-stage regime (SFT for domain adaptation, then RL with a composite reward $R = \mathrm{RadGraph\text{-}F1} + \mathrm{CheXbert\text{-}Micro\text{-}F1\text{-}14} + \mathrm{RaTE}$), and the method is evaluated on four chest X-ray datasets with multi-view and longitudinal data. Across MIMIC-CXR, ReXGradient, and the out-of-domain datasets IU-XRay and PadChest-GR, EditGRPO achieves an average improvement of 3.4% on clinical metrics and an average gain of 5.9% in out-of-domain settings, demonstrating enhanced clinical efficacy and generalization. These results suggest that sentence-level, similarity-based edits can stabilize RL training and produce more clinically useful radiology reports, offering a practical path to robust medical multimodal generation.
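To make the reward and advantage described above concrete, here is a minimal Python sketch. It assumes that "unnormalized advantage" means subtracting the group-mean reward without dividing by the group standard deviation, which this summary does not spell out. The function names and the metric scores are hypothetical placeholders; in practice the three scores would come from the RadGraph, CheXbert, and RaTE scorers.

```python
from statistics import mean


def composite_reward(radgraph_f1: float, chexbert_micro_f1_14: float, rate: float) -> float:
    """R = RadGraph-F1 + CheXbert-Micro-F1-14 + RaTE (plain sum of the three scores)."""
    return radgraph_f1 + chexbert_micro_f1_14 + rate


def unnormalized_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: reward minus the group mean.

    Assumption: "unnormalized" means no division by the group standard deviation.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical (RadGraph-F1, CheXbert-Micro-F1-14, RaTE) scores for three
# rollouts generated for the same study.
group_scores = [(0.30, 0.50, 0.40), (0.20, 0.40, 0.30), (0.40, 0.60, 0.50)]
rewards = [composite_reward(*scores) for scores in group_scores]
advantages = unnormalized_advantages(rewards)
print(rewards, advantages)
```

Under this reading, rollouts that score above the group mean receive positive advantages and the others negative ones, and the advantages within a group sum to zero.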
Abstract
Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting detailed sentence-level corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
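The idea of injecting a sentence-level correction into a rollout can be illustrated with a rough Python sketch. This is not the paper's editing rule: the RaTE-NER-based matching is replaced here by a generic string similarity from the standard library's `difflib`, and the function names, the naive period-based sentence splitter, and the example reports are all hypothetical. The sketch replaces exactly one rollout sentence with its most similar reference sentence, so the edited report stays close to what the policy generated.

```python
from difflib import SequenceMatcher


def split_sentences(report: str) -> list[str]:
    """Naive splitter on periods (breaks on decimals such as '3.4 cm')."""
    return [s.strip() for s in report.split(".") if s.strip()]


def edit_rollout(rollout: str, reference: str) -> str:
    """Replace one rollout sentence with its most similar reference sentence.

    Stand-in assumption: character-level similarity instead of RaTE-NER matching.
    """
    roll = split_sentences(rollout)
    ref = split_sentences(reference)
    best = None  # (similarity, rollout index, replacement sentence)
    for i, sent in enumerate(roll):
        if sent in ref:
            continue  # already agrees with the reference
        for cand in ref:
            if cand in roll:
                continue  # reference sentence already present in the rollout
            sim = SequenceMatcher(None, sent.lower(), cand.lower()).ratio()
            if best is None or sim > best[0]:
                best = (sim, i, cand)
    if best is None:
        return rollout  # nothing to correct
    roll[best[1]] = best[2]
    return ". ".join(roll) + "."


# Hypothetical rollout with a laterality error, and its reference report.
rollout = "Heart size is normal. Small right pleural effusion. No pneumothorax."
reference = "Heart size is normal. Small left pleural effusion. No pneumothorax."
edited = edit_rollout(rollout, reference)
print(edited)
```

In a mixed-policy setup of the kind described above, an edited report like this would be scored alongside the unedited rollouts, so that the group contains at least one higher-reward sample near the current policy.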
