OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Zhuoxiao Chen, Hongyang Yu, Ying Xu, Yadan Luo, Long Duong, Yuan-Fang Li

Abstract

Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm of multi-stage training over large paired corpora with oversized backbones, which makes pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting a new SOTA on the CheXpert Plus dataset (0.341 F1) with 2 to 3 orders of magnitude less training data, using a small base VLM on modest hardware.
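The idea of routing failed GRPO explorations to preference supervision can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names (`group_advantages`, `is_failed_group`, `preference_loss`, `choose_update`) are hypothetical, a "failed" group is taken to mean one in which every sampled report earned zero reward, and the oracle step is stood in for by a simplified DPO-like logistic loss that prefers the ground-truth report over each sample (reference-policy terms are omitted).

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_failed_group(rewards):
    """Assumed failure criterion: every sampled report earned zero reward."""
    return all(r == 0 for r in rewards)


def preference_loss(logp_preferred, logp_rejected, beta=0.1):
    """Simplified DPO-like loss, -log(sigmoid(beta * margin)).

    Assumption: a stand-in for the oracle step; reference-policy
    log-probabilities used in full DPO are left out for brevity.
    """
    margin = beta * (logp_preferred - logp_rejected)
    return math.log1p(math.exp(-margin))


def choose_update(rewards, logp_ground_truth, logp_samples):
    """Pick the training signal for one group of sampled reports.

    Returns ("oracle", mean preference loss) for a failed group,
    otherwise ("grpo", list of group-relative advantages).
    """
    if is_failed_group(rewards):
        losses = [preference_loss(logp_ground_truth, lp) for lp in logp_samples]
        return "oracle", mean(losses)
    return "grpo", group_advantages(rewards)


if __name__ == "__main__":
    # All-zero rewards: the advantages vanish, so the fallback is used.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))
    print(choose_update([0.0, 0.0, 0.0, 0.0], -5.0, [-3.0, -4.0, -2.0, -6.0]))
    # Mixed rewards: ordinary group-relative advantages are returned.
    print(choose_update([0.0, 1.0, 0.5, 0.0], -5.0, [-3.0, -4.0, -2.0, -6.0]))
```

The sketch shows why a fallback is useful: when every reward in a group is zero, the group-relative advantages are all zero and the policy-gradient term carries no learning signal, whereas a preference loss against the ground-truth report still yields a nonzero loss.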

Paper Structure

This paper contains 18 sections, 15 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Illustration of mainstream data/compute-intensive pipelines (upper-left) versus our data-efficient pipeline (upper-right). Bottom: on CheXpert Plus [DBLP:journals/corr/abs-2405-19538], our method achieves SOTA performance for RRG with less than 0.1% of the training samples used by the best-performing baselines (1.27M) and a much smaller model, demonstrating strong performance under tight data and compute budgets.
  • Figure 2: Left: Cumulative proportion of zero-reward batches (reward batch mean = 0) vs. training step on CheXpert Plus [DBLP:conf/cvpr/WangWLMW0025]. OraPO suppresses zero-reward frequency faster than naïve GRPO. Centre/Right: Class-level F1 on the CheXpert Plus validation set [DBLP:conf/cvpr/WangWLMW0025] across checkpoints for two clinically challenging and rare classes: Pneumonia (2.70%) and Fracture (4.05%). OraPO learns earlier and maintains higher F1 than naïve GRPO.
  • Figure 3: An X-ray image with its corresponding ground-truth report and the report generated by our model on the CheXpert Plus dataset. Mismatched sentences in the reports are highlighted in different colors.