MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding

Guangjing Yang, Ziyuan Qin, Chaoran Zhang, Chenlin Du, Jinlin Wang, Wanran Sun, Zhenyu Zhang, Bing Ji, Qicheng Lao

Abstract

Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when applied directly to medical images. This sparsity stems from the inherent difficulty of localizing small or ambiguous regions of interest and is further exacerbated by rigid, fixed IoU-based reward schemes, leading to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically shift the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code & checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
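The scheduling mechanism the abstract describes is simple to state in code. Below is a minimal sketch, assuming a sliding window over recent rollout rewards and IoUs and a multi-condition rule that tightens the IoU criterion once the windowed mean reward $\bar{r}_k$ is high, its standard deviation $\sigma_{r,k}$ is low, and the mean IoU $\bar{m}_k$ clears the current criterion by a margin. Every name and hyperparameter here (`tau_init`, `window`, `r_min`, `sigma_max`, `iou_margin`) is a hypothetical placeholder, not the paper's configuration.

```python
import statistics
from collections import deque

class PerformanceAwareScheduler:
    """Minimal sketch of performance-aware curriculum reward scheduling.

    Tracks recent rollout statistics in a sliding window and tightens the
    IoU criterion of the localization reward once the policy looks ready.
    All hyperparameters below are illustrative placeholders.
    """

    def __init__(self, tau_init=0.1, tau_max=0.5, delta=0.05, window=64,
                 r_min=0.6, sigma_max=0.2, iou_margin=0.05):
        self.tau = tau_init            # current IoU criterion (starts loose)
        self.tau_max = tau_max         # final, strict criterion
        self.delta = delta             # tightening step size (delta_k)
        self.rewards = deque(maxlen=window)  # sliding window of rewards
        self.ious = deque(maxlen=window)     # sliding window of IoUs
        self.r_min = r_min             # required windowed mean reward
        self.sigma_max = sigma_max     # allowed windowed reward std
        self.iou_margin = iou_margin   # mean IoU must exceed tau by this

    def reward(self, iou: float) -> float:
        """Localization reward under the current (loose-to-strict) criterion."""
        return 1.0 if iou >= self.tau else 0.0

    def update(self, batch_rewards, batch_ious) -> None:
        """Record a batch; tighten tau only when all readiness conditions hold."""
        self.rewards.extend(batch_rewards)
        self.ious.extend(batch_ious)
        if len(self.rewards) < self.rewards.maxlen:
            return  # not enough history under the current criterion yet
        r_bar = statistics.fmean(self.rewards)     # mean reward   (r_bar_k)
        sigma_r = statistics.pstdev(self.rewards)  # reward std    (sigma_{r,k})
        m_bar = statistics.fmean(self.ious)        # mean IoU      (m_bar_k)
        # Multi-condition rule: rewards high and stable, IoU comfortably
        # above the current criterion -> ready for a stricter one.
        if (r_bar >= self.r_min and sigma_r <= self.sigma_max
                and m_bar >= self.tau + self.iou_margin):
            self.tau = min(self.tau + self.delta, self.tau_max)
            self.rewards.clear()  # restart tracking under the new criterion
            self.ious.clear()
```

In a GRPO loop, one would score each rollout in a group with `sched.reward(iou)` and call `sched.update(group_rewards, group_ious)` once per step. Because only the reward function changes over training, GRPO's group-relative advantage computation and gradient paths stay untouched, consistent with the abstract's claim of no auxiliary networks or additional gradient paths.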

Paper Structure

This paper contains 25 sections, 13 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Reward curve with performance-aware progressive reward scheduling (in red), showing dense values compared to the curve obtained with a fixed reward scheme (in blue). The dashed green line indicates the progression of the reward criterion.
  • Figure 2: Overview of our proposed MedLoc-R1. We propose a progressive curriculum reward scheduling strategy driven by tracked performance statistics, including the mean reward $\bar{r}_k$, the reward standard deviation $\sigma_{r,k}$, and the mean IoU $\bar{m}_k$, which assesses localization quality.
  • Figure 3: A@0.5 (%) performance across adjacent training steps on HAM10000, HEEL, and TN3K. Each subplot compares the proposed MedLoc-R1-3B model with two baselines. MedLoc-R1-3B consistently achieves higher A@0.5 and exhibits stronger gains with increasing steps, while V-Triune-3B shows moderate improvement and VLM-R1-3B remains the weakest baseline.
  • Figure 4: Qualitative comparison of our MedLoc-R1 (in red boxes) and fixed-threshold VLM-R1 (in blue boxes) on HEEL and TN3K. Ground truth in green boxes. MedLoc-R1 produces more precise boxes with coherent and semantically rich reasoning.
  • Figure 5: Ablation on Step Size Variants in Piecewise Decay on HAM10000. "Identical" refers to a fixed step size $\delta_k = \delta_0$ throughout training (a sketch of these variants follows this list).
  • ...and 4 more figures
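Figure 5's ablation contrasts schedules for the curriculum step size $\delta_k$. The sketch below illustrates what a piecewise-decay schedule versus the "Identical" (fixed) baseline could look like; the breakpoints and values are illustrative assumptions, not the paper's settings.

```python
def delta_piecewise(k: int, delta0: float = 0.10) -> float:
    """Piecewise-decayed step size: coarse early moves, finer late ones.
    Breakpoints (3, 6) and divisors are illustrative, not the paper's values."""
    if k < 3:
        return delta0        # early criterion updates: large steps
    if k < 6:
        return delta0 / 2    # mid training: smaller steps
    return delta0 / 4        # late training: finest steps

def delta_identical(k: int, delta0: float = 0.10) -> float:
    """'Identical' baseline from Figure 5: delta_k = delta_0 for every update."""
    return delta0
```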