Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang

TL;DR

This work tackles two problems in flow-based GRPO for diffusion-like image generation: sparse rewards and ignored within-trajectory dependencies. It introduces TurningPoint-GRPO (TP-GRPO), which replaces terminal rewards with step-wise increments $r_t$ and identifies turning points via sign changes in the incremental rewards; these steps are assigned aggregated long-term rewards $r_t^{\text{agg}}$ that reflect their delayed impact. A consistent turning-point variant and an initial-step long-term effect scheme further refine credit assignment, while turning-point detection remains efficient and hyperparameter-free. Empirically, TP-GRPO improves generation quality across three diverse tasks, converges faster, and performs robustly relative to Flow-GRPO, illustrating the practical value of explicitly modeling implicit interactions in flow-based RL for vision tasks.

Abstract

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
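The two ideas in the abstract, step-level incremental rewards and turning points detected from sign changes, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the list `R` of per-step trajectory rewards, the indexing `r[t] = R[t+1] - R[t]`, the choice of `R[-1] - R[t]` as the aggregated long-term reward, and the function names are all hypothetical.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def incremental_rewards(R: List[float]) -> List[float]:
    """Step-wise increments r[t] = R[t+1] - R[t] (assumed indexing)."""
    return [R[t + 1] - R[t] for t in range(len(R) - 1)]


def turning_point_rewards(R: List[float]) -> Tuple[List[float], List[int]]:
    """Replace the reward of each detected turning point with an aggregated one.

    Assumed rule for this sketch: step t is a turning point when its increment
    flips sign relative to the previous step AND agrees in sign with the
    remaining change of the trajectory, R[-1] - R[t].
    """
    r = incremental_rewards(R)
    rewards = list(r)
    turning_points = []
    for t in range(1, len(r)):
        aggregated = R[-1] - R[t]  # assumed form of the long-term reward
        flips_trend = sign(r[t]) * sign(r[t - 1]) < 0
        matches_overall = sign(r[t]) * sign(aggregated) > 0
        if flips_trend and matches_overall:
            rewards[t] = aggregated
            turning_points.append(t)
    return rewards, turning_points


# Hypothetical rewards along one trajectory: a dip, then steady improvement.
R = [5.0, 4.0, 6.0, 7.0, 9.0]
rewards, turning_points = turning_point_rewards(R)
print(rewards, turning_points)
```

Under these assumptions the detection needs only sign comparisons, so there is no threshold to tune, and every detected turning point has an increment and an aggregated reward of the same sign, which is the property stated in Lemma 3.1.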

Paper Structure

This paper contains 31 sections, 3 theorems, 40 equations, 11 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

For any turning point selected by the turning-point definition, its local reward and its aggregated long-term reward have the same sign, i.e., $r_t \cdot r_t^\text{agg} > 0$.

Figures (11)

  • Figure 1: Rewards of several sampled trajectories. Each dot at $t$ is obtained by $(10-t)$ steps of SDE sampling followed by $t$ steps of ODE sampling. The leftmost point corresponds to full ODE sampling, and the rightmost to full SDE sampling (i.e., standard Flow-GRPO outputs).
  • Figure 2: Some cases that are or are not identified as turning points. The first row shows cases that do not satisfy our turning-point definition and are optimized with $r_t$. The second row shows cases that do satisfy it and are optimized with $r_t^\text{agg}$.
  • Figure 3: Overview of our method. For each trajectory, we compute stepwise rewards as the pure incremental effect of the current SDE sampling. We then identify the turning points (shown in orange) that satisfy the turning-point definition or the remark on the first-step constraint. Next, we assign cumulative rewards to capture their implicit impact on reversing the reward trend. Finally, we apply group normalization independently at each timestep.
  • Figure 4: Training curves on three evaluation tasks. The two TP-GRPO variants differ in whether they apply the consistency constraint from the consistent turning-point definition.
  • Figure 5: Qualitative comparison across three tasks. Compositional Image Generation, Visual Text Rendering, and Human Preference Alignment, respectively, assess color/counting, text rendering, and content alignment (including aesthetics).
  • ...and 6 more figures
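
The last step in the Figure 3 caption, group normalization applied independently at each timestep, might look roughly like the sketch below. It assumes the usual GRPO-style standardization (subtract the group mean, divide by the group standard deviation) performed per timestep column; the population standard deviation, the `eps` constant, and the function name are illustrative assumptions, not details confirmed by the source.

```python
from statistics import fmean, pstdev
from typing import List


def normalize_per_timestep(
    rewards: List[List[float]], eps: float = 1e-6
) -> List[List[float]]:
    """Standardize rewards across the group separately for each timestep.

    rewards[g][t] is the (step-wise or aggregated) reward of trajectory g at
    step t. All trajectories are assumed to have the same number of steps.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for t in range(num_steps):
        column = [trajectory[t] for trajectory in rewards]
        mean = fmean(column)
        std = pstdev(column)  # population std; an assumption of this sketch
        for g, value in enumerate(column):
            # eps keeps the division finite when all rewards at step t agree
            advantages[g][t] = (value - mean) / (std + eps)
    return advantages


# Two hypothetical trajectories with two steps each.
group = [[1.0, 0.0], [3.0, 0.0]]
print(normalize_per_timestep(group))
```

Normalizing each timestep on its own means a step's reward is compared only with rewards of other trajectories at the same step, so large aggregated rewards at one step do not rescale the signal at other steps.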

Theorems & Definitions (9)

  • Definition 4.1
  • Definition 5.1
  • Remark 5.2
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof