Stepwise Credit Assignment for GRPO on Flow-Matching Models

Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh

Abstract

Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
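To make the mechanism concrete, here is a minimal sketch of how intermediate rewards can be obtained along a flow-matching trajectory. It assumes the common rectified-flow parameterization $x_t = (1-t)\,x_0 + t\,\epsilon$ with velocity target $v = \epsilon - x_0$, under which a Tweedie-style one-step estimate of the clean sample is $\hat{x}_0 = x_t - t\,v_\theta(x_t, t)$. The helper names and the `reward_fn` hook are illustrative, not the paper's API, and the authors' exact parameterization may differ.

```python
import torch


def tweedie_x0_estimate(x_t: torch.Tensor, t: float, velocity: torch.Tensor) -> torch.Tensor:
    """One-step denoised estimate of the clean sample from an intermediate state.

    Assumes the rectified-flow convention x_t = (1 - t) * x_0 + t * eps with
    learned velocity v = eps - x_0, which gives x_0 = x_t - t * v.
    (Assumption: the paper's exact parameterization may differ.)
    """
    return x_t - t * velocity


def intermediate_rewards(states, times, velocities, reward_fn):
    """Score every denoising step by rewarding its Tweedie estimate.

    states[k] and velocities[k] are the latent and predicted velocity at time
    times[k]; reward_fn is any image reward model (e.g. PickScore) applied to
    the decoded estimate. Returns one scalar reward per step.
    """
    return [reward_fn(tweedie_x0_estimate(x, t, v))
            for x, t, v in zip(states, times, velocities)]
```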

Paper Structure

This paper contains 33 sections, 34 equations, 17 figures, 3 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Stepwise credit assignment from temporal reward structure. (Left) Two trajectories for the same prompt, showing Tweedie estimates $\hat{x}_0^i(t)$ and their PickScore rewards $r_t^i$ at each denoising step. The reward curves are non-monotonic and frequently cross: trajectory 0 (blue) dips at $t{=}0.86$ before recovering, while trajectory 1 (orange) drops sharply at $t{=}0.71$, yet both reach similar final rewards (${\sim}0.90$). Uniform credit assignment would treat these trajectories nearly identically, reinforcing the poor intermediate steps along with the good ones. Stepwise-Flow-GRPO instead uses gains $g_t^i = r_{t-1}^i - r_t^i$ to penalize steps that hurt reward and credit steps that improve it, regardless of final outcome (a sketch of this gain computation follows the figure list). (Right) This finer credit assignment yields faster convergence and higher final reward compared to Flow-GRPO.
  • Figure 2: Gain magnitudes across steps. Mean absolute gain $\mathbb{E}_i[|g_t^i|]$ measured on 256 GenEval prompts using PickScore. Early steps show larger gains, indicating that compositional decisions drive most reward improvement.
  • Figure 3: Qualitative results. We compare our Stepwise-Flow-GRPO with Flow-GRPO and observe better spatial reasoning, attribute binding, and counting performance.
  • Figure 4: Sample efficiency across reward functions. Stepwise-Flow-GRPO consistently outperforms Flow-GRPO in reward per training step across all settings, achieving both faster convergence and superior final performance in 3 out of 4 settings.
  • Figure 5: Wall-clock efficiency matches sample efficiency gains. Reward versus wall-clock time for the same settings as Figure 4. Despite the additional computational cost of intermediate denoising, Stepwise-Flow-GRPO converges faster in wall-clock time, achieving visibly superior performance in 3 out of 4 settings.
  • ...and 12 more figures
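As a companion to the Figure 1 caption, the sketch below shows one plausible way to turn per-step Tweedie rewards into gain-based advantages for a GRPO group. The gain definition $g_t^i = r_{t-1}^i - r_t^i$ is taken from the caption; normalizing each step's gains across the group with a mean/std baseline mirrors how GRPO normalizes final rewards and is our assumption, not necessarily the paper's exact recipe.

```python
import torch


def gain_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Gain-based advantages from per-step rewards, GRPO-style (a sketch).

    rewards: (G, T + 1) tensor of Tweedie-estimate rewards for a group of G
    trajectories sampled from the same prompt, ordered from the noisiest state
    to the final image, so rewards[:, k + 1] is the reward *after* step k.

    Gains g_t = r_{t-1} - r_t credit each step by how much it improved the
    estimated reward. Per-step group normalization (mean/std, as GRPO does
    for final rewards) is an assumption here.
    """
    gains = rewards[:, 1:] - rewards[:, :-1]   # (G, T): reward after minus before each step
    mean = gains.mean(dim=0, keepdim=True)
    std = gains.std(dim=0, keepdim=True)
    return (gains - mean) / (std + eps)        # per-step, group-normalized advantages
```

Under this scheme, steps whose Tweedie reward drops (like the dips highlighted in Figure 1) receive negative advantage even when the final image scores well, which is exactly the failure mode uniform credit assignment cannot distinguish.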