
Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models

Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, Ran He

TL;DR

We tackle the computational bottleneck of Group Relative Policy Optimization (GRPO) in aligning generative models with human preferences by uncovering reward clustering and introducing a variance-aware pruning approach. We first demonstrate that a high-variance, OVF-selected subset can outperform larger, unfiltered groups, then solve the overhead with Pro-GRPO, a dynamic, latent-feature pruning framework that prunes during sampling and employs Expand-and-Prune to maximize trajectory diversity. The framework is validated on both diffusion-based and flow-based models, showing improved alignment metrics and substantial speedups over prior GRPO variants. These methods enable scalable, robust human-preference alignment for high-fidelity text-to-image synthesis and related generative tasks.

Abstract

Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate this trade-off through empirical studies, which yield two key observations. First, we discover the reward clustering phenomenon, in which many trajectories collapse toward the group-mean reward and therefore offer limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF) and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, because it fully samples trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. By terminating reward-clustered trajectories early, Pro-GRPO reduces computational overhead. Leveraging this efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy: it first expands the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
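
The abstract does not spell out the OVF selection rule, so the following is only a minimal sketch of variance-based filtering under one stated assumption: the kept subset is made of the lowest and highest rewards in the group, and the split between the two ends is chosen to give the largest subset variance. The function name `ovf_select` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np


def ovf_select(rewards, k):
    """Return sorted indices of a k-element subset with high reward variance.

    Assumption (not from the paper): the subset takes n_low of the lowest
    rewards and k - n_low of the highest, and every split is tried.
    """
    rewards = np.asarray(rewards, dtype=float)
    if not 0 < k <= rewards.size:
        raise ValueError("k must be between 1 and len(rewards)")

    order = np.argsort(rewards)  # indices from lowest to highest reward
    best_idx, best_var = None, -1.0
    for n_low in range(k + 1):
        n_high = k - n_low
        # n_low lowest rewards plus n_high highest rewards (never overlapping).
        idx = np.concatenate([order[:n_low], order[rewards.size - n_high:]])
        var = rewards[idx].var()
        if var > best_var:
            best_idx, best_var = idx, var
    return np.sort(best_idx)


# A clustered group: four rewards sit at the mean, two are at the extremes.
group = [0.1, 0.5, 0.5, 0.5, 0.5, 0.9]
kept = ovf_select(group, k=2)
```

In this toy group, the two kept indices are 0 and 5, the reward extremes, while the four trajectories clustered at the group mean are dropped.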

Paper Structure

This paper contains 19 sections, 20 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Reward clustering phenomenon and the effect of OVF. (a) A full group ($G=24$) exhibits pronounced reward clustering. (b) Uniform subsampling ($k=12$) preserves the clustering. (c) Our OVF ($k=12$) alleviates reward clustering by selecting from the reward extremes.
  • Figure 2: Visualization of training dynamics on PickScore. We compare the Baseline ($G=24$) against Uniform Subsampling ($k=12$) and our OVF strategy ($k=12$). (a) Reward standard deviation ($\sigma_G$). (b) PickScore Eval.
  • Figure 3: Overview of Pro-GRPO. Pro-GRPO runs a dynamic expand-and-prune schedule inside the $T$-step denoising process. We begin with an expanded group $G_{\max}$ to maximize exploration. At intermediate checkpoints $S_i$, (A) dynamic latent pruning deterministically projects each active latent to $T$ via a single-step ODE update to obtain a proxy sample and reward $\hat{R}_i$; OVF then keeps a high-variance subset and early-terminates the rest (red crosses), progressively narrowing the group to $K$ survivors at $S_T$. (B) GRPO on the pruned group computes group-normalized advantages over the $K$ survivors and updates the policy, achieving high performance at low computational cost.
  • Figure 4: Qualitative comparison between SD3.5-M, Flow-GRPO, and Pro-GRPO with PickScore as the reward on DrawBench prompts.
  • Figure 5: Training dynamics. Reward trajectories during optimization. (a) Flow-based (SD3.5, PickScore): Pro-GRPO (blue) and Pro-GRPO-Flash (green) converge faster and reach higher plateaus than Flow-GRPO (orange). (b) Diffusion-based (SD-v1.4, HPSv2.1): Pro-GRPO consistently outperforms DanceGRPO throughout training. (c) Diffusion-based (SD-v1.4, HPSv2.1 & CLIP): Pro-GRPO maintains a stable margin, indicating stronger multi-objective optimization.
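
The expand-and-prune schedule described in the Figure 3 caption can be illustrated with a toy loop. This is a rough sketch under stated assumptions, not the paper's implementation: `step_fn`, `project_fn`, and `reward_fn` are hypothetical placeholders for the denoising step, the single-step projection to a proxy sample, and the reward model; `keep_extremes` is a simple stand-in for OVF that keeps the rewards farthest from the group mean; and the schedule values are made up for illustration.

```python
import numpy as np


def keep_extremes(rewards, k):
    """Stand-in for OVF (assumption): keep the k rewards farthest from the mean."""
    rewards = np.asarray(rewards, dtype=float)
    dist = np.abs(rewards - rewards.mean())
    return np.sort(np.argsort(dist)[-k:])


def expand_and_prune(latents, step_fn, project_fn, reward_fn, num_steps, schedule):
    """Run num_steps of sampling, pruning the group at scheduled checkpoints.

    schedule maps a step index to the group size kept after that step.
    Returns surviving trajectory ids, their group-normalized advantages,
    and the number of denoising-step evaluations (projections not counted).
    """
    latents = np.asarray(latents, dtype=float)
    active = np.arange(len(latents))
    evals = 0
    for t in range(num_steps):
        latents = np.stack([step_fn(z, t) for z in latents])
        evals += len(latents)
        if t in schedule:
            # Proxy reward from a projected sample, then early termination.
            proxy = np.array([reward_fn(project_fn(z, t)) for z in latents])
            keep = keep_extremes(proxy, schedule[t])
            latents, active = latents[keep], active[keep]
    rewards = np.array([reward_fn(z) for z in latents])
    # Group-normalized advantages over the survivors only.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return active, advantages, evals


# Toy usage: 16 initial trajectories pruned to 12, then to 8 survivors.
rng = np.random.default_rng(0)
init = rng.normal(size=(16, 4))
survivors, adv, evals = expand_and_prune(
    init,
    step_fn=lambda z, t: 0.9 * z,      # placeholder "denoising" update
    project_fn=lambda z, t: z,         # placeholder single-step projection
    reward_fn=lambda z: float(z.sum()),  # placeholder reward model
    num_steps=10,
    schedule={3: 12, 6: 8},
)
```

With this toy schedule the loop performs 124 denoising-step evaluations (4 steps at 16, 3 at 12, 3 at 8) instead of the 160 needed to sample all 16 trajectories to completion, which is the kind of saving early termination is meant to provide.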