BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

TL;DR

This work tackles inefficiency and unstable credit assignment in GRPO for diffusion-based image and video generation. It introduces BranchGRPO, a tree-structured rollout with shared prefixes, dense step-level rewards via reward fusion and depth-wise normalization, and pruning strategies to reduce backpropagation cost. The main contributions are the branching rollout framework, the reward fusion mechanism, and the pruning schemes, demonstrated to yield faster convergence and higher alignment on HPDv2.1 and WanX. Scaling experiments show that larger branch sizes improve performance, and video results indicate improved temporal coherence and sharper frames.
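The saving from shared prefixes can be illustrated by counting denoiser evaluations. The sketch below is not taken from the paper: the function names, the split schedule, and the convention that a split happens right after the denoiser is evaluated at that step are all assumptions made for illustration.

```python
from typing import Iterable


def sequential_calls(num_samples: int, num_steps: int) -> int:
    """Denoiser evaluations when every sample runs its own full trajectory."""
    return num_samples * num_steps


def branched_calls(num_steps: int, split_steps: Iterable[int], branch_factor: int) -> int:
    """Denoiser evaluations for a tree rollout that starts from one trajectory.

    Assumption: at each step in split_steps the denoiser is evaluated once per
    live trajectory, and each trajectory then fans out into branch_factor children.
    """
    splits = set(split_steps)
    width = 1  # number of live trajectories at the current step
    calls = 0
    for step in range(num_steps):
        calls += width
        if step in splits:
            width *= branch_factor
    return calls


def num_leaves(split_steps: Iterable[int], branch_factor: int) -> int:
    """Number of final samples produced by the tree."""
    return branch_factor ** len(set(split_steps))
```

With a hypothetical schedule of 20 steps and binary splits at steps 0, 3, 6, and 9, the tree yields 16 leaves using 203 evaluations, whereas 16 independent trajectories need 320. The earlier the splits, the smaller the shared prefix and the smaller the saving.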

Abstract

Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants have two weaknesses. They are inefficient, because rollouts are sequential and require many sampling steps, and their credit assignment is unreliable: sparse terminal rewards are propagated uniformly across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO, while reducing per-iteration training time by nearly 55%. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, BranchGRPO achieves higher Video-Align scores, with sharper and more temporally consistent frames than DanceGRPO. Code is available at https://fredreic1849.github.io/BranchGRPO-Webpage/.
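To make the second contribution concrete, here is a minimal sketch of turning leaf rewards into per-node advantages on a rollout tree. It rests on assumptions: fusion is a plain average over children (the paper also discusses a path-weighted variant, which is not reproduced here), normalization is a z-score over all nodes at the same depth, and `Node`, `fuse_rewards`, and `depthwise_advantages` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Optional


@dataclass
class Node:
    depth: int
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # given on leaves, filled in by fusion elsewhere
    advantage: float = 0.0


def fuse_rewards(node: Node) -> float:
    """Propagate leaf rewards upward; an internal node gets the mean of its children."""
    if node.children:
        node.reward = mean([fuse_rewards(child) for child in node.children])
    if node.reward is None:
        raise ValueError("leaf node without a reward")
    return node.reward


def depthwise_advantages(root: Node, eps: float = 1e-8) -> None:
    """Standardize fused rewards among nodes at the same depth (assumed z-score)."""
    by_depth: Dict[int, List[Node]] = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        by_depth[node.depth].append(node)
        stack.extend(node.children)
    for nodes in by_depth.values():
        rewards = [n.reward for n in nodes]
        mu, sigma = mean(rewards), pstdev(rewards)
        for n in nodes:
            n.advantage = (n.reward - mu) / (sigma + eps)
```

Calling `fuse_rewards(root)` and then `depthwise_advantages(root)` gives every node, and hence every denoising segment, its own signal instead of one terminal reward shared by the whole trajectory. A depth with a single node, such as the root, receives an advantage of zero under this scheme.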

Paper Structure

This paper contains 35 sections, 3 theorems, 20 equations, 16 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

For any fixed parent $z_i$, we have

Figures (16)

  • Figure 1: Comparison of BranchGRPO and DanceGRPO. Left: Reward curves during training. BranchGRPO converges substantially faster, achieving up to a 2.2$\times$ speedup over DanceGRPO (time fraction $=1.0$) and a 1.5$\times$ speedup over DanceGRPO (time fraction $=0.6$), while ultimately surpassing both baselines. The time fraction $=0.6$ variant also exhibits pronounced instability. (Time fraction denotes the proportion of diffusion timesteps used during training.) Right: Illustration of rollout structures. GRPO relies on sequential rollouts with only final rewards, whereas BranchGRPO branches at intermediate steps and propagates dense rewards backward, enabling more efficient and stable optimization.
  • Figure 2: Comparison of sequential and branch rollouts. Left/Right: example generations from DanceGRPO and BranchGRPO, respectively. Middle: distribution of sampled images projected into 2D feature space, where red and blue dots correspond to DanceGRPO and BranchGRPO.
  • Figure 3: Left: branch rollout process. Middle: leaf rewards are fused upward. Right: depth-wise normalization and pruning yield dense advantages and reduce cost.
  • Figure 4: Qualitative comparison of generations from Flux, DanceGRPO, and our BranchGRPO.
  • Figure 5: Ablation studies of BranchGRPO. Moderate branch correlation and earlier, denser splits improve reward growth; path-weighted fusion enhances stability; depth pruning achieves the best final reward; and the hybrid ODE-SDE variant provides the fastest training speed while remaining stable.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Lemma 1: Single-step marginal preservation
  • Lemma 2: Leaf marginal preservation
  • Theorem 1: Boundary distribution invariance