TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding, Weirui Ye

TL;DR

TreeGRPO tackles the prohibitive cost of RL post-training for diffusion/flow-based visual generators by recasting denoising as a sparse, tree-structured search that reuses shared prefixes. It introduces per-edge advantages via leaf-to-root backpropagation and optimizes a GRPO objective, yielding substantial gains in sample efficiency and stability. The approach achieves a 2.4x improvement in training efficiency while delivering superior Pareto frontiers across multiple reward models, including single- and multi-reward settings. These results demonstrate a scalable, robust pathway for RL-based aesthetic and alignment tuning of large-scale visual generative models, with potential extensions to more demanding domains like video and 3D generation.

Abstract

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4$\times$ faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.
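
The credit-assignment idea in the abstract can be illustrated with a minimal sketch. It assumes that leaf rewards are normalized within the tree (GRPO-style group normalization) and that each internal edge receives the uniform mean of its children's advantages. The uniform weighting, the `Node` class, and all function names are hypothetical choices made for illustration; the paper's estimator is weighted and may differ in detail.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """One denoising state; the edge into it is one sampled denoising step."""
    reward: Optional[float] = None   # set on leaves only (final samples)
    children: List["Node"] = field(default_factory=list)
    advantage: float = 0.0           # advantage of the edge entering this node


def collect_leaves(node: Node) -> List[Node]:
    """Return all leaves below `node`, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in collect_leaves(child)]


def backpropagate(node: Node) -> float:
    """Fill in edge advantages bottom-up; leaves must already hold theirs."""
    if node.children:
        # Assumption: uniform average over children (not the paper's weighting).
        node.advantage = mean(backpropagate(child) for child in node.children)
    return node.advantage


def assign_edge_advantages(root: Node, eps: float = 1e-8) -> None:
    """Normalize leaf rewards across the tree, then push them toward the root."""
    leaves = collect_leaves(root)
    rewards = [leaf.reward for leaf in leaves]
    mu, sigma = mean(rewards), pstdev(rewards)
    for leaf in leaves:
        leaf.advantage = (leaf.reward - mu) / (sigma + eps)
    backpropagate(root)
```

In this sketch, two branches that share a prefix but lead to different rewards end up with different advantages on their first diverging edge, which is the step-specific signal that a single trajectory-level advantage cannot provide.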

Paper Structure

This paper contains 35 sections, 2 theorems, 18 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Proposition 5.1

Let $\sigma^2_{\text{env}}$ be the variance of the reward realization due to future diffusion noise. The variance of the TreeGRPO weighted estimator is less than or equal to the variance of a single-sample estimator, provided the effective sample size is greater than 1.
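
A short worked derivation shows why a bound of this kind is plausible. It is a sketch under stated assumptions, not the paper's proof: the $K$ leaf rewards $r_i$ below a node are taken to be independent with common variance $\sigma^2_{\text{env}}$, and the weights $w_i \ge 0$ sum to one. The symbols $K$, $w_i$, and $N_{\text{eff}}$ are introduced here for illustration.

$$\operatorname{Var}\Big(\sum_{i=1}^{K} w_i r_i\Big) = \sigma^2_{\text{env}} \sum_{i=1}^{K} w_i^2 = \frac{\sigma^2_{\text{env}}}{N_{\text{eff}}}, \qquad N_{\text{eff}} = \frac{1}{\sum_{i=1}^{K} w_i^2} \ge 1.$$

Under these assumptions the weighted estimator matches the single-sample variance only when all weight sits on one leaf ($N_{\text{eff}} = 1$); with uniform weights, $N_{\text{eff}} = K$ and the variance shrinks by a factor of $K$.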

Figures (2)

  • Figure 1: The proposed TreeGRPO achieves the best Pareto performance across rewards and training efficiency, where the single-GPU runtime is the normalized wall-clock time. In (a), following the normalized metrics used in RL domains (Mnih et al., 2013), the normalized reward score is calculated as $(r - r_{sd3.5}) / (r_{max} - r_{sd3.5})$, where $r_{max}$ for the HPS, ImageReward, Aesthetic, and ClipScore reward models is $\{1.0, 2.0, 10.0, 1.0\}$, respectively.
  • Figure 2: Introduction of TreeGRPO: Our framework optimizes the denoising process of diffusion/flow models by constructing search trees. Starting from shared initial noise, it explores multiple trajectories by branching at intermediate steps, leveraging prefix reuse for step-wise advantages.
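
The reward normalization described in the Figure 1 caption can be written as a small helper. This is a minimal sketch: the `R_MAX` table copies the maxima stated in the caption, while the function name, the dictionary keys, and the example baseline score are hypothetical and not taken from the paper.

```python
# Maximum score per reward model, as listed in the Figure 1 caption.
R_MAX = {"HPS": 1.0, "ImageReward": 2.0, "Aesthetic": 10.0, "ClipScore": 1.0}


def normalized_reward(r: float, r_base: float, reward_model: str) -> float:
    """Compute (r - r_base) / (r_max - r_base).

    `r_base` is the score of the untuned base model (r_sd3.5 in the caption).
    A result of 0 means no gain over the base model; 1 means the maximum score.
    """
    r_max = R_MAX[reward_model]
    if r_max == r_base:
        raise ValueError("base score equals r_max; normalization is undefined")
    return (r - r_base) / (r_max - r_base)


# Illustrative call with a made-up base score of 5.0 on the Aesthetic model.
score = normalized_reward(6.0, 5.0, "Aesthetic")
```

Normalizing against the base model in this way puts reward models with very different ranges on a common scale, which is what allows them to share one axis in the figure.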

Theorems & Definitions (4)

  • Proposition 5.1: Variance Reduction with Weighted Estimator
  • proof
  • Proposition 5.2: Weighted Averaging as Smoothness Regularization
  • proof