
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen

TL;DR

This work systematically analyzes reinforcement learning for diffusion and flow models by disentangling policy-gradient losses, likelihood estimators, and sampling strategies. It demonstrates that ELBO-based likelihood estimation computed from final samples is the key factor enabling efficient, stable optimization, outweighing the influence of the specific policy-gradient objective. Empirically, the combination of ELBO-based likelihood estimation and ODE sampling yields substantial training efficiency gains, achieving GenEval 0.95 in about 90 GPU hours and surpassing FlowGRPO and DiffusionNFT on multiple reward benchmarks. The study provides a unified framework and robust evidence that likelihood estimation quality drives RL performance, with practical implications for speeding up reward-based diffusion fine-tuning and guiding future scaling to more challenging visual tasks.

Abstract

Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
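The idea of estimating likelihood only from the final generated sample can be illustrated with a small Monte Carlo sketch. The exact estimator used in the paper is not reproduced in this section, so everything below is an assumption made for illustration: a rectified-flow interpolation `x_t = (1 - t) * x0 + t * eps`, an unweighted squared velocity-matching error as the ELBO surrogate, and the hypothetical names `elbo_loglik_surrogate`, `velocity_fn`, and `t_min`.

```python
import numpy as np

def elbo_loglik_surrogate(velocity_fn, x0, n_samples=64, rng=None, t_min=1e-3):
    """Monte Carlo ELBO-style surrogate for log p(x0), up to constants.

    Assumption: the model is a flow-matching velocity field and the
    surrogate is the negative, unweighted velocity-matching error
    averaged over random times t and noises eps. Only the final
    sample x0 is needed, not the sampling trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(t_min, 1.0)           # random time, kept away from 0
        eps = rng.standard_normal(x0.shape)   # fresh Gaussian noise
        x_t = (1.0 - t) * x0 + t * eps        # re-noise the final sample
        target = eps - x0                     # velocity of the straight path
        residual = velocity_fn(x_t, t) - target
        total += float(np.sum(residual ** 2))
    return -total / n_samples                 # higher means more likely

# Usage: compare two hypothetical velocity fields on the same sample.
x0 = np.array([0.5, -1.0, 2.0])
exact_field = lambda x_t, t: (x_t - x0) / t   # matches the target exactly
zero_field = lambda x_t, t: np.zeros_like(x_t)
score_exact = elbo_loglik_surrogate(exact_field, x0, rng=np.random.default_rng(0))
score_zero = elbo_loglik_surrogate(zero_field, x0, rng=np.random.default_rng(0))
```

Under these assumptions, a difference of two such surrogates evaluated with shared noise could stand in for the log-likelihood ratio that a policy-gradient loss needs, without storing or differentiating through the sampling trajectory.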

Paper Structure

This paper contains 34 sections, 1 theorem, 36 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

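The policy-gradient objectives labeled eq:epg, eq:pepg, and eq:par share the same minimizer $\pi_{*}(\bm{x}) \propto \pi_{\text{ref}}(\bm{x})\exp(R(\bm{x})/\beta)$, i.e., the reference policy reweighted by the exponentiated reward at temperature $\beta$.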
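A small numerical check can make the form of this minimizer concrete. The three objectives named in the theorem are not reproduced in this section, so the sketch below only checks the standard KL-regularized reward objective $\mathbb{E}_{\pi}[R] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ on a three-outcome toy space, whose maximizer is known to have the same tilted form. The names `tilted_policy` and `kl_regularized_objective` and all numbers are hypothetical.

```python
import numpy as np

def tilted_policy(pi_ref, rewards, beta):
    """Return pi_* proportional to pi_ref * exp(rewards / beta)."""
    logits = np.log(pi_ref) + rewards / beta
    logits = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_pi[R] - beta * KL(pi || pi_ref); pi must be strictly positive."""
    expected_reward = np.sum(pi * rewards)
    kl = np.sum(pi * np.log(pi / pi_ref))
    return float(expected_reward - beta * kl)

# Toy setup (illustrative values only).
pi_ref = np.array([0.5, 0.3, 0.2])
rewards = np.array([0.0, 1.0, 2.0])
beta = 0.5

pi_star = tilted_policy(pi_ref, rewards, beta)
best = kl_regularized_objective(pi_star, pi_ref, rewards, beta)

# No randomly drawn policy should beat the tilted policy.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=200)
gaps = [best - kl_regularized_objective(q, pi_ref, rewards, beta) for q in candidates]
```

Every entry of `gaps` is non-negative, and `best` matches the closed-form value $\beta \log \sum_{\bm{x}} \pi_{\text{ref}}(\bm{x}) \exp(R(\bm{x})/\beta)$. A smaller $\beta$ concentrates $\pi_{*}$ on high-reward outcomes, while a larger $\beta$ keeps it close to $\pi_{\text{ref}}$.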

Figures (11)

  • Figure 1: Training efficiency and design-space analysis for reward-based diffusion fine-tuning. (Left) GenEval performance across training time for various fine-tuning methods on SD3.5-Medium. (Right) Conceptual summary of the design space considered in this work, highlighting policy-gradient loss design, likelihood estimation, and sampling strategy.
  • Figure 2: Training time comparison on GenEval. We report the total GPU hours (8$\times$H100) required to reach a GenEval score of 0.95 for different fine-tuning methods. ELBO-based likelihood estimation substantially reduces training cost compared to trajectory-based approaches, and ODE sampling further improves efficiency while achieving the same target performance.
  • Figure 3: Qualitative comparison between benchmarks and our model. See the appendix (app:results) for additional figures.
  • Figure 4: Number of prompts seen during training vs. GPU hours needed to reach a GenEval score of 0.95.
  • Figure 5: Ablation on ELBO estimators on GenEval.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Theorem 3.1: Mathematical Validity of PG Objectives
  • proof
  • proof