Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao

TL;DR

This work proposes Stepwise Diffusion Policy Optimization (SDPO), a reinforcement learning (RL) framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism that tracks both the noisy and the predicted clean state at each step, providing dense reward feedback and enabling low-variance, mixed-step optimization.

Abstract

Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks. Code is available at https://github.com/ZiyiZhang27/sdpo.
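The dual-state idea mentioned above (keeping the noisy state and the predicted clean state at every step) can be illustrated with a minimal sketch. It is not taken from the SDPO code base: it assumes a standard DDPM-style noise-prediction parameterization, a deterministic DDIM-like update, and plain Python lists as stand-ins for latents. The names `predict_clean_state`, `dual_state_trajectory`, `eps_model`, and `alpha_bars` are hypothetical.

```python
import math

def predict_clean_state(x_t, eps_pred, alpha_bar_t):
    """Standard DDPM identity: x0_hat = (x_t - sqrt(1 - a) * eps) / sqrt(a)."""
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / scale for x, e in zip(x_t, eps_pred)]

def dual_state_trajectory(x_T, eps_model, alpha_bars):
    """Record (noisy state, predicted clean state) at every denoising step.

    alpha_bars is ordered from the noisiest step to the cleanest one.
    eps_model(x_t, i) is a hypothetical noise predictor (assumption).
    The update is deterministic here for simplicity; an RL setup would
    typically use a stochastic sampler so that log-probabilities exist.
    """
    x_t = list(x_T)
    trajectory = []
    for i, alpha_bar in enumerate(alpha_bars):
        eps = eps_model(x_t, i)
        x0_hat = predict_clean_state(x_t, eps, alpha_bar)
        trajectory.append((x_t, x0_hat))  # the "dual state" at this step
        if i + 1 < len(alpha_bars):
            nxt = alpha_bars[i + 1]
            # Re-noise the clean estimate to the next (less noisy) level.
            x_t = [math.sqrt(nxt) * c + math.sqrt(1.0 - nxt) * e
                   for c, e in zip(x0_hat, eps)]
    return trajectory
```

Each recorded clean state is a candidate image that a reward model could score, which is what makes per-step (dense) reward feedback possible in principle.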

Paper Structure

This paper contains 19 sections, 16 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Generated images for unseen prompts from a pretrained few-step diffusion model (SD-Turbo) and from models finetuned with DDPO and with our SDPO, both using PickScore and the same number of training samples. All images are generated using the same random seed (42). Our SDPO consistently delivers high-quality, reward-aligned images across various few-step settings, whereas DDPO falters in generating high-quality few-step samples and yields noticeably blurrier images than even the pretrained model.
  • Figure 2: Dual-state sampling vs. standard sampling. Unlike the standard sampling process of diffusion models, our dual-state sampling approach maps final outputs from trajectories of varying lengths onto a shared sequence of intermediate clean states $\{\hat{\mathbf{x}}_0^t\}_{t=0}^{T-1}$, enabling dense reward feedback over a mixed-step trajectory with low variance and consistent denoising dynamics.
  • Figure 3: SDPO framework. SDPO first samples a pair of dual-state trajectories $\{\mathbf{x}_t^{a}, \hat{\mathbf{x}}_0^{t,a}\}_{t=0}^{T-1}$ and $\{\mathbf{x}_t^{b}, \hat{\mathbf{x}}_0^{t,b}\}_{t=0}^{T-1}$ using a shared prompt $c$ and initial noise $\mathbf{x}_T$. It then queries the reward function $R$ for clean states at the first, final, and anchor ($t_{\text{anchor}}$) steps and predicts dense rewards $\hat{R}_t$ for the other steps via latent similarity, ultimately yielding the stepwise advantage estimate $\hat{A}_t$. Finally, at each shuffled step $\tau_t$, the MSE loss between the advantage difference $\Delta\hat{A}_{\tau_t}$ and the log-ratio difference $\Delta\tilde{\rho}_{\tau_t}$ (weighted by $\lambda^{(T-\tau_t-1)}/\eta$) is computed.
  • Figure 4: Reward curves for low-step samples, where reward scores are evaluated and averaged over 1-, 2-, 4-, and 8-step samples. The horizontal axis shows the cumulative number of training samples, equivalent to the number of training iterations multiplied by the batch size, which is consistent across all methods.
  • Figure 5: Ablation study on dense reward prediction (left & middle) and discounted returns (right) for SDPO.
  • ...and 3 more figures
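
The loss described in the Figure 3 caption can be sketched roughly as follows. This is a reading of the caption only, not the authors' implementation: it assumes the weight $\lambda^{(T-\tau_t-1)}/\eta$ multiplies the log-ratio difference, that advantages and log-ratios are already available as per-step scalars, and that the result is averaged over steps. The function name `sdpo_style_loss` and its parameters are hypothetical.

```python
import random

def sdpo_style_loss(adv_a, adv_b, logratio_a, logratio_b,
                    lam=0.9, eta=1.0, seed=0):
    """Mean squared gap between advantage and weighted log-ratio differences.

    adv_*      : per-step advantage estimates for trajectories a and b
    logratio_* : per-step log-ratios (current policy vs. reference policy)
    lam, eta   : temporal weighting base and scale (assumed placement)
    """
    T = len(adv_a)
    steps = list(range(T))
    # Step shuffling: in training, each visited step would trigger its own
    # gradient update, so the order matters there. This plain sum is
    # order-independent; the shuffle only mirrors the described procedure.
    random.Random(seed).shuffle(steps)
    total = 0.0
    for tau in steps:
        weight = lam ** (T - tau - 1) / eta
        delta_adv = adv_a[tau] - adv_b[tau]
        delta_rho = logratio_a[tau] - logratio_b[tau]
        total += (delta_adv - weight * delta_rho) ** 2
    return total / T
```

Under this reading, the loss is zero when the weighted log-ratio difference matches the advantage difference at every step, so the policy is pushed to prefer whichever trajectory of the pair has the higher stepwise advantage.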