Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Ziyi Zhang; Li Shen; Sen Zhang; Deheng Ye; Yong Luo; Miaojing Shi; Dongjing Shan; Bo Du; Dacheng Tao

TL;DR

This work proposes Stepwise Diffusion Policy Optimization, a novel RL framework tailored for few-step diffusion models that introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization.

Abstract

Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks. Code is available at https://github.com/ZiyiZhang27/sdpo.
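To make the dual-state idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a deterministic DDIM-style sampler, and the names `eps_model`, `reward_fn`, and `sample_dual_state_trajectory` are hypothetical placeholders. At every step it records both the noisy state and the predicted clean state, and scores the predicted clean state so that each step receives its own reward rather than only the final image.

```python
import numpy as np

def predict_clean(x_t, eps_hat, alpha_bar_t):
    """Standard estimate of the clean sample from a noisy state and predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

def ddim_step(x0_hat, eps_hat, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) to the next, less noisy state."""
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

def sample_dual_state_trajectory(x_T, alpha_bars, eps_model, reward_fn):
    """Collect noisy states, predicted clean states, and one reward per step.

    alpha_bars lists the cumulative alpha at each visited timestep, ordered
    from noisiest to cleanest, and ends with 1.0 for the final clean image.
    eps_model and reward_fn are hypothetical stand-ins (assumptions).
    """
    noisy, clean, rewards = [x_T], [], []
    x_t = x_T
    for i in range(len(alpha_bars) - 1):
        eps_hat = eps_model(x_t, i)
        x0_hat = predict_clean(x_t, eps_hat, alpha_bars[i])
        clean.append(x0_hat)
        rewards.append(reward_fn(x0_hat))  # dense feedback at this step
        x_t = ddim_step(x0_hat, eps_hat, alpha_bars[i + 1])
        noisy.append(x_t)
    return noisy, clean, rewards

# Toy usage with made-up components (assumptions, for illustration only).
rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 4))
alpha_bars = [0.1, 0.5, 0.9, 1.0]            # three denoising steps
toy_eps_model = lambda x, i: 0.5 * x         # placeholder noise predictor
toy_reward = lambda x0: -float(np.mean(x0 ** 2))
noisy, clean, rewards = sample_dual_state_trajectory(
    x_T, alpha_bars, toy_eps_model, toy_reward
)
```

In this sketch the last noisy state coincides with the last predicted clean state because the final cumulative alpha is 1.0. A policy-gradient method would typically use a stochastic sampler and a learned reward model in place of the toy pieces here.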
