Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu

TL;DR

This work tackles the difficulty of fine-tuning ultra-fast ≤2-step diffusion models with arbitrary rewards. It presents LaSRO, a two-stage framework that learns differentiable surrogate rewards in the latent space of a pre-trained SDXL backbone to convert non-differentiable signals into actionable gradients, enabling efficient off-policy exploration. By connecting to value-based RL and employing a latent-space surrogate with Bradley–Terry ranking, LaSRO achieves stable, superior improvements over policy-based RL baselines across general image quality and non-differentiable reward tasks, with extensive ablations supporting its design choices. The approach holds practical significance for flexible, scalable alignment of step-distilled diffusion models and potentially other modalities by enabling reward-guided optimization without costly online policy updates.
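The Bradley–Terry ranking mentioned above can be illustrated with a minimal sketch. It assumes the surrogate emits one scalar score per sample and that pairs are formed by comparing the true (possibly non-differentiable) rewards. The function names `bradley_terry_loss`, `pairs_from_rewards`, and `mean_ranking_loss` are hypothetical and not taken from the paper, and the actual LaSRO surrogate is a network operating in SDXL latent space rather than a list of floats.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred sample outranks the rejected one.

    Under the Bradley-Terry model, P(preferred > rejected) = sigmoid(s_p - s_r).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairs_from_rewards(rewards):
    """Build (winner, loser) index pairs from scalar rewards (assumed pairing rule)."""
    n = len(rewards)
    return [(i, j) for i in range(n) for j in range(n) if rewards[i] > rewards[j]]


def mean_ranking_loss(surrogate_scores, rewards):
    """Average Bradley-Terry loss of the surrogate scores over all ordered pairs."""
    pairs = pairs_from_rewards(rewards)
    if not pairs:
        return 0.0
    total = sum(
        bradley_terry_loss(surrogate_scores[i], surrogate_scores[j])
        for i, j in pairs
    )
    return total / len(pairs)
```

Only the ordering of the true rewards enters the loss, which is why a ranking objective of this kind can be trained against reward signals that provide no gradients themselves.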

Abstract

Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to step-distilled DMs for ultra-fast ($\le2$-step) image generation is challenging. Our analysis identifies several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on these insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for effective reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and tailors reward optimization for $\le2$-step image generation with efficient off-policy exploration. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including DDPO and Diffusion-DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at https://sites.google.com/view/lasro.

Paper Structure

This paper contains 51 sections, 25 equations, 9 figures, 1 table, 2 algorithms.

Figures (9)

  • Figure 1: Generated images of resolution $1024^2$ with $\le2$ steps via LCM-SSD-1B \cite{luo2023latent} (baseline) and those fine-tuned with Image Reward \cite{xu2024imagereward}. $1^{\text{st}}$ column: baseline results. $2^{\text{nd}}$ and $3^{\text{rd}}$: results during and after fine-tuning the baseline via our method LaSRO. $4^{\text{th}}$: fine-tuned via RLCM \cite{oertell2024rl} (a variant of DDPO \cite{black2023training}). $5^{\text{th}}$: fine-tuned via PSO \cite{miao2024tuning} (a variant of Diffusion-DPO \cite{wallace2024diffusion}). LaSRO significantly improves the visual quality of $\le2$-step image generation, while other strong RL methods fail due to training instability and inefficiency.
  • Figure 2: (left) Given the prompt "astronaut riding a horse" as $\mathbf{c}$, a fixed initial noise $\mathbf{x}_{\tau_0}$, and a fixed guidance scale, reducing the number of noise-injecting steps leads to less diverse images generated from SSD-1B \cite{gupta2024progressive} and its LCM, and thus to harder exploration for $p_\theta(\mathbf{x}_{\tau_H} | \mathbf{x}_{\tau_0}, \mathbf{c})$. (right) Illustration of the LCM mapping $f_{\theta}$. We show the growing empirical local Lipschitz constant of the LCM mapping from a noisy image to a clean image and to a generic image quality score (details in Appendix \ref{app:lipchitz}). The x-axis is the input's noise level ($t$ as in $\mathbf{x}_t$).
  • Figure 3: Training pipeline of the fine-tuning stage of LaSRO. This stage alternates between reward fine-tuning the DM and adapting the surrogate reward online. Because LaSRO is connected to value-based RL, we present it in the style of actor-critic methods \cite{konda1999actor}.
  • Figure 4: Comparison of LaSRO and other RL methods for fine-tuning two-step LCMs with Image Reward. Results on the test prompts of T2I-prompt-87K (left) and the (out-of-distribution) prompts of MJHQ-30K (right) show that LaSRO effectively improves one-step (dashed lines) and two-step (solid lines) image generation.
  • Figure 5: (left) LaSRO vs. GORS-LCM (other baselines fail) for fine-tuning LCMs with Attribute Binding Score and Text Alignment Score (all results report two-step performance). (right) Results from ablation studies that validate LaSRO's design (see Sec. \ref{sec:abl}).
  • ...and 4 more figures
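
The "empirical local Lipschitz" quantity in the Figure 2 caption can be illustrated with a minimal sketch. The paper's actual procedure is in its appendix and is not reproduced here, so this assumes a simple finite-difference estimate: perturb the input along random directions at a small fixed radius and record the largest ratio of output change to input change. The name `empirical_local_lipschitz` and the parameters `radius`, `num_samples`, and `seed` are hypothetical.

```python
import math
import random


def empirical_local_lipschitz(f, x, radius=1e-2, num_samples=64, seed=0):
    """Estimate max ||f(x + d) - f(x)|| / ||d|| over random d with ||d|| = radius.

    f maps a list of floats to a list of floats. The result is a sampled
    lower bound on the true local Lipschitz constant around x.
    """
    rng = random.Random(seed)
    fx = f(x)
    best = 0.0
    for _ in range(num_samples):
        # Draw a random direction and rescale it to the requested radius.
        direction = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            continue
        x_perturbed = [xi + radius * d / norm for xi, d in zip(x, direction)]
        diff = [a - b for a, b in zip(f(x_perturbed), fx)]
        out_norm = math.sqrt(sum(d * d for d in diff))
        best = max(best, out_norm / radius)
    return best
```

In the setting of the caption, `f` would stand in for the LCM mapping (or its composition with an image quality score) and `x` for a noisy input at a given noise level; a larger estimate indicates that small input changes produce larger output changes.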