Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park

TL;DR

This work tackles reward over-optimization in diffusion-model fine-tuning by proposing SQDF, a KL-regularized reinforcement learning method that uses a training-free soft Q-function and a reparameterized policy gradient to update the denoising process. It introduces three stabilization techniques: a discount factor for credit assignment, a consistency model for reliable Q estimation, and an off-policy replay buffer to enhance mode coverage. Empirically, SQDF improves target rewards while preserving alignment and diversity in text-to-image tasks (LAION aesthetic, HPSv2) and achieves high sample efficiency in online black-box optimization, outperforming gradient-based and KL-augmented baselines. The approach performs robustly across backbones (SD1.5, SDXL) and settings, improving the reward-diversity Pareto frontier for diffusion fine-tuning.

Abstract

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer significantly from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimate of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
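
For orientation, a KL-regularized fine-tuning objective of the kind the abstract refers to is commonly written as below. This is the generic form found in the KL-regularized RL literature, not necessarily the paper's exact formulation: the reference model $p_{\mathrm{ref}}$, the horizon $T$, and the per-step KL decomposition are assumptions made here for illustration, while $\alpha$ is the KL-regularization coefficient named in the Figure 4 caption.

```latex
\max_{\theta}\;
\mathbb{E}_{x_{T:0}\sim p_\theta}\!\left[\, r(x_0) \;-\; \alpha \sum_{t=1}^{T}
\mathrm{KL}\!\big(p_\theta(\cdot \mid x_t)\,\|\,p_{\mathrm{ref}}(\cdot \mid x_t)\big) \right]
```

A larger $\alpha$ keeps the fine-tuned model closer to the reference, trading target reward for naturalness and diversity.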

Paper Structure

This paper contains 62 sections, 49 equations, 21 figures, 4 tables, 2 algorithms.

Figures (21)

  • Figure 1: Overview of the SQDF framework. The process involves two stages: (1) samples generated by the diffusion model $p_\theta$ are stored in a replay buffer; (2) a noisy sample $x_t$ is drawn from the buffer and denoised one step by the diffusion model $p_\theta$. The consistency model $f_\psi$ then takes $x_{t-1}$ as input and predicts the clean sample $\hat{x}_0$. This prediction is evaluated by a reward model $r_\phi$, and the resulting reward gradient is used to update $p_\theta$ via a reparameterized policy gradient.
  • Figure 2: Comparison of multi-step sampling with one-step $x_0$ estimation. (a): DDPM 50-step sampling accurately captures the $x_0$ distribution. (b): A one-step $x_0$ estimate via Tweedie's formula is highly inaccurate, particularly at early denoising steps. (c): A consistency model, however, provides an $x_0$ estimate with uniform accuracy across all timesteps.
  • Figure 3: Comparison of evaluation metrics during optimization of the target reward. Top: The target reward is the LAION aesthetic score. Bottom: The target reward is HPSv2. (a), (b), (e), and (f): evaluation of alignment score using ImageReward and HPS. (c), (d), (g), and (h): evaluation of diversity using LPIPS and DreamSim.
  • Figure 4: Comparison of trade-off curves with KL-regularized baselines. Curves are obtained by varying the KL-regularization coefficient $\alpha$. Darker points correspond to a stronger KL-regularizer.
  • Figure 5: Comparison of generated images from different fine-tuning methods, using model checkpoints selected when a reward of 8.0 was achieved (or the maximum reward if 8.0 was not reached). The average reward for the presented images is shown for each method.
  • ...and 16 more figures
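
The loop described in the Figure 1 caption can be sketched as a 1-D toy, under loud assumptions: the "diffusion model" is a Gaussian step with a single learnable shift `theta`, the consistency model is replaced by an identity stub, the reward is a hand-written quadratic, and the replay buffer is filled once with fixed draws rather than refreshed from the current model. All names (`consistency_stub`, `reward`, `train`) are hypothetical and none of this is the authors' implementation; it only illustrates how a reward gradient can flow back through a reparameterized denoising step while a KL term pulls the policy toward a reference.

```python
import random


def consistency_stub(x_prev):
    """Hypothetical stand-in for f_psi: maps x_{t-1} to a clean estimate x0_hat."""
    return x_prev  # identity, so d(x0_hat)/d(x_{t-1}) = 1


def reward(x0):
    """Toy differentiable reward (stand-in for r_phi), maximized at x0 = 2."""
    return -(x0 - 2.0) ** 2


def reward_grad(x0):
    """Analytic derivative of the toy reward with respect to x0."""
    return -2.0 * (x0 - 2.0)


def train(alpha, steps=500, batch=32, lr=0.05, sigma=0.5, seed=0):
    """Gradient ascent on E[reward] - alpha * KL for a 1-D Gaussian step.

    Policy:    x_{t-1} ~ N(x_t + theta,     sigma^2)
    Reference: x_{t-1} ~ N(x_t + theta_ref, sigma^2)
    """
    rng = random.Random(seed)
    theta, theta_ref = 0.0, 0.0

    # Stage 1 (simplified): a replay buffer of noisy states x_t.
    buffer = [rng.gauss(0.0, 1.0) for _ in range(256)]

    for _ in range(steps):
        grad = 0.0
        # Stage 2: draw x_t from the buffer and denoise one step.
        for x_t in rng.sample(buffer, batch):
            eps = rng.gauss(0.0, 1.0)
            # Reparameterization: the sample is a deterministic,
            # differentiable function of theta given the noise eps.
            x_prev = x_t + theta + sigma * eps
            x0_hat = consistency_stub(x_prev)
            # Chain rule: d r / d theta = r'(x0_hat) * 1 * 1.
            grad += reward_grad(x0_hat)
        grad /= batch

        # KL between equal-variance Gaussians is
        # (theta - theta_ref)^2 / (2 sigma^2); subtract alpha times its gradient.
        grad -= alpha * (theta - theta_ref) / sigma ** 2

        theta += lr * grad  # ascent step
    return theta


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 5.0):
        print(f"alpha={alpha}: learned shift theta={train(alpha):.3f}")
```

In this toy the optimum has a closed form, theta* = 2(2 - m) / (2 + alpha / sigma^2), where m is the buffer mean, so increasing `alpha` moves the learned shift from the reward peak back toward the reference. This mirrors, in miniature, the trade-off that the Figure 4 caption says is traced by varying $\alpha$.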