Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Hyeongyu Kang; Jaewoo Lee; Woocheol Shin; Kiyoung Om; Jinkyoo Park

TL;DR

This work tackles reward over-optimization in diffusion-model fine-tuning by proposing SQDF, a KL-regularized reinforcement learning method that updates the denoising process with a reparameterized policy gradient of a training-free soft Q-function estimate. It adds three stabilization techniques: a discount factor for credit assignment, a consistency model for more reliable Q estimation, and an off-policy replay buffer to improve mode coverage. Empirically, SQDF improves target rewards while preserving alignment and diversity in text-to-image tasks (LAION aesthetic, HPSv2) and achieves high sample efficiency in online black-box optimization, outperforming gradient-based and KL-augmented baselines. The approach performs robustly across backbones (SD1.5, SDXL) and settings, improving the reward-diversity trade-off for diffusion fine-tuning.

Abstract

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer significantly from reward over-optimization, which yields high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimate of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
