Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking

Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian

TL;DR

This work argues that prior DPO methods for diffusion models, which rank intermediate noisy samples by the preference order of their final images, can assign preference labels that conflict with the samples' step-wise rewards. It proposes TailorPO, which generates noisy samples from the same denoising input, ranks them by their step-wise reward, and optimizes them with a DPO-style loss so that the gradient direction stays consistent with the preference order. A variant, TailorPO-G, adds gradient guidance to widen the reward gap between the paired samples. Experiments on Stable Diffusion report more aesthetically pleasing, human-preferred images and generalization across prompts and reward models.

Abstract

Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generations and noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we theoretically identify inherent issues in this assumption and its impacts on the effectiveness of preference alignment. We first demonstrate the inherent issues from two perspectives: gradient direction and preference order, and then propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues through a simple yet efficient design. Additionally, we incorporate the gradient guidance of diffusion models into preference alignment to further enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
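
The abstract describes applying a DPO-style objective to a pair of intermediate noisy samples ranked by step-wise reward. As a rough illustration only, the sketch below evaluates the standard DPO loss for one such pair. The function name `dpo_step_loss`, its argument names, and the default `beta` are assumptions made for this example, and the exact objective used in the paper may differ.

```python
import math


def dpo_step_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, dis-preferred) pair.

    Each argument is a log-probability of a noisy sample x_{t-1} given the
    shared input x_t, under the fine-tuned model (logp_*) or the frozen
    reference model (ref_logp_*). All names here are hypothetical.
    """
    # Implicit reward margin between the preferred and dis-preferred sample.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as softplus(-margin) without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With no preference signal the loss equals log(2).
neutral = dpo_step_loss(0.0, 0.0, 0.0, 0.0)
# Raising the preferred sample's log-probability lowers the loss.
improved = dpo_step_loss(1.0, 0.0, 0.0, 0.0)
```

Because both samples in a pair share the same input $x_t$, the four log-probabilities are all conditioned on the same state, which is the setting the abstract contrasts with ranking samples from different trajectories.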

Paper Structure

This paper contains 24 sections, 1 theorem, 24 equations, 13 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

Let a measurement $g(x_0)=\mathcal{A}(x_0)+n$, where $\mathcal{A}(\cdot)$ is a measure operator defined on images $x_0$ and $n\sim\mathcal{N}(0, \sigma^2 I)$ is the measurement noise. The Jensen gap between $\mathbb{E}[g(x_0)|c, x_t]$ and $g(\mathbb{E}[x_0|c,x_t])$, i.e., $\mathcal{J}=\mathbb{E}[g(x_

Figures (13)

  • Figure 1: Framework overview of (a) previous method and (b) TailorPO. In the previous method, the preference order is determined based on final outputs and used to guide the optimization of intermediate noisy samples in different generation trajectories. In contrast, we generate noisy samples from the same input $x_t$ and directly rank their preference order for optimization.
  • Figure 1: Gradient guidance successfully increased/decreased the reward of most samples.
  • Figure 2: The preference order of intermediate noisy samples is not always consistent with the preference order of final output images, from both perspectives of the aesthetic score (red) and ImageReward score (blue).
  • Figure 3: Framework of TailorPO. At each step $t$, we start from the same $x_t$ to generate two noisy samples $x^0_{t-1}$ and $x^1_{t-1}$. Subsequently, we compare their step-wise reward to determine their preference order. For the preferred sample, if the reward model is differentiable, we employ the gradient guidance to further increase its reward to obtain $x^+_{t-1}$. Then, we optimize the generating probability of preferred and dis-preferred samples. After the optimization at step $t$, the preferred sample is taken as the input $x_{t-1}$ of the next step for later sampling and optimization.
  • Figure 4: The change curve of reward values during the fine-tuning process. Experiments were conducted for three runs and we plot the average value and standard deviation of the reward.
  • ...and 8 more figures
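
The procedure in the Figure 3 caption can be sketched as a loop over denoising steps. This is a toy illustration on scalars, not the authors' implementation: `toy_denoise_step`, `toy_step_reward`, `toy_guide`, and `collect_stepwise_pairs` are hypothetical names, the sampler and reward are stand-ins, and the sketch only collects the ranked pairs instead of running the per-step optimization the caption describes.

```python
import random


def toy_denoise_step(x_t, t, rng):
    # Hypothetical stand-in for sampling x_{t-1} from the model given x_t.
    return 0.9 * x_t + rng.gauss(0.0, 0.1)


def toy_step_reward(x, t):
    # Hypothetical differentiable step-wise reward, maximal at x = 1.
    return -(x - 1.0) ** 2


def toy_guide(x, t, eta=0.1):
    # One gradient-ascent step on toy_step_reward (its derivative is -2(x - 1)).
    return x + eta * (-2.0 * (x - 1.0))


def collect_stepwise_pairs(x_T, num_steps, denoise_step, step_reward, rng, guide=None):
    """Return a list of (t, x_t, preferred, dis_preferred) tuples."""
    pairs = []
    x_t = x_T
    for t in range(num_steps, 0, -1):
        # Draw two candidates from the SAME input x_t.
        cand_a = denoise_step(x_t, t, rng)
        cand_b = denoise_step(x_t, t, rng)
        # Rank them by their step-wise reward.
        if step_reward(cand_a, t - 1) >= step_reward(cand_b, t - 1):
            win, lose = cand_a, cand_b
        else:
            win, lose = cand_b, cand_a
        # Optionally push the preferred sample toward higher reward.
        if guide is not None:
            win = guide(win, t - 1)
        pairs.append((t, x_t, win, lose))
        # The preferred sample becomes the input of the next step.
        x_t = win
    return pairs


pairs = collect_stepwise_pairs(
    0.0, 5, toy_denoise_step, toy_step_reward, random.Random(0), guide=toy_guide
)
```

Each tuple holds everything a pairwise loss needs at step $t$: the shared input and the two ranked samples. In a real setup the guidance step would use the gradient of a differentiable reward model, which this toy replaces with a closed-form derivative.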

Theorems & Definitions (1)

  • Proposition 1: proven by Chung et al. (2023) [chung2023diffusion]
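
Proposition 1 concerns a Jensen gap, that is, the difference between the expectation of a function and the function of the expectation. The snippet below computes that quantity for a toy discrete distribution to show when the gap vanishes. It is an illustration only and says nothing about the specific bound in the proposition; `jensen_gap` and the example values are assumptions introduced here.

```python
def jensen_gap(values, probs, g):
    """Return E[g(X)] - g(E[X]) for a discrete random variable X."""
    mean_x = sum(p * x for x, p in zip(values, probs))
    mean_g = sum(p * g(x) for x, p in zip(values, probs))
    return mean_g - g(mean_x)


# X takes the values 0 and 2 with equal probability, so E[X] = 1.
values, probs = [0.0, 2.0], [0.5, 0.5]

# Nonlinear g: E[X^2] = 2 while (E[X])^2 = 1, so the gap is nonzero.
gap_square = jensen_gap(values, probs, lambda x: x * x)

# Affine g: expectation and g commute, so the gap is zero.
gap_affine = jensen_gap(values, probs, lambda x: 3.0 * x + 1.0)
```

The gap is zero for affine functions and generally nonzero otherwise, which is why a bound on it matters when a function is evaluated at a conditional mean such as $\mathbb{E}[x_0|c,x_t]$ instead of being averaged over $x_0$.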