Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation

Dingyuan Shi, Yong Wang, Hangyu Li, Xiangxiang Chu

TL;DR

This work introduces Denoised Distribution Estimation (DDE), a direct preference optimization framework for diffusion models. It addresses the problem that preference labels exist only for terminal samples by explicitly linking intermediate denoising steps to the terminal distribution $p_\theta(x_0)$. Two complementary estimation strategies are proposed: stepwise estimation for the early, noisier segment of the trajectory ($T \rightarrow t$) and single-shot DDIM-based estimation for the final segment ($t \rightarrow 0$). Together they form a unified loss that naturally assigns more credit to the middle denoising steps. The authors report state-of-the-art quantitative and qualitative results on SD15 and SDXL without auxiliary reward models. The findings suggest a principled way to prioritize middle-trajectory optimization in preference alignment for diffusion-based generation.

Abstract

Diffusion models have shown remarkable success in text-to-image generation, making preference alignment for these models increasingly important. Preference labels are typically available only at the terminal step of a denoising trajectory, which makes it difficult to optimize the intermediate denoising steps. In this paper, we propose Denoised Distribution Estimation (DDE), which explicitly connects intermediate steps to the terminal denoised distribution so that preference labels can be used to optimize the entire trajectory. To this end, we design two estimation strategies for DDE. The first is stepwise estimation, which uses the conditional denoised distribution to estimate the model denoised distribution. The second is single-shot estimation, which converts the model output into the terminal denoised distribution via DDIM modeling. Analytically and empirically, we show that DDE equipped with the two estimation strategies naturally yields a novel credit assignment scheme that prioritizes optimizing the middle part of the denoising trajectory. Extensive experiments demonstrate that our approach achieves superior performance, both quantitatively and qualitatively.
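
For readers unfamiliar with the single-shot step, the standard DDIM relation that maps a noise prediction at step $t$ to an estimate of the terminal sample is reproduced below. The symbols $\epsilon_\theta$ (the model's noise prediction) and $\bar{\alpha}_t$ (the cumulative noise-schedule product) follow the usual DDPM/DDIM convention and are assumptions here, since the abstract does not fix a notation; the paper's exact parameterization may differ.

```latex
\hat{x}_0 \;=\; \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```

Because $\hat{x}_0$ is obtained from one evaluation of $\epsilon_\theta$, a preference label attached to $x_0$ can be related to the model output at step $t$ without running the remaining denoising steps.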

Paper Structure

This paper contains 21 sections, 16 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison between previous methods and our DDE. The superscripts "$w$" and "$l$" denote the winning and losing samples of a preference pair, respectively. Previous methods ignore the connections among denoising steps, so their optimization relies heavily on the credit assignment scheme for terminal preference signals. In contrast, our DDE approach explicitly estimates the terminal distribution from any given step $t$, thereby naturally deriving a scheme that enables direct optimization with the preference labels.
  • Figure 2: Overall framework of DDE. The training process is as follows: 1) Sample random noise ($x_T^w, x_T^l$), winning and losing samples ($x_0^w, x_0^l$), and a denoising step $t$; 2) Conduct stepwise estimation from $T\rightarrow t$. By using $\exp\{r_k\} q(x_{k}|x_{k+1}, x_0)$ as an estimate of $p_\theta(x_k|x_{k+1})$ for all $t\le k \le T-1$, the cumulative product of denoising steps from $T$ to $t$ is estimated as $\exp\{\sum_{k=t}^{T-1} r_k\} q(x_t|x_0)$; 3) Apply single-shot estimation from $t \rightarrow 0$. Using DDIM, $p_\theta(x_{t-1}|x_t)$ is converted to $p_\theta(\hat{x}_{0}|x_t)$ with a single model evaluation; 4) Leverage the preference label on $x_0$ for training. Additionally, in step 3, $p_\theta(x_{t-1}|x_t)$ is used to calculate the calibration coefficients $r_{t-1} = \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)}$, which carry no gradient. These coefficients are updated using an exponential moving average for subsequent iterations.
  • Figure 3: Compared to the SD15 model, our model generates images with better detail, structure, and text alignment. Specifically, it generates a panda bear with tubes and flasks, which better matches "scientist" in the prompt. Additionally, the head portrait has more accurate eye detail, and the rabbit is wearing armor. Our model generates images containing the items requested by the prompt (e.g., surfboard, vases). Structure is preserved (e.g., sofa and superhero), and the background is more detailed (e.g., the cyberpunk cars).
  • Figure 4: Our model generates images with better detail, structure, and text alignment than the SDXL model. It generates buildings that retain window details and human hands with the correct number of fingers. The structure of the dancing body is preserved, and the requested text as well as the reflection in the water are generated correctly. It also generates a pigeon wearing a suit, as specified by the prompt.
  • Figure 5: (a) The values of the correction terms and DDIM coefficients show that larger values weaken the effectiveness of denoising optimization at both boundaries. (b) The convergence behavior of the EMA update for the calibration coefficients $r_t$ indicates stabilization within 100 training iterations. (c) A comparative analysis of optimization across various steps reveals that optimizing only half of the steps yields performance comparable to optimizing all steps.
  • ...and 2 more figures
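
The calibration bookkeeping described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `CalibrationEMA`, the zero initialization, and the decay value are hypothetical choices, and the log-probabilities are assumed to be computed elsewhere and passed in as plain floats (that is, already detached from any gradient).

```python
from typing import List


class CalibrationEMA:
    """Keeps one calibration coefficient r_k per denoising step k = 0..T-1.

    Hypothetical helper: the caption only states that the coefficients are
    gradient-free and updated with an exponential moving average.
    """

    def __init__(self, num_steps: int, decay: float = 0.99) -> None:
        self.decay = decay  # EMA decay (assumed value, not from the paper)
        self.r: List[float] = [0.0] * num_steps  # zero init is an assumption

    def update(self, k: int, log_p_model: float, log_q_posterior: float) -> None:
        # Observed r_k = log p_theta(x_k | x_{k+1}) - log q(x_k | x_{k+1}, x_0).
        observed = log_p_model - log_q_posterior
        # Blend the new observation into the running estimate.
        self.r[k] = self.decay * self.r[k] + (1.0 - self.decay) * observed

    def cumulative(self, t: int) -> float:
        # sum_{k=t}^{T-1} r_k: the log-scale correction applied to q(x_t | x_0)
        # when estimating the product of denoising steps from T down to t.
        return sum(self.r[t:])


if __name__ == "__main__":
    ema = CalibrationEMA(num_steps=4, decay=0.5)
    # At sampled step t, the caption updates the coefficient with index t - 1.
    ema.update(2, log_p_model=-1.0, log_q_posterior=-2.0)
    ema.update(3, log_p_model=0.0, log_q_posterior=-2.0)
    print(ema.cumulative(2))
```

Keeping the coefficients as plain numbers outside the computation graph mirrors the caption's "non-gradient" description: they rescale the stepwise estimate but are not themselves optimized.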