PC-Diffusion: Aligning Diffusion Models with Human Preferences via Preference Classifier

Shaomeng Wang, He Wang, Xiaolu Wei, Longquan Dai, Jinhui Tang

TL;DR

PC-Diffusion tackles the misalignment between diffusion-generated outputs and human preferences by introducing a lightweight Preference Classifier that guides generation without updating the base model or relying on a reference policy. The authors prove that preference-guided distributions are propagated consistently across timesteps and that the classifier's training objective is equivalent to a reference-free version of Direct Preference Optimization (DPO). Empirically, PC-Diffusion achieves preference consistency comparable to or better than DPO while substantially reducing training cost and improving stability. The approach offers a practical, scalable path to human-aligned diffusion synthesis across aesthetic alignment, text-to-image alignment, and conditional generation tasks.

Abstract

Diffusion models have achieved remarkable success in conditional image generation, yet their outputs often remain misaligned with human preferences. To address this, recent work has applied Direct Preference Optimization (DPO) to diffusion models, yielding significant improvements. However, DPO-like methods exhibit two key limitations: 1) high computational cost, due to fine-tuning the entire model; 2) sensitivity to reference model quality, due to its tendency to introduce instability and bias. To overcome these limitations, we propose a novel framework for human preference alignment in diffusion models (PC-Diffusion), using a lightweight, trainable Preference Classifier that directly models the relative preference between samples. By restricting preference learning to this classifier, PC-Diffusion decouples preference alignment from the generative model, eliminating both the need to fine-tune the entire model and the reliance on a reference model. We further provide theoretical guarantees for PC-Diffusion: 1) PC-Diffusion ensures that the preference-guided distributions are consistently propagated across timesteps. 2) The training objective of the preference classifier is equivalent to DPO, but does not require a reference model. 3) The proposed preference-guided correction can progressively steer generation toward preference-aligned regions. Empirical results show that PC-Diffusion achieves preference consistency comparable to DPO while significantly reducing training costs and enabling efficient and stable preference-guided generation.
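
As a rough illustration of what "directly models the relative preference between samples" could look like, the sketch below computes a Bradley-Terry style pairwise logistic loss on two scalar classifier scores. This is an assumption made for illustration, not the paper's stated objective: the function name `pairwise_preference_loss` and the use of raw scalar scores are hypothetical, and the abstract only says that the classifier's objective is equivalent to DPO without a reference model.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Hypothetical helper for illustration; the scores stand in for the
    outputs of a preference classifier on a preferred/rejected pair.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Equal scores give log(2); a larger margin in favor of the
    # preferred sample gives a smaller loss.
    print(pairwise_preference_loss(0.0, 0.0))
    print(pairwise_preference_loss(2.0, 0.0))
```

Note that no reference-model term appears in this loss: only the two classifier scores enter, which is the property the abstract highlights. Whether the paper's objective takes exactly this form is not stated in this summary.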

Paper Structure

This paper contains 17 sections, 3 theorems, 11 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

(Proof in the supplementary material) Given a preference classifier $\mathcal{S}_\theta(x)$, suppose that $x_t$ is sampled from $\hat{p}_\phi^\theta(x_t)$ and that the transition $\hat{p}_\phi^\theta(x_{t-1} \mid x_t)$ is then applied. Then we can generate: where $N_{t-1}$ is a normalization term.
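
To give a feel for how a classifier can steer a single reverse step, the sketch below applies the standard classifier-guidance approximation to a one-dimensional Gaussian transition: the mean is shifted by the variance times the gradient of the log classifier score. This is a generic technique swapped in for illustration, not the paper's transition $\hat{p}_\phi^\theta(x_{t-1} \mid x_t)$, whose exact form is not given in this summary. The names `guided_mean` and `toy_log_score`, the `scale` parameter, and the finite-difference gradient are all assumptions.

```python
from typing import Callable


def guided_mean(
    mean: float,
    variance: float,
    log_score: Callable[[float], float],
    scale: float = 1.0,
    eps: float = 1e-4,
) -> float:
    """Shift a 1-D Gaussian mean along the gradient of a log classifier score.

    Generic classifier-guidance approximation (an assumption here, not the
    paper's exact correction): mean + scale * variance * d/dx log S(x),
    with the derivative evaluated at the unguided mean.
    """
    # Central finite difference stands in for automatic differentiation.
    grad = (log_score(mean + eps) - log_score(mean - eps)) / (2.0 * eps)
    return mean + scale * variance * grad


def toy_log_score(x: float) -> float:
    """Hypothetical log-score that peaks at x = 3 (the 'preferred' region)."""
    return -0.5 * (x - 3.0) ** 2


if __name__ == "__main__":
    # The unguided mean 0.0 is pulled toward the preferred region at 3.0.
    print(guided_mean(0.0, 0.5, toy_log_score))
```

Repeating such a shift at every timestep is one way a correction term could progressively move samples toward preference-aligned regions, which is the behavior the third guarantee in the abstract describes.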

Figures (4)

  • Figure 1: Relationship between PC-Diffusion, DPO, and DDPM. Nodes: the top row ($t$) and bottom row ($t-1$) list, from left to right, the preference-guided process $\hat{p}_\phi^\theta(\cdot)$, the standard DDPM process $p_\phi(\cdot)$, and the DPO process $\hat{p}_\phi(\cdot)$. Horizontal arrows: method transformations at the same timestep. PC-Diffusion steers DDPM toward human-preferred distributions, while DPO maps DDPM to the DPO process. Vertical arrows: transitions $x_t \to x_{t-1}$ within each column: PC-Diffusion in Eq. \ref{eq:pc_transition}, DDPM in Eq. \ref{eq:ddpm-sampling}, and DPO transitions.
  • Figure 2: Qualitative Comparison for Aesthetic Alignment. We present an alternative alignment approach based on PC-Diffusion, which steers pretrained diffusion models toward human-preferred outputs via a preference-guided correction term, without fine-tuning the entire model. Applied to SD1.5 rombach2022high, PC-Diffusion achieves superior visual and textual alignment compared to baselines including SD1.5-Base, SD1.5-DPO wallace2024diffusion, SD1.5-KTO li2024aligning, SD1.5-SPO liang2024step.
  • Figure 3: Qualitative Comparison of Text-Image Alignment. This figure shows images generated by PC-Diffusion and other methods (e.g., SD1.5-DPO, SD1.5-KTO, SD1.5-SPO) for the text prompts listed on the left. PC-Diffusion achieves enhanced semantic fidelity, particularly in accurately rendering elements specified in the prompt, such as color and object positioning. Yellow boxes: regions that match the prompt. Red boxes: regions that do not match the prompt.
  • Figure 4: Comparison of existing controllable diffusion models on different conditions. This figure illustrates a qualitative comparison of our proposed PC-Diffusion against other models, including ControlNet zhang2023adding, Gligen li2023gligen, T2I-Adapter mou2024t2i, UniControl qin2023unicontrol, Uni-ControlNet zhao2023uni, and ControlNet++ li2024controlnet++. Notably, PC-Diffusion more accurately follows the input conditions compared to baseline methods. Unsupported: The method does not provide a model for image generation.

Theorems & Definitions (4)

  • Definition 1: Preference Classifier Guidance
  • Theorem 1
  • Theorem 2
  • Theorem 3