Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

Zhou Jiang, Yandong Wen, Zhen Liu

TL;DR

The paper reframes diffusion-model alignment to human preferences as classifier-free guidance (CFG), enabling inference-time control without heavy base-model retraining. It introduces Preference-Guided Diffusion (PGD) and its contrastive variant (cPGD), where a finetuned positive/negative signal guides sampling to sharpen alignment while preserving diversity. The authors ground cPGD in a maximum-entropy and Bradley–Terry framework, derive a practical estimator, and show that simple Taylor-based merging can collapse multi-model guidance into a single checkpoint without sacrificing performance. Experiments on Stable Diffusion 1.5 and SDXL with Pick-a-Pic v2 and HPDv3 demonstrate consistent improvements in reward proxies, diversity, and user preference alignment, with plug-and-play transfer to other architectures. The work offers a lightweight, scalable path to better alignment, enabling broader deployment of preference-aware diffusion systems.

Abstract

Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a contrastive guidance vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and controllable alignment signal. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
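
The contrastive guidance step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `contrastive_guidance`, its parameter names, and the use of plain Python lists in place of model output tensors are all assumptions made for this example. It shows only the per-step combination (base prediction plus strength times positive minus negative), not the sampler or the finetuned modules themselves.

```python
from typing import List


def contrastive_guidance(
    base_pred: List[float],
    pos_pred: List[float],
    neg_pred: List[float],
    strength: float,
) -> List[float]:
    """Combine three per-step predictions into one guided prediction.

    Hypothetical helper: computes base + strength * (positive - negative)
    elementwise, following the description in the abstract.
    """
    if not (len(base_pred) == len(pos_pred) == len(neg_pred)):
        raise ValueError("all predictions must have the same length")
    return [
        b + strength * (p - n)
        for b, p, n in zip(base_pred, pos_pred, neg_pred)
    ]


# Toy values standing in for the outputs of the base model and the
# modules finetuned on positive and negative data (assumed, not real data).
base = [0.5, -1.0, 2.0]
pos = [1.0, 0.0, 2.0]
neg = [0.0, 0.0, 1.0]

guided = contrastive_guidance(base, pos, neg, strength=2.0)
unguided = contrastive_guidance(base, pos, neg, strength=0.0)
```

With a strength of zero the guided prediction reduces to the base prediction, which is what makes the strength a user-controllable knob at inference time.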

Paper Structure

This paper contains 45 sections, 42 equations, 27 figures, and 12 tables.

Figures (27)

  • Figure 1: Toy 2D experiment on DPO (top) and our proposed PGD (bottom), demonstrating the overfitting issue in DPO training. Black circles indicate positive sample clusters and red crosses indicate negative sample clusters. $w$ is the guidance weight of PGD and $\beta$ is the DPO scale parameter.
  • Figure 2: Comparison of base, DPO, and PGD: PGD retains base fidelity while leveraging DPO-learned preferences.
  • Figure 3: Illustration of cPGD. pSFT and nSFT denote inference with the model finetuned on positive and negative samples, respectively.
  • Figure 4: Comparison of preference-optimization methods on SDXL. Columns show outputs from the base model (SDXL), DPO, MaPO, NPO, PGD, and cPGD. PGD and cPGD achieve the highest rewards and are the most effective at aligning with the human preferences implied by the Pick-a-Pic v2 dataset.
  • Figure 5: Overall comparison on SDXL. Radar axes report mean scores (higher is better): PickScore (PS), HPSv2, HPSv3, CLIP, and ImageReward (IR). Polygons closer to the outer rim indicate better aggregate performance across metrics.
  • ...and 22 more figures