Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization

Daoan Zhang; Guangchen Lan; Dong-Jun Han; Wenlin Yao; Xiaoman Pan; Hongming Zhang; Mingxiao Li; Pengcheng Chen; Yu Dong; Christopher Brinton; Jiebo Luo

TL;DR

SSPO tackles the challenge of aligning diffusion-based text-to-visual models to human preferences without relying on reward models or large-scale paired feedback. It combines supervised fine-tuning stability with Direct Preference Optimization through Random Checkpoint Replay and Self-Sampling Regularization, enabling adaptive switching between learning signals. The approach achieves state-of-the-art or competitive results on text-to-image benchmarks and strong performance on text-to-video tasks, with theoretical justification for improved generalization. This work offers a scalable, RM-free pathway to higher-quality, human-aligned diffusion outputs.

Abstract

Existing post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement learning (RL) methods. The former is stable during training but suffers from limited generalization; the latter generalizes better but relies on additional preference data or reward models and carries the risk of reward exploitation. To preserve the advantages of both (no need for paired data or reward models, the training stability of SFT, and the generalization ability of RL), this paper proposes a new alignment method, Self-Sampling Preference Optimization (SSPO). SSPO introduces a Random Checkpoint Replay (RCR) strategy that uses historical checkpoints to construct paired data, thereby mitigating overfitting. In addition, a Self-Sampling Regularization (SSR) strategy dynamically evaluates the quality of generated samples: when the generated samples are more likely to be winning samples, training automatically switches from Direct Preference Optimization (DPO) to SFT, so that the training signal reflects the actual quality of the samples. We validate SSPO on both text-to-image and text-to-video benchmarks. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
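
The interplay between RCR and SSR described above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that each sample is summarized by a scalar denoising error, that the preference branch takes the Diffusion-DPO-style form of a log-sigmoid over error differences, and that the SSR decision arrives as a boolean flag (`generated_looks_winning`), since the abstract does not state the actual criterion. All function and parameter names are hypothetical.

```python
import math
import random


def sample_checkpoint(num_saved, rng):
    """RCR step: uniformly pick a saved checkpoint index in [0, num_saved - 1]."""
    if num_saved < 1:
        raise ValueError("need at least one saved checkpoint")
    return rng.randrange(num_saved)


def dpo_style_loss(err_w, err_l, ref_err_w, ref_err_l, beta=1.0):
    """Preference loss on denoising errors (assumed Diffusion-DPO-style form).

    err_* are the current model's errors on the winning / losing sample;
    ref_err_* are the errors of the reference (here: a replayed checkpoint).
    """
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    z = beta * margin
    # -log(sigmoid(-z)) equals softplus(z); computed in a numerically stable way.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sft_style_loss(err_w):
    """SFT branch: only fit the winning (ground-truth) sample."""
    return err_w


def sspo_toy_loss(err_w, err_l, ref_err_w, ref_err_l,
                  generated_looks_winning, beta=1.0):
    """SSR switch (hypothetical flag): fall back to SFT when the self-sampled
    'losing' sample is judged likely to be a winning one."""
    if generated_looks_winning:
        return sft_style_loss(err_w)
    return dpo_style_loss(err_w, err_l, ref_err_w, ref_err_l, beta)


if __name__ == "__main__":
    rng = random.Random(0)
    k = sample_checkpoint(num_saved=5, rng=rng)  # which checkpoint to replay
    print("replayed checkpoint:", k)
    print("DPO branch:", sspo_toy_loss(0.3, 0.9, 0.4, 0.8, False))
    print("SFT branch:", sspo_toy_loss(0.3, 0.9, 0.4, 0.8, True))
```

In a real training loop the replayed checkpoint would both generate the candidate losing sample and serve as the reference model. The sketch keeps only the control flow, so the switch between the two learning signals is easy to see.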

Paper Structure (30 sections, 3 theorems, 23 equations, 12 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Background
Denoising Diffusion Probabilistic Model
Direct Preference Optimization
Self-Sampling Preference Optimization
Random Checkpoint Replay
Experiments
Setup
Text-to-Image
Text-to-Video
Results
Analysis of Text-to-Image
Memory and Time Cost
Analysis of Text-to-Video
...and 15 more sections

Key Result

Theorem 4.1

With uniform sampling from the previous checkpoints $[0, K-1]$, the PAC-Bayesian upper bound on the generalization loss is guaranteed to be less than or equal to the bound obtained with the latest checkpoint alone. Here ${L}(\pi_{\theta}) \coloneqq \mathop{\mathbb{E}}_{\mathbf{x} \sim \mathcal{D}} \mathcal{L}(\theta; \mathbf{x})$ is the generalization loss of the policy, $Q_{\rm UNI}$ is the posterior policy distribution with un

Figures (12)

Figure 1: Illustration of SFT, DPO, and the proposed SSPO. $x^{w}$ denotes the winning samples and $x^{l}$ the losing samples. $x^{\rm rand\_w}$ and $x^{\rm rand\_l}$ are samples generated by randomly sampled checkpoints. SSPO can switch between SFT and DPO based on the current state of the datapoint.
Figure 2: Ablation study of the Experience Replay Distribution selection strategies. The x-axis is the number of training steps $T$. The y-axis is the testing score. (Higher is better.)
Figure 3: Correlation between SSR Rate and PickScore. There is a significant correlation between SSR Rate and PickScore.
Figure 4: Ablation study on checkpoint saving frequency. A checkpoint is saved every $K$ steps ($K = 10/20/50$).
Figure 5: Text-to-image generation results of SD-1.5, SPO, SPIN-Diffusion and SSPO. The prompts from left to right are: (1) Photo of a pigeon in a well tailored suit getting a cup of coffee in a cafe in the morning; (2) Ginger Tabby cat watercolor with flowers; (3) An image of a peaceful mountain landscape at sunset, with a small cabin nestled in the trees and a winding river in the foreground; (4) Detailed Portrait Of A Disheveled Hippie Girl With Bright Gray Eyes By Anna Dittmann, Digital Painting, 120k, Ultra Hd, Hyper Detailed, Complimentary Colors, Wlop, Digital Painting; (5) b&w photo of 42 y.o man in white clothes, bald, face, half body, body, high detailed skin, skin pores, coastline, overcast weather.
...and 7 more figures

Theorems & Definitions (4)

Theorem 4.1
Theorem 4.2
Lemma A.1
Definition A.3
