SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback

Xiaoxuan He; Siming Fu; Wanli Li; Zhiyuan Li; Dacheng Yin; Kang Rong; Fengyun Rao; Bo Zhang

TL;DR

The paper addresses the challenge of aligning diffusion models to human preferences when large-scale human data or external reward models are impractical. It introduces SAIL, a self-amplified iterative learning framework where the model bootstraps from a small seed of human preferences to generate, evaluate, and refine its own outputs in a closed loop, aided by a ranked mixup strategy to prevent catastrophic forgetting. The key contributions are the first implicit self-rewarding alignment framework, a mixup-based data strategy for stable self-improvement, and extensive experiments showing SAIL outperforms state-of-the-art methods on multiple benchmarks with only ~6% of typical preference data. The results demonstrate that diffusion models possess latent self-improvement capabilities that, when harnessed properly, can reduce reliance on large human annotations and external reward models, enabling more scalable and robust alignment. The work suggests promising extensions to other modalities (e.g., video) and highlights SAIL as a practical pathway toward bias-resistant, data-efficient alignment in real-world applications.

Abstract

Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
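The closed loop described in the abstract (generate, self-annotate, mix with the human seed set, refine) can be illustrated with a small Python sketch of the data-construction step. This is not the authors' implementation: the function names, the DPO-style implicit reward, the top-versus-bottom pairing rule, and the `seed_fraction` mixing rule are all assumptions made for illustration. The log-probabilities are placeholder numbers; for a diffusion model they would have to be approximated in practice.

```python
import random

def implicit_reward(logp_policy, logp_ref, beta=1.0):
    """DPO-style implicit reward (assumed form): beta * (log pi_theta - log pi_ref)."""
    return beta * (logp_policy - logp_ref)

def self_annotate(candidates, rewards):
    """Rank candidates by reward and return (preferred, rejected) as (top, bottom)."""
    ranked = sorted(zip(rewards, candidates), key=lambda rc: rc[0], reverse=True)
    return ranked[0][1], ranked[-1][1]

def mix_pairs(seed_pairs, self_pairs, seed_fraction, rng):
    """Hypothetical mixup: keep the batch size at len(self_pairs), drawing
    roughly seed_fraction of it from the human seed pairs and the rest
    from the self-annotated pairs."""
    n_total = len(self_pairs)
    n_seed = min(len(seed_pairs), round(seed_fraction * n_total))
    mixed = rng.sample(seed_pairs, n_seed) + rng.sample(self_pairs, n_total - n_seed)
    rng.shuffle(mixed)
    return mixed

# Toy example: three candidates generated for one prompt (placeholder values).
candidates = ["img_a", "img_b", "img_c"]
logp_policy = [-10.0, -8.0, -12.0]
logp_ref = [-9.0, -9.5, -10.0]
rewards = [implicit_reward(p, r) for p, r in zip(logp_policy, logp_ref)]
winner, loser = self_annotate(candidates, rewards)

# Mix self-annotated pairs with the human seed pairs before the next update.
seed_pairs = [("s1_win", "s1_lose"), ("s2_win", "s2_lose")]
self_pairs = [("a", "b"), ("c", "d"), ("e", "f"), (winner, loser)]
mixed = mix_pairs(seed_pairs, self_pairs, seed_fraction=0.5, rng=random.Random(0))
```

In a full loop, the mixed pairs would feed a preference-optimization update, and the updated model would generate the next round of candidates. Keeping a fixed share of human seed pairs in every round is one simple way to express the abstract's stated goal of adhering to the initial human priors.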

Paper Structure (15 sections, 9 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Human Preference Optimization
Online Direct Preference Optimization
Preliminary
Method
Self-Rewarding Preference Ranking With Self-Generated Data
Closed-Loop Boosting Diffusion Model with Mixup Ranked Preference Data
Experimental Results
Experiment Settings
Primary Results: Aligning Diffusion Models
Initialization on Large Seed Data
Comparison with Online DPO
Ablation Study
Conclusion

Figures (4)

Figure 1: Comparison of three direct preference optimization methods. Unlike Offline DPO and Online DPO, SAIL updates iteratively without a large preference dataset or an external reward model.
Figure 2: Iterative performance improvement of SAIL with self-generated data on the Pick-a-Pic validation dataset, measured by Aesthetics, ImageReward, and HPSv2. Over the iterations, SAIL improves steadily and ultimately surpasses DiffusionDPO (indicated by the dashed line).
Figure 3: Illustration of the proposed SAIL framework. The SAIL framework incrementally refines the alignment of diffusion models through iterative cycles consisting of generating new preference data and conducting preference learning using mixup ranked preference data complemented by self-refinement mechanisms. This closed-loop self-boosting process operates with minimal initial data input, aiming to optimize performance by capitalizing on the intrinsic capabilities of the model, independent of external reward systems.
Figure 4: The qualitative results demonstrate the effectiveness of our method.
