SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
Xiaoxuan He, Siming Fu, Wanli Li, Zhiyuan Li, Dacheng Yin, Kang Rong, Fengyun Rao, Bo Zhang
TL;DR
The paper addresses the challenge of aligning diffusion models with human preferences when large-scale human data or external reward models are impractical to obtain. It introduces SAIL, a self-amplified iterative learning framework in which the model bootstraps from a small seed set of human-annotated preference pairs and then generates, self-annotates, and refines on its own outputs in a closed loop, aided by a ranked preference mixup strategy to prevent catastrophic forgetting. The key contributions are the first implicit self-rewarding alignment framework, a mixup-based data strategy for stable self-improvement, and extensive experiments showing that SAIL outperforms state-of-the-art methods on multiple benchmarks while using only 6% of the preference data required by existing approaches. The results indicate that diffusion models possess latent self-improvement capabilities that, when properly harnessed, can reduce reliance on large-scale human annotation and external reward models, enabling more scalable and robust alignment. The work suggests promising extensions to other modalities (e.g., video) and positions SAIL as a practical pathway toward bias-resistant, data-efficient alignment in real-world applications.
Abstract
Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. \textit{This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves?} In this paper, we propose \textbf{SAIL} (\textbf{S}elf-\textbf{A}mplified \textbf{I}terative \textbf{L}earning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6\% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
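The closed loop described in the abstract (generate, self-annotate, mix with the human seed set, refine) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" is a single scalar, `toy_score` stands in for whatever self-scoring the model uses, and `mix_with_seed` is a plain replay-style blend standing in for the ranked preference mixup, whose exact form the abstract does not specify. All names and parameters (`Pair`, `seed_fraction`, `run_loop`, and so on) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    prompt: str
    chosen: float    # toy "sample" preferred in this pair
    rejected: float  # toy "sample" dispreferred in this pair
    source: str      # "human" for seed pairs, "self" for self-annotated pairs


def toy_score(sample):
    """Hypothetical stand-in for the model's own preference score."""
    return sample


def self_annotate(prompt, candidates, score):
    """Rank candidates by the model's own score; best vs. worst forms a pair."""
    ranked = sorted(candidates, key=score, reverse=True)
    return Pair(prompt, ranked[0], ranked[-1], "self")


def mix_with_seed(self_pairs, seed_pairs, seed_fraction, rng):
    """Blend self-annotated pairs with replayed human seed pairs.

    Assumption: a fixed replay ratio, used here only to illustrate anchoring
    the loop to the initial human priors.
    """
    n_seed = round(seed_fraction * len(self_pairs))
    replay = [rng.choice(seed_pairs) for _ in range(n_seed)]
    batch = list(self_pairs) + replay
    rng.shuffle(batch)
    return batch


def run_loop(prompts, seed_pairs, rounds=3, n_candidates=4,
             seed_fraction=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # toy "model": one scalar parameter
    history = []
    for _ in range(rounds):
        # 1) Generate candidates from the current model for every prompt.
        # 2) Self-annotate: the model's own score picks chosen vs. rejected.
        self_pairs = []
        for prompt in prompts:
            candidates = [theta + rng.gauss(0.0, 1.0)
                          for _ in range(n_candidates)]
            self_pairs.append(self_annotate(prompt, candidates, toy_score))
        # 3) Mix self-annotated pairs with the human seed set.
        batch = mix_with_seed(self_pairs, seed_pairs, seed_fraction, rng)
        # 4) Toy refinement step: move toward chosen, away from rejected.
        step = sum(p.chosen - p.rejected for p in batch) / len(batch)
        theta += 0.1 * step
        history.append((theta, len(batch)))
    return history


if __name__ == "__main__":
    seeds = [Pair("a cat", 1.0, 0.0, "human"), Pair("a dog", 0.5, -0.5, "human")]
    for theta, batch_size in run_loop(["p1", "p2", "p3", "p4"], seeds):
        print(f"theta={theta:.3f} batch_size={batch_size}")
```

In a real setting the scalar update would be replaced by a preference-optimization step on the diffusion model, and the scoring and mixing rules would follow the paper's definitions rather than these placeholders.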
