Generating on Generated: An Approach Towards Self-Evolving Diffusion Models
Xulu Zhang, Xiaoyong Wei, Jinlin Wu, Jiaxin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li
TL;DR
RSIDiff addresses the degeneration observed when diffusion models are self-trained on synthetic data, and identifies two root causes: a lack of perceptual alignment and the accumulation of generative hallucinations. It introduces three components to enable robust recursive self-improvement: a prompt construction pipeline that enhances perceptual alignment, a preference sampling mechanism that filters out hallucinations, and a distribution-based weighting scheme that curbs out-of-distribution errors. In experiments with Stable Diffusion 1.4 and Stable Diffusion 3 (SD3) on the PartiPrompts and HPD datasets, RSIDiff consistently improves human-aligned metrics; gains peak around training round 6 and then diminish, which is attributed to residual errors. The work demonstrates a practical path toward self-evolving diffusion models that are more robust and better aligned with human preferences.
Abstract
Recursive Self-Improvement (RSI) enables intelligent systems to autonomously refine their capabilities. This paper explores the application of RSI to text-to-image diffusion models, addressing the challenge of training collapse caused by synthetic data. We identify two key factors contributing to this collapse: the lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate these issues, we propose three strategies: (1) a prompt construction and filtering pipeline designed to facilitate the generation of perceptually aligned data, (2) a preference sampling method to identify human-preferred samples and filter out generative hallucinations, and (3) a distribution-based weighting scheme to penalize selected samples with hallucinatory errors. Our extensive experiments validate the effectiveness of these approaches.
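Strategies (2) and (3) can be illustrated with a minimal sketch of how one self-training round might assemble its data. This is not the paper's implementation: the function names, the top-fraction selection rule, the Euclidean distance to a reference feature mean, and the exponential weighting are all assumptions chosen for illustration. The preference scores and feature vectors are assumed to come from external models that are not shown here.

```python
import math
from typing import List, Sequence, Tuple


def select_preferred(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the top `keep_ratio` fraction of samples by score.

    Assumption: `scores` are preference scores from some external
    human-preference model; higher means more preferred.
    """
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(((s, i) for i, s in enumerate(scores)), reverse=True)
    return sorted(i for _, i in ranked[:k])


def distribution_weights(
    features: Sequence[Sequence[float]],
    ref_mean: Sequence[float],
    temperature: float = 1.0,
) -> List[float]:
    """Assign smaller training weights to samples far from a reference mean.

    Assumption: Euclidean distance with exponential decay stands in for
    whatever distribution-based penalty the paper actually uses.
    """
    raw = []
    for f in features:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f, ref_mean)))
        raw.append(math.exp(-dist / temperature))
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def build_round(
    samples: Sequence[str],
    scores: Sequence[float],
    features: Sequence[Sequence[float]],
    ref_mean: Sequence[float],
    keep_ratio: float = 0.5,
    temperature: float = 1.0,
) -> List[Tuple[str, float]]:
    """Filter by preference, then weight the kept samples by distance."""
    kept = select_preferred(scores, keep_ratio)
    weights = distribution_weights(
        [features[i] for i in kept], ref_mean, temperature
    )
    return [(samples[i], w) for i, w in zip(kept, weights)]


if __name__ == "__main__":
    # Toy data: four generated samples with hypothetical scores and features.
    batch = build_round(
        samples=["a", "b", "c", "d"],
        scores=[0.1, 0.9, 0.5, 0.7],
        features=[[9.0, 9.0], [0.0, 0.0], [1.0, 1.0], [3.0, 4.0]],
        ref_mean=[0.0, 0.0],
    )
    for name, weight in batch:
        print(name, round(weight, 4))
```

In this toy run, samples "b" and "d" survive the preference filter, and "d" receives a much smaller weight because its features lie farther from the reference mean. The weights would then scale each sample's contribution to the fine-tuning loss.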
