Generating on Generated: An Approach Towards Self-Evolving Diffusion Models

Xulu Zhang, Xiaoyong Wei, Jinlin Wu, Jiaxin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li

TL;DR

RSIDiff tackles the degeneration observed when diffusion models are self-trained on synthetic data, identifying two root causes: a lack of perceptual alignment and the accumulation of generative hallucinations. It introduces three components to enable robust recursive self-improvement: a prompt construction pipeline to enhance perceptual alignment, a preference sampling mechanism to filter hallucinations, and a distribution-based weighting scheme to curb out-of-distribution errors. In experiments with Stable Diffusion 1.4 and SD3 on the PartiPrompts and HPD datasets, RSIDiff yields consistent improvements in human-aligned metrics, peaking around training round 6, after which gains diminish due to residual errors. The work demonstrates a practical path toward self-evolving diffusion models that are more robust and better aligned with human preferences.

Abstract

Recursive Self-Improvement (RSI) enables intelligent systems to autonomously refine their capabilities. This paper explores the application of RSI in text-to-image diffusion models, addressing the challenge of training collapse caused by synthetic data. We identify two key factors contributing to this collapse: the lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate these issues, we propose three strategies: (1) a prompt construction and filtering pipeline designed to facilitate the generation of perceptually aligned data, (2) a preference sampling method to identify human-preferred samples and filter out generative hallucinations, and (3) a distribution-based weighting scheme to penalize selected samples with hallucinatory errors. Our extensive experiments validate the effectiveness of these approaches.
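
To make strategies (2) and (3) concrete, the following is a minimal sketch of how one training round might select and weight self-generated samples. It is an illustration under stated assumptions, not the paper's implementation: the function names (`select_preferred`, `distribution_weights`, `weighted_loss`), the top-fraction selection rule, and the Gaussian-shaped weight are all hypothetical choices. The preference scores are assumed to come from some automatic human-preference metric, and each sample is assumed to be summarized by a single scalar feature whose reference mean and standard deviation are known.

```python
import math


def select_preferred(scores, keep_ratio=0.25):
    """Return indices of the highest-scoring samples, best first.

    Assumption: `scores` are higher-is-better outputs of an automatic
    preference metric; keeping a fixed top fraction is a hypothetical rule.
    """
    if not scores:
        return []
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def distribution_weights(features, ref_mean, ref_std):
    """Down-weight samples whose feature lies far from a reference distribution.

    Assumption: a Gaussian-shaped weight exp(-z^2 / 2) stands in for the
    distribution-based weighting; the paper's exact formula may differ.
    """
    if ref_std <= 0:
        raise ValueError("ref_std must be positive")
    weights = []
    for f in features:
        z = (f - ref_mean) / ref_std  # distance from the reference mean, in std units
        weights.append(math.exp(-0.5 * z * z))
    return weights


def weighted_loss(losses, weights):
    """Weighted average of per-sample losses (weights need not sum to 1)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(l * w for l, w in zip(losses, weights)) / total


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.7]        # hypothetical preference scores
    keep = select_preferred(scores, keep_ratio=0.5)
    features = [0.0, 2.0]                # hypothetical scalar features of kept samples
    w = distribution_weights(features, ref_mean=0.0, ref_std=1.0)
    print(keep, w, weighted_loss([1.0, 3.0], w))
```

In this sketch an in-distribution sample receives a weight near 1 and contributes fully to the fine-tuning loss, while a sample two standard deviations from the reference contributes far less, which is one simple way to limit the influence of selected samples that still carry hallucinatory errors.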

Paper Structure

This paper contains 19 sections, 6 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: We introduce RSIDiff, a novel approach that enhances the performance of diffusion models through recursive self-training. By iteratively refining the model with its own generated data, RSIDiff produces images of noteworthy aesthetic quality.
  • Figure 2: Degeneration in RSI. We observe a severe domain shift and decline in image fidelity when fine-tuning diffusion models with self-generated data. The diffusion model gradually loses the ability to generate fine-grained details.
  • Figure 3: Framework of RSIDiff. (a) We crawl prompts from a user-active image synthesis website and filter them based on clarity, specificity, and diversity; (b) we employ preference sampling, which uses automatic metrics to identify human-preferred images; (c) we use the distribution-based weighting strategy to penalize out-of-distribution samples; and (d) we fine-tune the diffusion model with the selected samples and start a new training round.
  • Figure 4: Examples generated by SD v1.4 with a simple prompt (left) and our filtered prompt (right). The latter produces a more visually appealing image.
  • Figure 5: Quantitative Results. Performance comparison with the base model and the SFT method across two datasets and four evaluation metrics. The results show that RSIDiff significantly outperforms the base model and achieves consistent improvements.
  • ...and 11 more figures
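
The prompt filtering step in Figure 3(a) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name `filter_prompts`, the use of a minimum word count as a crude proxy for specificity, and the use of Jaccard similarity over word sets to enforce diversity. It does not attempt to measure clarity, and it is not the filtering pipeline described in the paper.

```python
def word_set(prompt):
    """Lowercased set of whitespace-separated tokens in a prompt."""
    return set(prompt.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_prompts(prompts, min_words=5, max_similarity=0.6):
    """Keep prompts that are long enough and not near-duplicates of kept ones.

    Assumptions: `min_words` is a rough stand-in for specificity, and
    `max_similarity` is a hypothetical diversity threshold.
    """
    kept, kept_sets = [], []
    for prompt in prompts:
        if len(prompt.split()) < min_words:
            continue  # too short to describe a specific scene
        tokens = word_set(prompt)
        if any(jaccard(tokens, seen) > max_similarity for seen in kept_sets):
            continue  # too similar to a prompt that was already kept
        kept.append(prompt)
        kept_sets.append(tokens)
    return kept


if __name__ == "__main__":
    crawled = [
        "a cat",
        "a red fox running through snowy forest at dawn",
        "a red fox running through snowy forest at dusk",
        "an astronaut riding a horse on the moon, oil painting",
    ]
    for p in filter_prompts(crawled):
        print(p)
```

With these example inputs, the two-word prompt is dropped for being too short and the second fox prompt is dropped as a near-duplicate, leaving one fox prompt and the astronaut prompt.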