You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs

Yihong Luo; Xiaolong Chen; Xinghua Qu; Tianyang Hu; Jing Tang

TL;DR

YOSO addresses the slow sampling of diffusion models by enabling high-quality one-step image synthesis through a self-cooperative diffusion-GAN framework, in which the denoising generator itself smooths the adversarial divergence. The method can be trained from scratch, and it extends to one-step text-to-image synthesis on top of pre-trained models through a latent perceptual loss, a latent discriminator, informative prior initialization (IPI), and a quick adaption stage that fixes the flawed noise scheduler. It achieves state-of-the-art one-step performance even with LoRA fine-tuning. Empirical results on CIFAR-10 and text-to-image benchmarks demonstrate rapid convergence, strong mode coverage, and robustness, including zero-shot 1024-resolution generation. The approach supports downstream tasks such as image editing, is compatible with ControlNet, and comes with released code.

Abstract

Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). However, existing works in this line suffer from either training instability and mode collapse or subpar efficiency in learning one-step generation. To address these issues, we introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, we smooth the adversarial divergence by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model trained from scratch with competitive performance. Moreover, we extend YOSO to one-step text-to-image generation based on pre-trained models through several effective training techniques: a latent perceptual loss and a latent discriminator for efficient training with latent DMs, informative prior initialization (IPI), and a quick adaption stage for fixing the flawed noise scheduler. Experimental results show that YOSO achieves state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that YOSO-PixArt-$\alpha$, trained at 512 resolution, can generate images in one step and can adapt to 1024 resolution without extra explicit training, requiring only ~10 A800 days for fine-tuning. Our code is provided at https://github.com/Luo-Yihong/YOSO.
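For readers unfamiliar with the Low-Rank Adaptation (LoRA) fine-tuning mentioned in the abstract, the following is a minimal NumPy sketch of the general LoRA idea: a frozen weight matrix plus a trainable low-rank update. It is not taken from the YOSO code base. The function name `lora_forward`, the layer sizes, the rank, and the scaling value `alpha` are all illustrative assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank update (generic LoRA, illustrative only).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    Effective weight: W + (alpha / r) * B @ A
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0

W = rng.normal(size=(d_out, d_in))       # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init
x = rng.normal(size=(4, d_in))

# With B = 0 the adapted layer reproduces the frozen layer exactly.
y_init = lora_forward(x, W, A, B, alpha)

# After B has been updated, the layer equals one with a merged weight.
B_trained = rng.normal(size=(d_out, r))
y_adapted = lora_forward(x, W, A, B_trained, alpha)
W_merged = W + (alpha / r) * B_trained @ A
```

Only `A` and `B` would receive gradients in such a setup, so the number of trainable parameters is `r * (d_in + d_out)` instead of `d_in * d_out`.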

Paper Structure (29 sections, 1 theorem, 8 equations, 12 figures, 7 tables, 1 algorithm)

Introduction
Background
Method: Self-Cooperative Diffusion GANs
Our Design
Try It On CIFAR-10 Before Scaling Up For Saving Money!
Training Strategies
Empirical Evaluation
Towards One-Step Text-to-Image Synthesis
Using Pre-trained Models for Training
Fixing the Noise Scheduler
Experiments
Text-to-Image Generation
Zero-Shot One-Step 1024 Resolution Generation
Ablation Studies
Application
...and 14 more sections

Key Result

Proposition 1

The optimal solution of the cooperative adversarial loss reaches $p^{(T)}_\theta({\mathbf{x}}) = p_d({\mathbf{x}})$.

Figures (12)

Figure 1: One-step generated images by YOSO under different configurations (Bottom). The model is trained by fine-tuning PixArt-$\alpha$ (chen2024pixartalpha) on 512 resolution with our proposed algorithm. Bottom Left is generated by YOSO adapting to 1024 resolution with Eq. \ref{eq:linear_comb} without extra explicit training.
Figure 2: Samples by YOSO-LoRA with one-step inference from different initialization.
Figure 3: Predicting $\epsilon$ fails.
Figure 3: Ablation study on CIFAR-10 with smaller backbone.
Figure 4: Qualitative comparisons of YOSO against competing methods. NFE denotes the Number of Function Evaluations.
...and 7 more figures

Theorems & Definitions (1)

Proposition 1
