Diffusion Models Need Visual Priors for Image Generation

Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, Luping Zhou

TL;DR

This work proposes Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling.

Abstract

Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the use of class priors, which provide only coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail information from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet-$256 \times 256$ dataset, reducing training cost by 7$\times$ compared to SiT and DiT while achieving even better FID-50K scores. Our largest model, DoD-XL, achieves an FID-50K score of 1.83 with only 1 million training steps, surpassing other state-of-the-art methods without bells and whistles during inference.
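The control flow of the multi-stage sampling described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `compress_prior`, `toy_denoise_step`, and `multi_stage_sample` are hypothetical names, the "denoiser" is a simple pull toward a target value rather than a trained diffusion model, average pooling stands in for the latent embedding module, and restarting each stage from fresh noise is an assumption about how the stages are chained.

```python
import random
from typing import List, Optional


def compress_prior(sample: List[float], keep: int = 4) -> List[float]:
    """Hypothetical stand-in for the latent embedding module: average-pool
    the previous sample into `keep` values, discarding fine detail."""
    chunk = len(sample) // keep
    return [sum(sample[i * chunk:(i + 1) * chunk]) / chunk for i in range(keep)]


def toy_denoise_step(x: List[float], class_label: int,
                     prior: Optional[List[float]]) -> List[float]:
    """Hypothetical denoiser: pulls x halfway toward a class-dependent target,
    blended with the compressed visual prior when one is available."""
    chunk = len(x) // len(prior) if prior else 1
    out = []
    for i, xi in enumerate(x):
        target = float(class_label)  # class prior: coarse, same for every element
        if prior is not None:
            # visual prior: coarse per-region guidance from the previous stage
            target = 0.5 * target + 0.5 * prior[i // chunk]
        out.append(xi + 0.5 * (target - xi))
    return out


def multi_stage_sample(class_label: int, num_stages: int = 3, dim: int = 16,
                       steps: int = 10, seed: int = 0) -> List[List[float]]:
    """Run `num_stages` sampling passes; stage 1 sees only the class label,
    later stages also see a compressed version of the previous output."""
    rng = random.Random(seed)
    prior: Optional[List[float]] = None  # no visual prior in stage 1
    outputs = []
    for _ in range(num_stages):
        # Assumption: every stage restarts from fresh Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(steps):
            x = toy_denoise_step(x, class_label, prior)
        outputs.append(x)
        prior = compress_prior(x)  # becomes the visual prior for the next stage
    return outputs
```

Calling `multi_stage_sample(class_label=3)` returns one sample per stage. The point of the sketch is only the data flow: the previous stage's output is compressed before it is reused, so the next stage receives coarse guidance rather than a full copy of the earlier sample.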

Paper Structure

This paper contains 15 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Selected samples generated by the second stage of DoD-XL. By training for only $1$ million steps on the ImageNet-$256\times256$ dataset, DoD-XL achieves state-of-the-art image quality.
  • Figure 2: Illustration of (a) Multi-Stage Sampling and (b) Latent Embedding Module. We only present the first two stages of DoD, while subsequent stages are derived from the second one.
  • Figure 3: Left: rFID score with different CFG scales. A large CFG scale is required in DoD to effectively leverage visual priors. Right: FID score comparison across different stages. Visual priors are not available in Stage 1, while both Stages 2 and 3 utilize the samples from the previous stage as visual priors.
  • Figure 4: Qualitative Results. We present images generated by DoD-XL with $1M$ training steps. Across all stages, semantic information is preserved, and image quality improves progressively.
  • Figure 5: Images generated by DoD-B.
  • ...and 1 more figure