Compress Guidance in Conditional Diffusion Sampling

Anh-Dung Dinh, Daochang Liu, Chang Xu

TL;DR

This work identifies and quantifies the model-fitting problem in guided diffusion sampling, where samples are tuned to the classifier's parameters rather than to the intended condition. It shows that reducing or excluding guidance at many timesteps mitigates the problem, and proposes Compress Guidance, a method that drops a substantial number of guidance timesteps while still exceeding baseline models in image quality.

Abstract

We found that enforcing guidance throughout the sampling process is often counterproductive due to the model-fitting issue, where samples are 'tuned' to match the classifier's parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing a small amount of guidance over a large number of sampling timesteps, we observe a significant improvement in image quality and diversity while also reducing the required guidance timesteps by nearly 40%. This approach addresses a major challenge in applying guidance effectively to generative tasks. Consequently, our proposed method, termed Compress Guidance, allows for the exclusion of a substantial number of guidance timesteps while still surpassing baseline models in image quality. We validate our approach through benchmarks on label-conditional and text-to-image generative tasks across various datasets and models.
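
The idea of applying guidance at only a subset of sampling timesteps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `skew` parameter, and the fallback used on skipped steps are all assumptions made for illustration. It shows one way to pick guided steps either uniformly or concentrated toward the early part of sampling, and how a classifier-free guidance combination might be gated per step.

```python
from typing import List


def select_guidance_steps(total_steps: int, num_guided: int, skew: float = 1.0) -> List[int]:
    """Pick which sampling steps receive guidance (hypothetical helper).

    Steps are indexed in sampling order: 0 is the first (noisiest) step.
    skew = 1.0 spreads the guided steps uniformly; skew > 1.0 packs them
    toward the early part of sampling.
    """
    if not 0 < num_guided <= total_steps:
        raise ValueError("num_guided must be in [1, total_steps]")
    if num_guided == 1:
        return [0]
    chosen = set()
    for k in range(num_guided):
        frac = (k / (num_guided - 1)) ** skew  # relative position in [0, 1]
        chosen.add(round(frac * (total_steps - 1)))
    # Rounding can merge neighbouring positions, so fewer than num_guided
    # steps may be returned when skew is large.
    return sorted(chosen)


def guided_noise(eps_uncond: float, eps_cond: float, scale: float, use_guidance: bool) -> float:
    """Standard classifier-free guidance combination, gated per step.

    On skipped steps this sketch falls back to the conditional prediction;
    that fallback is an assumption, not something the source specifies.
    """
    if not use_guidance:
        return eps_cond
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Guidance at 10 of 50 steps, uniform versus early-weighted.
uniform = select_guidance_steps(50, 10)
early = select_guidance_steps(50, 10, skew=2.0)
```

In a sampling loop, a step `i` would call `guided_noise(..., use_guidance=(i in guided))` with `guided = set(uniform)`. The scalar noise values stand in for the tensors a real model would return.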

Paper Structure (20 sections, 3 theorems, 19 equations, 15 figures, 11 tables)

Key Result

Theorem 1

Assume that $\epsilon_{\theta}$ is trained to converge and the real data density function $q(\mathbf{x}_0)$ satisfies a form of Gaussian distribution. The process of recurrent sampling $\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1}|\mathbf{x}_t, \tilde{\mathbf{x}}_0)$ from $T$ to $0$ is equivalent to min

Figures (15)

  • Figure 1: Stable Diffusion with classifier-free guidance. Left: vanilla classifier-free guidance applied at all 50 timesteps. Right: the proposed Compress Guidance method, which applies guidance at only 10 of the 50 steps. The outputs show that our method outperforms classifier-free guidance in image quality, quantitative performance, and efficiency. Efficiency is measured as the time to generate 30,000 images on 1 GPU.
  • Figure 2: (left) OADM-C, (right) ResNet152 off-sampling loss. The on-sampling loss converges very early, while the off-sampling loss converges only at the end of the denoising process.
  • Figure 3: ImageNet256x256 sampled by ADM-G [dhariwal2021diffusion]. The top row is vanilla guidance, where every timestep receives guidance. The second and third rows are our proposed method, which applies guidance at only 35 timesteps. The second row distributes the guided timesteps uniformly, while the third row concentrates them toward the early stage of the sampling process. Compress Guidance performs significantly better than the original guidance method. Each blue stick marks one guidance step.
  • Figure 4: G denotes vanilla guidance, UG the uniform skipping scheme, and ES the early stopping scheme. The graph shows that UG suffers from the non-convergence problem and ES suffers from the forgetting problem.
  • Figure 5: Qualitative results on ImageNet256x256. Left: vanilla guidance applied at all timesteps. Right: Compress Guidance applied at 50 out of 250 timesteps. Compress Guidance reduces over-emphasized features, correcting weird and incorrect details. Further results are in Appendix \ref{app:qual}.
  • ...and 10 more figures

Theorems & Definitions (10)

  • Theorem 1
  • proof
  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 2
  • Theorem 3
  • proof
  • proof
  • proof