Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact

Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao

TL;DR

This paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule, and proposes a novel classifier-free guidance mechanism that smooths the transport between the data distribution and the initial (masked) distribution, resulting in improved sample quality.

Abstract

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and for improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. To improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance improves it. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection in current CFG implementations: they can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism. Intuitively, our method smooths the transport between the data distribution and the initial (masked) distribution, resulting in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. Experiments on conditional image and text generation empirically confirm the efficacy of our method.
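To make the "imbalanced transitions" remark concrete, here is a minimal NumPy sketch of one common way guidance is applied to per-token predictions in a masked diffusion sampler. It is an illustration under stated assumptions, not the authors' implementation: the function name `guided_token_distribution`, the geometric tilt `p_cond**w * p_uncond**(1 - w)`, and the reading of $Z_w$ as the sum of the tilted weights over the vocabulary are all assumptions made for this example and may differ from the paper's exact formulation.

```python
import numpy as np

def guided_token_distribution(p_cond, p_uncond, w, normalize=True):
    """Tilt token probabilities as p_cond**w * p_uncond**(1 - w).

    Hypothetical helper for illustration only. p_cond and p_uncond are
    strictly positive probability vectors over the vocabulary (last axis).
    Returns the (optionally normalized) tilted weights and their total
    mass z_w before normalization.
    """
    p_cond = np.asarray(p_cond, dtype=float)
    p_uncond = np.asarray(p_uncond, dtype=float)
    # Work in log space for numerical stability.
    log_tilt = w * np.log(p_cond) + (1.0 - w) * np.log(p_uncond)
    tilted = np.exp(log_tilt)
    # Total mass of the tilted weights; in this sketch, z_w > 1 means the
    # unnormalized guided weights carry more mass than a probability vector.
    z_w = tilted.sum(axis=-1, keepdims=True)
    if normalize:
        # Assumed form of the normalization step: divide by z_w so the
        # guided weights sum to one again.
        tilted = tilted / z_w
    return tilted, z_w.squeeze(-1)

# Example inputs (made up for illustration).
p_cond = np.array([0.7, 0.2, 0.1])
p_uncond = np.array([0.4, 0.3, 0.3])
raw, z = guided_token_distribution(p_cond, p_uncond, w=2.0, normalize=False)
fixed, _ = guided_token_distribution(p_cond, p_uncond, w=2.0, normalize=True)
```

In this toy setting, `raw` sums to more than one when `w > 1`, which is the kind of excess mass that could translate into faster unmasking if the weights are used as transition rates, while `fixed` sums to one. With `w = 1` the function returns `p_cond` unchanged.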

Paper Structure

This paper contains 32 sections, 12 theorems, 54 equations, 26 figures, 4 tables.

Key Result

Theorem 3.1

(Informal) Along the dynamics of the guided process (eq:guided process), starting from a fully masked state, the distribution at time $t$ is given by:

Figures (26)

  • Figure 1: We propose an improved guidance mechanism based on column normalization. Our method produces sharper images while being more stable with respect to the guidance strength. Notably, it requires only a minor code modification.
  • Figure 2: Unmasking rates as a function of time under guidance. Faster unmasking ($Z_w > 1$) makes numerical solvers perform worse, demonstrating an issue in the existing guidance mechanism.
  • Figure 3: Tilted distributions for varying values of $w$. Large $w$ concentrates mass on one mode.
  • Figure 4: Evolution of the coefficients in Corollary \ref{cor:step-wise-3} for different values of $t_2$, with $t_1 \leq t_2$. For moderate $t_2$, no single coefficient dominates, yielding a balanced target distribution.
  • Figure 5: Notice that when $\omega < \gamma$ the combined distribution doesn't bias the leftmost mode, making this setting less efficient for guidance.
  • ...and 21 more figures

Theorems & Definitions (20)

  • Theorem 3.1
  • Lemma 3.1
  • Corollary 3.1
  • Lemma A.1
  • Lemma B.1
  • proof
  • Theorem B.1
  • proof
  • Corollary B.1
  • proof
  • ...and 10 more