Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans

Abstract

It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling to a handful of steps. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.

Paper Structure (33 sections, 24 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: D-MMD text generators match or outperform their teacher using fewer function evaluations.
  • Figure 2: Excerpt of a random 1024-token sample generated using 16-step Masked D-MMD, not cherry picked.
  • Figure 3: The perplexity metric keeps improving with lower temperature sampling while the grad moment eventually degrades.
  • Figure 4: FID performance vs sampling temperature or top $p$ value in posterior sampling.
  • Figure 5: FID performance vs teacher temperature while MMD'ing the student model.
  • ...and 1 more figure