Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

TL;DR

Di$\mathtt{[M]}$O tackles the slow inference of Masked Diffusion Models by distilling a multi-step teacher into a one-step generator. It introduces token-level distribution matching within an on-policy framework, uses an auxiliary model to approximate the unknown distribution of the student's outputs, and adopts a hybrid token initialization to inject entropy. The method supports generalized $f$-divergences (e.g., the Generalized Jeffrey Divergence) and demonstrates competitive one-step performance on both class-conditional ImageNet generation and text-to-image generation. This yields substantial speedups for discrete diffusion while maintaining high fidelity and diversity, facilitating real-time or resource-constrained generation with discrete token vocabularies.

Abstract

Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference that requires several sampling steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes the model output logits within an on-policy framework with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher's training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, achieving performance competitive with multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.

Paper Structure

This paper contains 27 sections, 18 equations, 19 figures, 8 tables, 1 algorithm.

Figures (19)

  • Figure 1: Unlike continuous Diffusion Models (DMs), which have been successfully distilled into one-step generators with performance competitive with the teacher, distilling Masked Diffusion Models (MDMs) into a one-step generator remains a challenge. In this paper, we propose the first one-step distillation method for MDMs, Di$\mathtt{[M]}{}$O, applied, e.g., to the recent text-to-image masked diffusion model Meissonic [bai2024meissonic]. We demonstrate that Di$\mathtt{[M]}{}$O can successfully distill the teacher model into a one-step generator, achieving competitive performance both quantitatively and qualitatively, while the teacher model's performance deteriorates rapidly with fewer generation steps: compare our one-step results (large, bottom images) with the 4-step teacher outputs (right corner of each image).
  • Figure 2: Reverse process of MDM. With a masked sequence $x_t$ as input, the MDM independently outputs logits $z_\phi^i$ at each masked position $i$, which are then used to sample the new tokens $x_\phi^i$ as the model prediction. In the next state $x_s$, each masked token in $x_t$ is replaced by $x_\phi^i$ with probability $(r_t-r_s)/r_t$.
  • Figure 3: Di$\mathtt{[M]}{}$O Pipeline. Our method distills a costly multi-step MDM teacher into a one-step generator. Given $x_{\text{init}}$ sampled using our proposed token initialization strategy, the one-step generator (student $\theta$) produces logits $z_\theta$, from which an image token sequence $x_{\theta}=\{x_{\theta}^i\}_{i=1}^{L}$ is sampled. These tokens are then passed through a forward mask diffusion process to obtain an intermediate state $\tilde{x}_t$. For each intermediate state $\tilde{x}_t$, we update the one-step generator $\theta$ and the auxiliary model $\psi$ alternately: the one-step generator is optimized by minimizing the conditional divergence ${{D}(p_\phi||p_\psi)(\tilde{x}_t)}$ at the token level, while the auxiliary model is trained with a cross-entropy loss to model the distribution of generated tokens $x_{\theta}$ and to form the gradient used to update $\theta$. The teacher $\phi$ is frozen during training.
  • Figure 4: Visual results on ImageNet. One-step images from generators trained with different $r_{\text{init}}$, compared with teacher generation using 16 sampling steps. The class labels of the samples from top to bottom are 388, 979, and 207, respectively.
  • Figure 5: Ablation studies on ImageNet using FID as the evaluation metric. $^*$ indicates that training collapsed and the result falls outside the range comparable with the other results; these cases are shown in the sub-figures at the upper right corner, with the same x-axis range.
  • ...and 14 more figures
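
The reverse step described in the Figure 2 caption can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the names `MASK`, `sample_token`, and `reverse_step` are hypothetical, the mask token id is an arbitrary choice, and the model's logits are assumed to have already been turned into one predicted token per position. The only rule taken from the caption is that each masked token is replaced by its prediction with probability $(r_t-r_s)/r_t$.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [M] token


def sample_token(logits, rng):
    """Sample one token id from a list of logits via a stable softmax."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def reverse_step(x_t, x_pred, r_t, r_s, rng):
    """One reverse step x_t -> x_s, following the Figure 2 caption.

    x_t    : list of token ids, with MASK at masked positions
    x_pred : predicted token ids x_phi^i, one per position
    r_t    : mask ratio of the current state
    r_s    : mask ratio of the next state, with 0 <= r_s < r_t <= 1
    """
    if not 0.0 <= r_s < r_t <= 1.0:
        raise ValueError("expected 0 <= r_s < r_t <= 1")
    p_unmask = (r_t - r_s) / r_t  # per-token unmasking probability
    x_s = []
    for tok, pred in zip(x_t, x_pred):
        if tok == MASK and rng.random() < p_unmask:
            x_s.append(pred)  # reveal the predicted token
        else:
            x_s.append(tok)   # keep unmasked tokens and still-masked slots
    return x_s


rng = random.Random(0)
x_t = [5, MASK, MASK, 7]
x_pred = [1, 2, 3, 4]
# With r_s = 0 the unmasking probability is 1, so every mask is filled.
print(reverse_step(x_t, x_pred, r_t=1.0, r_s=0.0, rng=rng))
```

Setting `r_s = 0` in a single call corresponds to revealing every masked token at once, which is the regime a one-step generator operates in; a multi-step teacher would instead call `reverse_step` repeatedly with a decreasing sequence of mask ratios, re-predicting the remaining masked positions each time.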