Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

TL;DR

The paper addresses the inefficiency of sampling in discrete masked diffusion by analyzing the MaskGIT sampler. It shows that MaskGIT implicitly performs temperature sampling and proposes the moment sampler, an asymptotically equivalent choose-then-sample (CTS) alternative that is easier to interpret. Two practical improvements to CTS samplers are introduced: partial caching for transformers and a hybrid exploration-exploitation strategy for adaptive unmasking. Experiments on image and language tasks support the theory, showing that the moment sampler closely mirrors MaskGIT in performance and that the hybrid approach yields speedups and improved trade-offs.

Abstract

Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the "moment sampler," an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a "choose-then-sample" approach by selecting unmasking positions before sampling tokens. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
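The difference in ordering that the abstract describes (sample tokens everywhere and then keep some, versus choose positions first and then sample only there) can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the paper's algorithms: `maskgit_step` follows the commonly described MaskGIT confidence rule (log-probability of the sampled token plus scaled Gumbel noise), while `choose_then_sample_step` takes a hypothetical `scores` argument, here filled with negative entropy purely as a placeholder. The paper's moment sampler defines its own position scores and token distribution, which are not reproduced here.

```python
import numpy as np

def maskgit_step(probs, k, temperature, rng):
    """Sample-then-choose: draw a token at every masked position,
    then keep the k positions with the highest noisy confidence."""
    n, vocab = probs.shape
    tokens = np.array([rng.choice(vocab, p=probs[i]) for i in range(n)])
    # Confidence = log-probability of the sampled token + scaled Gumbel noise.
    confidence = np.log(probs[np.arange(n), tokens]) + temperature * rng.gumbel(size=n)
    keep = np.argsort(-confidence)[:k]
    return keep, tokens[keep]

def choose_then_sample_step(probs, k, scores, rng):
    """Choose-then-sample: pick k positions from token-independent scores
    (Gumbel-perturbed top-k), then draw tokens only at those positions."""
    n, vocab = probs.shape
    keep = np.argsort(-(scores + rng.gumbel(size=n)))[:k]
    tokens = np.array([rng.choice(vocab, p=probs[i]) for i in keep])
    return keep, tokens

# Toy setting: five masked positions, vocabulary of eight tokens, unmask two.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=5)  # stand-in for model predictions

# Placeholder position score (an assumption): negative entropy per position.
neg_entropy = np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=1)

print(maskgit_step(probs, k=2, temperature=1.0, rng=rng))
print(choose_then_sample_step(probs, k=2, scores=neg_entropy, rng=rng))
```

In the first function the choice of positions depends on which tokens happened to be sampled; in the second it depends only on the per-position scores, so tokens at unchosen positions never need to be drawn.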

Paper Structure

This paper contains 38 sections, 7 theorems, 53 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

Suppose we are given $\mu_1,\ldots,\mu_N\in\mathbb{R}$ and i.i.d. standard Gumbel noise $\xi_1,\ldots,\xi_N$. Let $(i^*_1,\ldots,i^*_k) = \operatorname*{arg\,top-\mathit{k}}_{i\in[N]}\{\mu_i+\xi_i\}$. Then, for distinct indices $i_1,\ldots,i_\ell \in [N]$ with $\ell\le k$, we have $\mathbb{P}\!\left(
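The Gumbel-top-$k$ trick can be checked numerically. The sketch below assumes the standard form of the result (Kool et al., 2019): the ordered top-$k$ indices of $\mu_i+\xi_i$ are distributed like sequential sampling without replacement with probabilities proportional to $\exp(\mu_i)$. The function names, the example values of $\mu$, and the target pair are illustrative choices, not taken from the paper.

```python
import numpy as np

def gumbel_top_k(mu, k, n_draws, rng):
    """Return an (n_draws, k) array; each row holds the indices of the k
    largest values of mu_i + xi_i, ordered from largest to smallest."""
    noise = rng.gumbel(size=(n_draws, mu.size))
    return np.argsort(-(mu + noise), axis=1)[:, :k]

def sequential_prob(mu, indices):
    """Probability of drawing `indices` in this order when sampling without
    replacement with weights exp(mu_i)."""
    weights = np.exp(mu - mu.max())  # shift for numerical stability
    remaining = weights.sum()
    prob = 1.0
    for i in indices:
        prob *= weights[i] / remaining
        remaining -= weights[i]
    return prob

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
target = (0, 1)  # event: index 0 ranks first and index 1 ranks second

draws = gumbel_top_k(mu, k=2, n_draws=100_000, rng=rng)
empirical = np.mean((draws[:, 0] == target[0]) & (draws[:, 1] == target[1]))
exact = sequential_prob(mu, target)

print(f"empirical={empirical:.4f}  exact={exact:.4f}")
```

With 100,000 draws the empirical frequency should agree with the closed-form probability to roughly two decimal places.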

Figures (4)

  • Figure 1: Overview of our contributions. Both samplers determine tokens at two out of five positions, but differ in the order of position selection and token sampling. While they are asymptotically equivalent, as we show in Theorem 2, the moment sampler belongs to the family of "choose-then-sample" methods, which we further enhance in two ways as described in the method section.
  • Figure 2: Illustration of partial caching approximation applied to an $L$-layer transformer, where $\sigma=\mathop\mathrm{softmax}(\cdot/\sqrt{d_k})$, with $d_k$ being the dimension of key and query vectors.
  • Figure 3: Fréchet Inception Distance (FID, $\downarrow$) and Inception Score ($\uparrow$) against the number of steps for various samplers with MAGE. Both metrics were computed from 50,000 generated images.
  • Figure 6: Additional experimental results. (Left) Generative Perplexity of various samplers with temperature sampling. (Right) Generative Perplexity of our proposed samplers against sampling time per batch on H100 GPU.

Theorems & Definitions (13)

  • Proposition 1: Gumbel-top-$k$ trick (Kool et al., 2019)
  • Theorem 2: Moment sampler approximates MaskGIT in the $N\gg k^2$ regime
  • Proposition 3: One-by-one CTS algorithm is unbiased
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Theorem 7
  • proof
  • proof
  • proof
  • ...and 3 more