The Diffusion Duality, Chapter II: $Ψ$-Samplers and Efficient Curriculum

Justin Deschenaux, Caglar Gulcehre, Subham Sekhar Sahoo

TL;DR

A family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. The results call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling.

Abstract

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR-10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video tutorial at: https://s-sahoo.com/duo-ch2

Paper Structure (86 sections, 8 theorems, 62 equations, 14 figures, 14 tables, 3 algorithms)

Key Result

Proposition B.1

Let $U^{(1)} \geq U^{(2)} \geq \dots \geq U^{(K)}$ denote the order statistics of $K$ i.i.d. uniform random variables $\mathcal{U}([0, \theta])$ with cumulative distribution function (CDF) $F_U$. Suppose that $u \in [0, 1]$, then $F_U(u) = \frac{u}{\theta}$. Then, the CDF $F_{U^{(1)}}$ and probability density f
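As an illustration of the quantity this proposition concerns, the sketch below uses the standard textbook fact that the maximum of $K$ i.i.d. $\mathcal{U}([0, \theta])$ variables has CDF $(u/\theta)^K$ on $[0, \theta]$, so it can be drawn by inverse-CDF sampling from a single uniform. This is not taken from the truncated statement above; the function name `sample_max_uniform` and the constants are hypothetical choices made for the example.

```python
import numpy as np

def sample_max_uniform(K, theta, size, rng):
    """Draw the maximum of K i.i.d. U([0, theta]) variables by inverse-CDF sampling.

    Assumes the standard result F(u) = (u / theta)**K for the maximum.
    """
    v = rng.random(size)             # V ~ U([0, 1))
    return theta * v ** (1.0 / K)    # solves (u / theta)**K = v for u

rng = np.random.default_rng(0)
K, theta, n = 50, 2.0, 100_000       # illustrative values only

# One uniform per sample, versus drawing all K values and taking the max.
direct = sample_max_uniform(K, theta, n, rng)
brute = rng.uniform(0.0, theta, size=(n, K)).max(axis=1)

# Compare both empirical CDFs with the closed form at one test point.
u = 1.9
cdf_exact = (u / theta) ** K
cdf_direct = np.mean(direct <= u)
cdf_brute = np.mean(brute <= u)
print(cdf_exact, cdf_direct, cdf_brute)
```

The direct sampler needs one random number per draw instead of $K$, which is the kind of saving that matters when $K$ is a vocabulary size.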

Figures (14)

  • Figure 1: Performance on Language Modeling and Image Modeling. $\Psi$-samplers generalize ReMDM (Wang et al., 2025) to arbitrary noise distributions. (Left): Generative perplexity (Gen. PPL; $\downarrow$) as a function of the number of function evaluations (NFEs), with nucleus sampling $p=0.9$. $\Psi$-samplers consistently improve with more steps, unlike ancestral sampling, which plateaus. Curves are annotated with the average unigram entropy per sequence as a proxy for diversity. (Right): On CIFAR-10, $\Psi$-samplers achieve better FID ($\downarrow$) than MDLM (with ReMDM).
  • Figure 2: $\Psi$-samplers combine predictor and corrector steps. The predictor transitions from ${\mathbf z}_t$ to ${\mathbf z}_s$ via ${q_{s|t}}$, but fails to remask tokens in MDMs. The corrector steps inject noise via $q_s$ to revise earlier predictions. For $\kappa_t < 1$, noise injection enables error correction while preserving the forward process marginals. Our framework extends prior PC methods (Campbell et al., 2022; Gat et al., 2024; Wang et al., 2025) to arbitrary priors $\boldsymbol{\pi}$.
  • Figure 3: Efficient Curriculum for USDMs. Duo (Sahoo et al., 2025) replaces discrete lookups with linear combinations of all $K$ embeddings: (1) Gaussian diffusion on one-hot representations, (2) low-temperature $\mathrm{softmax}$, (3) weighted sum. $\text{Duo}^{++}$ exploits the sparsity of the tempered softmax (most weights are effectively zero) and simulates the $k$ largest entries (out of $K$) using order statistics. The approximate normalizer $\tilde{Z}$ admits a closed-form expression. $\text{Duo}^{++}$ uses 33% less memory and trains 25% faster than Duo.
  • Figure 4: Illustration of the possible evolution of $t$ and the associated $\kappa_t$. In practice, we use $\kappa_t$ close to 1 during the PC phase.
  • Figure 5: Polynomial approximation and approximation error, compared to the series approximation, truncated at 150 terms. The degree-$9$ polynomial (left) achieves orders of magnitude lower error than the degree-$5$ polynomial (center) and sigmoid (right) approximations.
  • ...and 9 more figures
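
The marginal-preservation claim in the Figure 2 caption can be checked numerically on a toy example. The sketch below assumes an interpolating forward process $q_t(\cdot \mid x) = \mathrm{Cat}(\alpha_t x + (1 - \alpha_t)\boldsymbol{\pi})$ with a uniform prior, and assumes the corrector enters as a mixture $\kappa\, q_{s|t}(\cdot \mid z_t, x) + (1 - \kappa)\, q_s(\cdot \mid x)$. That mixture form is one simple kernel consistent with the caption, not necessarily the paper's exact definition, and the names `forward_marginal`, `predictor_posterior`, and `pc_kernel` are hypothetical.

```python
import numpy as np

def forward_marginal(x, pi, alpha):
    """q(z | x) = Cat(alpha * x + (1 - alpha) * pi) for a one-hot x."""
    return alpha * x + (1.0 - alpha) * pi

def predictor_posterior(x, pi, alpha_s, alpha_t):
    """Return P[z_t, z_s] = q(z_s | z_t, x), obtained with Bayes' rule.

    Assumes q_t(z_t | x) > 0 for every state, which holds for a uniform
    prior with alpha_t < 1.
    """
    K = len(pi)
    a_ts = alpha_t / alpha_s
    # trans[z_s, z_t] = q(z_t | z_s): keep the token or resample from pi.
    trans = a_ts * np.eye(K) + (1.0 - a_ts) * pi[None, :]
    q_s = forward_marginal(x, pi, alpha_s)
    joint = q_s[:, None] * trans          # joint[z_s, z_t]
    q_t = joint.sum(axis=0)               # equals forward_marginal(x, pi, alpha_t)
    return (joint / q_t[None, :]).T       # indexed [z_t, z_s]

def pc_kernel(x, pi, alpha_s, alpha_t, kappa):
    """Assumed mixture of the predictor posterior and fresh noise from q_s."""
    post = predictor_posterior(x, pi, alpha_s, alpha_t)
    q_s = forward_marginal(x, pi, alpha_s)
    return kappa * post + (1.0 - kappa) * q_s[None, :]

K = 5
pi = np.full(K, 1.0 / K)                  # uniform prior
x = np.eye(K)[2]                          # clean token, one-hot
alpha_s, alpha_t, kappa = 0.7, 0.4, 0.9   # s < t, so alpha_s > alpha_t

kernel = pc_kernel(x, pi, alpha_s, alpha_t, kappa)
q_t = forward_marginal(x, pi, alpha_t)
q_s = forward_marginal(x, pi, alpha_s)

# Push q_t through the kernel; the result should match q_s for any kappa.
marginal_after = q_t @ kernel
print(np.allclose(marginal_after, q_s))
```

Both mixture components marginalize to $q_s(\cdot \mid x)$ when $z_t \sim q_t(\cdot \mid x)$, so any $\kappa \in [0, 1]$ leaves the marginal unchanged; only the correlation between $z_t$ and $z_s$ changes.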

Theorems & Definitions (13)

  • Proposition B.1: Distribution of the largest uniform random variable out of $K$
  • proof
  • Proposition B.2: Conditional Density (Berger2001-hq)
  • Proposition B.3: Joint Density of Order Statistics (Berger2001-hq; proof in kim2021)
  • Proposition B.4: Conditional Distribution of $U^{(i+1)}$ given $U^{(i)}$
  • proof
  • proof
  • Proposition B.5: First Corollary of the Dominated Convergence Theorem (Folland1999-lc, Theorem 2.25)
  • Proposition B.6: Second Corollary of the Dominated Convergence Theorem (Folland1999-lc, Theorem 2.27)
  • Proposition B.7: Series Expansion of the Diffusion Transformation Operator
  • ...and 3 more
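
Propositions B.1 and B.4 concern the largest uniform and the conditional law of $U^{(i+1)}$ given $U^{(i)}$. A minimal sketch of how such results can be used to draw only the top $k$ of $K$ uniforms follows. It relies on the standard fact that, given $U^{(i)} = u$, the remaining $K - i$ variables are i.i.d. $\mathcal{U}([0, u])$, so $U^{(i+1)} = u \cdot V^{1/(K-i)}$ with $V \sim \mathcal{U}([0, 1])$. The function name `sample_top_k_uniform` is hypothetical, and this is not claimed to be the paper's implementation.

```python
import numpy as np

def sample_top_k_uniform(K, k, theta, rng, n=1):
    """Sample the k largest of K i.i.d. U([0, theta]) draws, in descending order.

    Uses k random numbers per sample instead of K: each order statistic is
    the previous one scaled by V ** (1 / number_of_remaining_variables).
    """
    v = rng.random((n, k))                   # V ~ U([0, 1))
    remaining = K - np.arange(k)             # K, K-1, ..., K-k+1
    ratios = v ** (1.0 / remaining)          # broadcast over the last axis
    return theta * np.cumprod(ratios, axis=1)

rng = np.random.default_rng(0)
K, k, theta, n = 100, 3, 1.0, 20_000         # illustrative values only

fast = sample_top_k_uniform(K, k, theta, rng, n)

# Brute-force reference: draw all K values, sort descending, keep the top k.
brute = -np.sort(-rng.uniform(0.0, theta, size=(n, K)), axis=1)[:, :k]

# Known means of uniform order statistics: E[U^(i)] = theta * (K - i + 1) / (K + 1).
expected = theta * (K - np.arange(k)) / (K + 1)
fast_mean = fast.mean(axis=0)
brute_mean = brute.mean(axis=0)
print(expected, fast_mean, brute_mean)
```

The sequential sampler's cost depends on $k$ rather than $K$, which matches the motivation given in the Figure 3 caption for simulating only the largest entries.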