The Diffusion Duality

Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov

TL;DR

The paper reveals a fundamental link between continuous Gaussian diffusion and Uniform-state discrete diffusion, showing that discrete diffusion marginals can be obtained by pushing Gaussian latents through an argmax mapping. By exploiting this Diffusion Duality, the authors transfer Gaussian diffusion techniques to discrete diffusion, enabling a low-variance, faster training regime via curriculum learning and a two-order-of-magnitude speedup in sampling through Discrete Consistency Distillation. Their Duo framework demonstrates competitive zero-shot perplexities and strong sample quality, outperforming prior discrete diffusion methods in low-NFE regimes and approaching autoregressive models on several benchmarks. The work also introduces a Rao-Blackwellized NELBO to reduce training variance and extends to sequence-level diffusion, with extensive ablations and comparisons against state-of-the-art baselines. Overall, Duo provides a practical pathway to rapid, high-quality discrete diffusion for language modeling and highlights a versatile framework for cross-pollinating continuous and discrete diffusion paradigms.

Abstract

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/duo
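
The argmax link described above can be checked numerically. The sketch below is a Monte Carlo illustration, not the authors' implementation. It assumes a variance-preserving Gaussian marginal $\mathbf{w}_t = \tilde{\alpha}_t \mathbf{x} + \sqrt{1 - \tilde{\alpha}_t^2}\,\boldsymbol{\epsilon}$ for a one-hot $\mathbf{x}$, and the function names `argmax_marginal` and `implied_alpha` are hypothetical. If the duality holds, the argmax of the Gaussian latent should put equal mass on every incorrect token, which is the shape of a Uniform-state marginal $\alpha_t \mathbf{x} + (1 - \alpha_t)/K$.

```python
import numpy as np


def argmax_marginal(alpha_tilde, vocab_size, clean_index=0,
                    num_samples=200_000, seed=0):
    """Empirical distribution of argmax(w_t) for a one-hot clean token.

    Assumption (not taken from the summary): w_t = alpha_tilde * x + sigma * eps
    with sigma = sqrt(1 - alpha_tilde**2) and eps ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(vocab_size)
    x[clean_index] = 1.0
    sigma = np.sqrt(1.0 - alpha_tilde ** 2)
    noise = rng.standard_normal((num_samples, vocab_size))
    w = alpha_tilde * x + sigma * noise      # Gaussian latents, one row per sample
    z = w.argmax(axis=1)                     # discrete latents
    return np.bincount(z, minlength=vocab_size) / num_samples


def implied_alpha(probs, clean_index=0):
    """Solve P(clean) = alpha + (1 - alpha) / K for the discrete coefficient alpha."""
    k = probs.shape[0]
    return (k * probs[clean_index] - 1.0) / (k - 1.0)


if __name__ == "__main__":
    for alpha_tilde in (0.0, 0.3, 0.8):
        probs = argmax_marginal(alpha_tilde, vocab_size=8)
        print(alpha_tilde, implied_alpha(probs), probs.round(3))
```

Running the script prints, for each Gaussian coefficient, the implied discrete coefficient and the empirical token distribution. The off-target entries should agree up to sampling noise, and the implied coefficient should grow with $\tilde{\alpha}_t$.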

Paper Structure

This paper contains 92 sections, 2 theorems, 64 equations, 20 figures, 7 tables, 2 algorithms.

Key Result

Theorem 3.1

The reverse discrete-diffusion kernel $p^\theta_{s|t}$ that ensures $\left(p^\theta_t = [\mathop{\mathrm{arg\,max}}]_{\star}\, \bar{p}^\theta_t\right)_{t \in [0, 1]}$ is given by

Figures (20)

  • Figure 1: An illustration of Uniform-state discrete diffusion (top) and the underlying Gaussian diffusion (bottom). While both are separate Markov processes, applying $\mathop{\mathrm{arg\,max}}$ on Gaussian latents ${\mathbf w}_t \in \mathbb{R}^n$ converts them to discrete latents ${\mathbf z}_t \in \mathcal{V}$, transforming their marginals from $\tilde{q}_t(\cdot \mid {\mathbf x}; \tilde{\alpha}_{t})$ (\ref{eqn:gaussian_marginal}) to $q_t(\cdot \mid {\mathbf x}; \mathcal{T}(\tilde{\alpha}_{t}))$ (\ref{eqn:discrete_marginal}) and adjusting diffusion parameters from $\tilde{\alpha}_{t}$ to $\alpha_{t} = \mathcal{T}(\tilde{\alpha}_{t})$ (\ref{eqn:coefficient_relation}). Notably, the ELBO for Uniform-state diffusion induces a tighter bound on the likelihood than Gaussian diffusion, as established in Theorem \ref{theorem:elbo}.
  • Figure 2: Curriculum learning drastically lowers the gradient variance in Duo trained with a fixed $\tau=0.001$. The figure shows the summed gradient variance of the 100 weights with the highest variance, comparing Duo with CL (blue) and without CL (grey).
  • Figure 3: Sample quality comparison of Duo vs. MDLM. Duo outperforms MDLM in Gen PPL ($\downarrow$) for base models and in the low-NFE regime after 5 distillation rounds.
  • Figure 4: Sample quality of the base Duo model vs. Duo distilled for 5 rounds with DCD. With the ancestral sampler, the distilled model matches base quality in 16 steps (vs. 1024); with the Greedy-Tail sampler, it needs only 8 steps, at the cost of slightly reduced sample diversity.
  • Figure 5: Comparison of sample generation processes in various discrete sequence models; see Suppl. \ref{supp:generation} for a detailed discussion. (a) Autoregressive Model: Tokens are generated sequentially, one at a time, from left to right. (b) Masked Diffusion: Once unmasked, a token remains fixed, though multiple tokens may be denoised simultaneously at each step. (c) Uniform-state Diffusion: Tokens can visit several intermediate states during the diffusion process. (d) $\bm{\mathcal{P}}_{\textbf{DDT}}$: As in Uniform-state diffusion models (USDMs), generation begins with a sequence of randomly initialized tokens. However, once a token flips, it remains fixed throughout the reverse generation process. Thus, the generation process closely resembles that of masked diffusion models (MDMs).
  • ...and 15 more figures
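
The map $\mathcal{T}$ in the Figure 1 caption can be made concrete with a short order-statistics argument. The derivation below is a sketch under stated assumptions rather than a reproduction of the paper's equations: it assumes a one-hot clean token over $K$ categories and a variance-preserving Gaussian marginal with $\tilde{\alpha}_t < 1$. The symbols $K$, $P$, $\phi$, and $\Phi$ (the standard normal density and CDF) are notation introduced here.

```latex
% Assumptions (not stated in this summary): x = e_k is one-hot over K categories,
% and w_t = \tilde{\alpha}_t x + \tilde{\sigma}_t \epsilon with
% \tilde{\sigma}_t = \sqrt{1 - \tilde{\alpha}_t^2} > 0, \epsilon \sim \mathcal{N}(0, I_K).
\begin{align*}
P &:= \Pr\big[\arg\max_j (w_t)_j = k\big]
   = \Pr\big[\epsilon_j < \epsilon_k + \tilde{\alpha}_t / \tilde{\sigma}_t
     \ \ \forall j \neq k\big] \\
  &= \int_{-\infty}^{\infty} \phi(u)\,
     \Phi\!\big(u + \tilde{\alpha}_t / \tilde{\sigma}_t\big)^{K-1}\, du, \\
\Pr\big[\arg\max_j (w_t)_j = i\big]
  &= \frac{1 - P}{K - 1} \quad (i \neq k), \ \text{by symmetry of the } K - 1 \text{ noise-only coordinates}, \\
P &= \alpha_t + \frac{1 - \alpha_t}{K}
  \;\Longrightarrow\;
  \alpha_t = \mathcal{T}(\tilde{\alpha}_t) = \frac{K P - 1}{K - 1}.
\end{align*}
```

Under these assumptions the argmax distribution has exactly the Uniform-state form $\alpha_t \mathbf{x} + (1 - \alpha_t)\mathbf{1}/K$, with $\alpha_t$ determined by $\tilde{\alpha}_t$ through the one-dimensional integral for $P$.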

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem 3.2