Generalized Interpolating Discrete Diffusion

Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann

TL;DR

The paper aims to let language diffusion models revise already generated tokens. It generalizes masked diffusion to Generalized Interpolating Discrete Diffusion (GIDD), a family of forward processes that interpolate between the data and a time-varying mixing distribution. It derives a continuous-time negative ELBO (CT-NELBO), shows that GIDD reduces to masked diffusion when the mixing distribution is fixed to the mask token, and introduces a hybrid schedule combining masking with uniform noise that enables self-correction during generation. Empirically, with a reweighted ELBO, GIDD achieves compute-matched state-of-the-art perplexity among diffusion models on OpenWebText; uniform noise improves sample quality and enables self-correction, especially as model size and the number of denoising steps grow. The work also shows that diffusion-based language modeling can rival autoregressive approaches on downstream tasks, highlighting the practical potential of GIDD for scalable, controllable text generation and error correction.

Abstract

While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/

Paper Structure

This paper contains 48 sections, 10 theorems, 70 equations, 8 figures, 8 tables, 1 algorithm.

Key Result

Proposition 3.3

Let $\alpha_t$, $\beta_t = 1 - \alpha_t$ denote the mixing rate and let $\boldsymbol{\pi}_t$ denote the mixing distribution. Then there exists a continuous-time Markov chain with transition probabilities from state $z_s$ to $z_t$ at times $s \leq t$ given by $q_{t|s}(z_t \mid z_s) = \mathrm{Cat}(z_t; Q_{t|s}\mathbf{z}_s)$ with $Q_{t|s} = \alpha_{t|s} I + \beta_{t|s}\boldsymbol{\pi}_{t|s}\mathbf{1}^\top$, where $\alpha_{t|s} = \frac{\alpha_t}{\alpha_s}$ and $\beta_{t|s}\boldsymbol{\pi}_{t|s} = \beta_t\boldsymbol{\pi}_t - \frac{\alpha_t}{\alpha_s}\beta_s\boldsymbol{\pi}_s$.
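
To make the proposition concrete, the following is a minimal numerical sketch rather than the authors' implementation. It assumes the transition matrix has the interpolating form $Q_{t|s} = \alpha_{t|s} I + \beta_{t|s}\boldsymbol{\pi}_{t|s}\mathbf{1}^\top$ and that the marginals are $q_t(\cdot \mid x) = \alpha_t \mathbf{x} + \beta_t\boldsymbol{\pi}_t$. The function names, the 4-token vocabulary, and the values of $\alpha$ and $\boldsymbol{\pi}$ are all hypothetical choices made for illustration.

```python
import numpy as np

def gidd_transition(alpha_s, alpha_t, pi_s, pi_t):
    """Build the assumed Q_{t|s}; column j is the distribution of z_t given z_s = j."""
    alpha_ts = alpha_t / alpha_s
    # beta_{t|s} * pi_{t|s} = beta_t * pi_t - (alpha_t / alpha_s) * beta_s * pi_s
    beta_pi_ts = (1.0 - alpha_t) * pi_t - alpha_ts * (1.0 - alpha_s) * pi_s
    n = pi_t.shape[0]
    return alpha_ts * np.eye(n) + np.outer(beta_pi_ts, np.ones(n))

def marginal(alpha_t, pi_t, x):
    """Assumed marginal q_t(. | x) = alpha_t * x + beta_t * pi_t for a one-hot x."""
    return alpha_t * x + (1.0 - alpha_t) * pi_t

# Hypothetical 4-token vocabulary; the last index plays the role of the mask token.
alpha_s, alpha_t = 0.7, 0.4                    # mixing rates at times s <= t
pi_s = np.array([0.05, 0.05, 0.05, 0.85])      # mostly mask, a little uniform noise
pi_t = np.array([0.10, 0.10, 0.10, 0.70])      # more uniform noise later in time
x = np.array([0.0, 1.0, 0.0, 0.0])             # one-hot clean token

Q = gidd_transition(alpha_s, alpha_t, pi_s, pi_t)
q_s = marginal(alpha_s, pi_s, x)
q_t = marginal(alpha_t, pi_t, x)

# Pushing the time-s marginal through Q_{t|s} should reproduce the time-t marginal.
print(np.allclose(Q @ q_s, q_t))
```

Under these assumptions the check amounts to the identity $\alpha_{t|s}(\alpha_s \mathbf{x} + \beta_s\boldsymbol{\pi}_s) + \beta_{t|s}\boldsymbol{\pi}_{t|s} = \alpha_t \mathbf{x} + \beta_t\boldsymbol{\pi}_t$. Note that not every pair of mixing distributions yields a valid chain: the entries of $Q_{t|s}$ must be non-negative, which holds for the values chosen here.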

Figures (8)

  • Figure 1: Training a diffusion model using GIDD on a combination of masking and uniform noise teaches it to identify and correct its own mistakes. By iteratively replacing bad tokens with better ones (as determined by the model), sample quality (as per generative PPL via Gemma 2 9B) improves by up to 55%.
  • Figure 2: ELBO weights grow exponentially for very low/high noise levels, causing poor optimization if not handled carefully. While masked and uniform token weights are almost constant, noise-free token weights vary heavily depending on $p_u$.
  • Figure 3: From left to right: (a) Self-correction using GIDD+ (base) models resamples up to 10% of tokens independent of the uniform noise level. A temperature of $\tau \in [0.1, 0.5]$ is found to be most effective. (b) For models trained on hybrid noise, sample quality (PPL) improves significantly as more tokens are changed. The mask-only model, though, is unable to improve quality despite resampling as many tokens. Sample diversity (entropy) drops noticeably for mask-only models, but only slightly for hybrid models. (c) The correlation between self-accuracy and generative PPL reveals that hybrid models are significantly better at judging the quality of their own samples.
  • Figure 4: Plotting the compute-efficient frontier reveals different scaling behaviors for different uniform noise levels: training with uniform noise benefits slightly more from scaling compute than the mask-only setting.
  • Figure 5: Self-correction results for our small models. While the overall trend is the same as for base models, the best-performing model uses $p_u = 0.1$ instead of $p_u = 0.2$, suggesting that the ideal uniform noise ratio depends on model size. The MDM baseline is noticeably worse than the mask-only GIDD implementation, with self-correction yielding negative improvements, which is likely due to numerical limitations in the GIDD implementation.
  • ...and 3 more figures
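
The captions for Figures 1 and 3 describe self-correction only at a high level: tokens the model judges to be bad are iteratively replaced by better ones, sampled at a low temperature $\tau$. One plausible reading is sketched below; it is not taken from the paper's code. The `self_correct` function, the `logits_fn` callback, the toy model, and the rule of committing a single most-confident change per step are all assumptions made for illustration.

```python
import numpy as np

def self_correct(tokens, logits_fn, temperature=0.1, max_steps=32, seed=0):
    """Repeatedly resample the sequence and commit one changed token per step."""
    rng = np.random.default_rng(seed)
    tokens = np.array(tokens, copy=True)
    positions = np.arange(len(tokens))
    for _ in range(max_steps):
        # Temperature-scaled softmax over the vocabulary at every position.
        logits = logits_fn(tokens) / temperature
        logits -= logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        proposal = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
        changed = proposal != tokens
        if not changed.any():
            break  # the model no longer disagrees with its own sample
        # Among disagreeing positions, commit the proposal with the highest probability.
        confidence = np.where(changed, probs[positions, proposal], -np.inf)
        i = int(np.argmax(confidence))
        tokens[i] = proposal[i]
    return tokens

# Hypothetical stand-in for a trained denoiser: it always prefers TARGET.
TARGET = np.array([2, 0, 3, 1])

def toy_logits(tokens):
    logits = np.zeros((len(tokens), 4))
    logits[np.arange(len(tokens)), TARGET] = 5.0
    return logits

noisy = np.array([2, 1, 3, 3])   # two positions disagree with TARGET
fixed = self_correct(noisy, toy_logits)
print(fixed.tolist())
```

Committing one token per step keeps each update conditioned on the previous correction; a real model would replace `toy_logits` with a forward pass over the full sequence, and a practical loop would likely also need a guard against oscillating between tokens.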

Theorems & Definitions (28)

  • Definition 3.1: Mixing Rate
  • Definition 3.2: Mixing Distribution
  • Proposition 3.3: GIDD Conditional Transitions
  • proof
  • Corollary 3.4
  • proof
  • Definition 3.5: CTMC Forward Transition
  • Lemma 3.6: GIDD Forward Rate
  • proof
  • Theorem 3.7: GIDD ELBO
  • ...and 18 more