Generalized Interpolating Discrete Diffusion
Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
TL;DR
The paper addresses token revision in diffusion language models by generalizing masked diffusion to Generalized Interpolating Discrete Diffusion (GIDD), a family of forward processes that mix the data with time-varying distributions. It derives a continuous-time negative ELBO (CT-NELBO), shows that GIDD reduces to masked diffusion when the mixing distribution is fixed to the mask token, and introduces a hybrid masking+uniform-noise schedule that enables self-correction during generation. Empirically, GIDD with a reweighted ELBO achieves compute-matched state-of-the-art perplexity among diffusion models on OpenWebText, and uniform noise improves sample quality and enables self-correction, especially as model size and the number of denoising steps grow. The work also shows that diffusion-based language modeling can rival autoregressive approaches on downstream tasks, highlighting the practical potential of GIDD for scalable, controllable text generation and error correction.
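To make the idea of mixing data with time-varying distributions concrete, here is a minimal sketch rather than the authors' implementation. It assumes the per-token marginal is the convex combination q_t(z | x) = alpha_t * x + (1 - alpha_t) * pi_t, with a linear schedule alpha_t = 1 - t. The function names, the `p_uniform` parameter, and the bump-shaped weight on uniform noise are all hypothetical choices made for illustration.

```python
import numpy as np


def hybrid_mixing(t, vocab_size, mask_id, p_uniform=0.2):
    """Hypothetical mixing distribution pi_t: mostly mask, plus some uniform noise.

    The uniform weight 4 * p_uniform * t * (1 - t) is an illustrative assumption;
    it vanishes at t = 0 and t = 1 and peaks at p_uniform for t = 0.5.
    """
    mask = np.zeros(vocab_size)
    mask[mask_id] = 1.0
    # Uniform over the non-mask tokens (an assumption of this sketch).
    uniform = (1.0 - mask) / (vocab_size - 1)
    w = 4.0 * p_uniform * t * (1.0 - t)
    return (1.0 - w) * mask + w * uniform


def gidd_marginal(x_onehot, alpha_t, pi_t):
    """Assumed interpolating marginal: q_t(z | x) = alpha_t * x + (1 - alpha_t) * pi_t."""
    return alpha_t * x_onehot + (1.0 - alpha_t) * pi_t


def sample_noisy_token(rng, token_id, t, vocab_size, mask_id, p_uniform=0.2):
    """Draw z_t ~ q_t(. | x) for one token under a linear schedule alpha_t = 1 - t."""
    x = np.eye(vocab_size)[token_id]
    q = gidd_marginal(x, 1.0 - t, hybrid_mixing(t, vocab_size, mask_id, p_uniform))
    return int(rng.choice(vocab_size, p=q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy vocabulary of 6 tokens where id 5 plays the role of the mask token.
    print([sample_noisy_token(rng, 2, 0.5, 6, 5) for _ in range(10)])
```

With `p_uniform=0.0` the mixing distribution collapses to a point mass on the mask token, so the sketch reduces to plain masking. A nonzero value lets a token be replaced by a different ordinary token, which is the kind of corruption a model has to learn to revise.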
Abstract
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/
