Scaling Beyond Masked Diffusion Language Models

Subham Sekhar Sahoo; Jean-Marie Lemercier; Zhihan Yang; Justin Deschenaux; Jingyu Liu; John Thickstun; Ante Jukic

TL;DR

The paper investigates scaling laws for three discrete diffusion language-model families (Masked diffusion, Uniform-state diffusion, and interpolating diffusion) under compute-matched regimes, and assesses both likelihood-based and sampling-based performance. It finds that although Masked diffusion often achieves the best perplexity among the diffusion families, better likelihood scaling does not guarantee practical superiority: Uniform-state diffusion (Duo) offers faster, more practical sampling, and the interpolating Eso-LM supports KV caching, so both shape the speed-quality tradeoff. A low-variance training objective makes Masked diffusion roughly 12% more FLOPs-efficient and shifts compute-optimal points to smaller models. Scaling experiments up to 1.7B parameters show that AR remains strongest on likelihood benchmarks, while Duo excels on math and reasoning after supervised fine-tuning. Perplexity alone is therefore insufficient for cross-algorithm evaluation; downstream tasks and sampling efficiency are crucial for diffusion-LM design.

Abstract

Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
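The speed-quality Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. It assumes each model configuration is summarized by a (throughput, generative perplexity) pair, where higher throughput and lower perplexity are better. The function name `pareto_frontier` and the sample numbers are hypothetical and are not taken from the paper or its released code.

```python
from typing import List, Tuple

# (throughput in tokens/sec, generative perplexity); illustrative layout only
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    A point dominates another if its throughput is at least as high and its
    perplexity at least as low, with one of the two strictly better.
    """
    # Visit points from fastest to slowest; on ties, lower perplexity first.
    ordered = sorted(points, key=lambda p: (-p[0], p[1]))
    frontier: List[Point] = []
    best_ppl = float("inf")
    for throughput, ppl in ordered:
        # Keep a point only if it beats every faster (or equally fast) point
        # on perplexity; otherwise one of those points dominates it.
        if ppl < best_ppl:
            frontier.append((throughput, ppl))
            best_ppl = ppl
    return sorted(frontier)


# Hypothetical measurements, not results from the paper.
measurements = [(100, 20), (300, 35), (300, 40), (500, 30), (700, 60), (200, 50)]
frontier = pareto_frontier(measurements)
print(frontier)
```

A method "dominates" a throughput range in this sense when its points are the ones that survive on the frontier within that range.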

Paper Structure (39 sections, 17 equations, 6 figures, 4 tables)

Introduction
Background
Notation.
Autoregressive Models
Discrete Diffusion Models
Masked Diffusion Models
Forward process
Reverse process
Training
Uniform-state Diffusion Models
Forward Process
Reverse Process
Training
Esoteric Language Models
Forward and Reverse Processes
...and 24 more sections

Figures (6)

Figure 1: Speed-Quality Pareto Frontier. We report the highest throughput achieved by compute-optimal models across a range of training FLOPs budgets. AR produces the highest-quality samples but is slow. Sample diversity (measured by entropy) remains broadly similar across algorithms, with Duo exhibiting slightly reduced diversity; see Fig. \ref{fig:all-quality-entropy}. Duo dominates in the throughput ranges $[200, 400] \cup [600, \infty)$, while Eso-LM dominates in the range $[400, 600]$.
Figure 2: IsoFLOP Analysis under fixed computation budgets.
Figure 3: Scaling Laws. Diffusion models exhibit scaling behavior similar to that of AR models.
Figure 4: Throughput (tokens/sec; $\uparrow$) vs time discretization $T$ for various diffusion models.
Figure 5: Gen. PPL (sample quality; $\downarrow$) and entropy (sample diversity; $\uparrow$) vs time discretization ($T$) for (a) MDLM w/ ancestral sampler, (b) Eso-LM w/ block sampler, and (c) Duo w/ ancestral sampler.
...and 1 more figure
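The IsoFLOP analysis named in Figure 2 can be sketched as follows. This is an illustration only, assuming the common practice of fitting a parabola to validation loss against the logarithm of model size at a fixed FLOPs budget and reading the compute-optimal size off the vertex; the paper's exact fitting procedure may differ. The helper names `fit_parabola` and `compute_optimal_size` and the synthetic data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_parabola(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    n = float(len(xs))
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    m = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    d = det3(m)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        mc = [[rhs[r] if k == col else m[r][k] for k in range(3)] for r in range(3)]
        coeffs.append(det3(mc) / d)
    return coeffs[0], coeffs[1], coeffs[2]


def compute_optimal_size(params: List[float], losses: List[float]) -> float:
    """Model size at the minimum of the parabola fitted in log10(params)."""
    logs = [math.log10(p) for p in params]
    mean = sum(logs) / len(logs)
    centered = [x - mean for x in logs]  # centering improves conditioning
    a, b, _ = fit_parabola(centered, losses)
    if a <= 0:
        raise ValueError("fitted curve has no minimum (a <= 0)")
    return 10 ** (mean - b / (2 * a))


# Synthetic losses at one fixed FLOPs budget, not measurements from the paper:
# an exact parabola whose minimum sits at 10**8.5 parameters.
sizes = [1e8, 3e8, 1e9, 3e9]
losses = [0.5 * (math.log10(n) - 8.5) ** 2 + 3.0 for n in sizes]
n_opt = compute_optimal_size(sizes, losses)
print(f"compute-optimal size is about {n_opt:.3g} parameters")
```

Repeating this fit for several FLOPs budgets gives one compute-optimal point per budget, which is the kind of data a scaling-law fit is then drawn through.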
