Training Optimal Large Diffusion Language Models

Jinjie Ni; Qian Liu; Chao Du; Longxu Dou; Hang Yan; Zili Wang; Tianyu Pang; Michael Qizhe Shieh

TL;DR

Quokka provides the first systematic scaling laws for diffusion language models (DLMs) across compute- and data-constrained regimes. Under fixed compute $C$, the optimal model size $N$ and training data size $D$ grow at nearly the same rate, each scaling roughly as $C^{0.5}$ (exponents $a\approx b\approx 0.5$), and DLMs are more data-hungry than autoregressive (AR) models by a factor of roughly $2$–$5$. Quokka extends scaling theory to data constraints by modeling the validation loss with an effective data size $D'$ that captures repeated data and overfitting via the number of epochs $e$ and the unique data budget $U_D$, enabling predictions of optimal $(N,e)$ and $(N,D)$ under data limits. The work introduces a data-constrained diffusion loss formulation and demonstrates that masked diffusion, simple diffusion schedules, and AR-derived hyperparameters (batch size, learning rate) transfer well to DLMs, while weight decay and multi-epoch strategies modulate overfitting. Together, these results provide actionable guidance for training DLMs efficiently and highlight practical trade-offs between model size, data, and training duration in real-world settings, with implications for both short-term practice and long-term AI research.
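As a rough illustration of what exponents near $0.5$ imply, the sketch below splits a compute budget $C$ between model size $N$ and training tokens $D$. It rests on assumptions that are not taken from the paper: the common approximation $C \approx 6ND$, a fixed tokens-per-parameter ratio, and the hypothetical function name `compute_optimal_allocation`. The ratio value used in the usage note is a placeholder, not a fitted Quokka coefficient.

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param,
                               flops_per_param_token=6.0):
    """Return (N, D) for a compute budget C under two stated assumptions.

    Assumption 1: C ~= flops_per_param_token * N * D (the usual 6ND rule).
    Assumption 2: D = tokens_per_param * N (a fixed, hypothetical ratio).

    Together these give N and D both proportional to C**0.5, which is the
    behaviour that exponents a ~= b ~= 0.5 describe.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Solve C = k * N * (r * N) for N, then recover D from the ratio.
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

Under these assumptions, calling `compute_optimal_allocation(1e20, tokens_per_param=50.0)` and then repeating the call with `1e22` increases both $N$ and $D$ by a factor of $10$ for a $100\times$ increase in compute. A larger `tokens_per_param` shifts the same budget toward more data and a smaller model.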

Abstract

We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), which covers both compute-constrained and data-constrained regimes and studies the key modeling and optimization design choices. Quokka is a good friend of Chinchilla and covers a wider scope. We hope the results offer short-term practical guidance for DLM training and long-term inspiration for the whole AI community.
