Training Optimal Large Diffusion Language Models
Jinjie Ni, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh
TL;DR
Quokka provides the first systematic scaling laws for diffusion language models (DLMs) across compute- and data-constrained regimes. Under a fixed compute budget $C$, the optimal model size $N$ and token count $D$ each follow a power law in $C$, $N_{opt}\propto C^{a}$ and $D_{opt}\propto C^{b}$ with $a\approx b\approx 0.5$, so $N$ and $D$ grow in roughly equal proportion as compute increases; DLMs are also more data-hungry than autoregressive (AR) models by a factor of roughly $2$–$5$. Quokka extends scaling theory to data constraints by modeling the validation loss with an effective data size $D'$ that captures repeated data and overfitting through the number of epochs $e$ and the unique data budget $U_D$, which enables predictions of the optimal $(N,e)$ and $(N,D)$ under data limits. The work introduces a data-constrained diffusion loss formulation and finds that masked diffusion with simple diffusion schedules is effective and that AR-derived hyperparameters (batch size, learning rate) transfer well to DLMs, while weight decay and multi-epoch strategies modulate overfitting. Together, these results give actionable guidance for training DLMs efficiently and highlight the practical trade-offs between model size, data, and training duration, with implications for both short-term practice and long-term AI research.
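To make the two scaling statements above concrete, here is a minimal Python sketch, not the paper's fitted law. It rests on several assumptions that do not come from this summary: the common approximation $C \approx 6ND$ for training FLOPs, a user-supplied tokens-per-parameter ratio (`tokens_per_param`, hypothetical), and a saturating form for the effective data size $D'$ borrowed from earlier data-constrained scaling work on AR models, with a hypothetical saturation constant `r_star`. The exact functional form and constants used by Quokka may differ.

```python
import math


def optimal_allocation(compute_flops, tokens_per_param, flops_per_param_token=6.0):
    """Split a compute budget C between model size N and tokens D.

    Assumes C = k * N * D (k = 6 is a common approximation, not a Quokka result)
    and a fixed ratio D / N = tokens_per_param, which corresponds to the
    a = b = 0.5 case: both N and D scale as C ** 0.5.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def effective_tokens(unique_tokens, epochs, r_star):
    """Effective data size D' for `epochs` passes over U_D unique tokens.

    Assumed saturating form: D' = U_D * (1 + r_star * (1 - exp(-(e - 1) / r_star))).
    The first epoch counts fully; each repeat is worth less, and D' can never
    exceed U_D * (1 + r_star). `r_star` is a hypothetical fitted constant.
    """
    repeats = epochs - 1
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    # Quadrupling compute doubles both N and D when a = b = 0.5.
    for c in (1e20, 4e20):
        n, d = optimal_allocation(c, tokens_per_param=20.0)
        print(f"C={c:.1e}  N={n:.3e}  D={d:.3e}")

    # Diminishing returns from repeating the same 1B unique tokens.
    for e in (1, 2, 4, 16, 64):
        print(e, f"{effective_tokens(1e9, e, r_star=15.0):.3e}")
```

In this sketch the ratio `tokens_per_param` is where a "DLMs need roughly 2 to 5 times more data" finding would enter: a larger ratio shifts the same compute toward more tokens and a smaller model. The values `20.0` and `15.0` are placeholders for illustration only.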
Abstract
We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), covering both compute-constrained and data-constrained regimes and studying the key modeling and optimization design choices. Quokka is a good friend of Chinchilla and covers a wider scope. We hope the results will bring short-term practical guidance for DLM training and long-term inspiration for the whole AI community.
