Training Optimal Large Diffusion Language Models

Jinjie Ni, Qian Liu, Chao Du, Longxu Dou, Hang Yan, Zili Wang, Tianyu Pang, Michael Qizhe Shieh

TL;DR

Quokka provides the first systematic scaling laws for diffusion language models (DLMs) across compute- and data-constrained regimes. Under a fixed compute budget $C$, the optimal model size $N$ and number of training tokens $D$ grow at nearly the same rate with compute ($N_{opt}\propto C^{a}$, $D_{opt}\propto C^{b}$, with $a\approx b\approx 0.5$), and DLMs are more data-hungry than autoregressive (AR) models by a factor of roughly $2$–$5$. The work extends scaling theory to data constraints by modeling the validation loss with an effective data size $D'$ that captures repeated data and overfitting through the number of epochs $e$ and the unique data budget $U_D$, enabling predictions of the optimal $(N,e)$ and $(N,D)$ under data limits. It introduces a data-constrained diffusion loss formulation and demonstrates that masked diffusion, simple diffusion schedules, and AR-derived hyperparameters (batch size, learning rate) transfer well to DLMs, while weight decay and multi-epoch strategies modulate overfitting. Together, these results provide actionable guidance for training diffusion language models efficiently and highlight practical trade-offs between model size, data, and training duration.
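To make the exponents concrete, here is a minimal sketch of how an allocation would be rescaled if both exponents were exactly $0.5$. The function name `scale_allocation` and the reference values for $N$ and $D$ are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def scale_allocation(n_ref, d_ref, compute_ratio, a=0.5, b=0.5):
    """Rescale a reference (N, D) allocation to a new compute budget.

    Assumes pure power laws N_opt ~ C^a and D_opt ~ C^b. The defaults
    a = b = 0.5 are the rounded exponents quoted in the summary.
    """
    n_new = n_ref * compute_ratio ** a
    d_new = d_ref * compute_ratio ** b
    return n_new, d_new


# Hypothetical reference point: 1e9 parameters trained on 1e11 tokens.
n_ref, d_ref = 1e9, 1e11

# With a = b = 0.5, a 4x larger compute budget doubles both N and D.
n_new, d_new = scale_allocation(n_ref, d_ref, compute_ratio=4.0)
print(f"N: {n_ref:.1e} -> {n_new:.1e}")
print(f"D: {d_ref:.1e} -> {d_new:.1e}")
```

Under these assumptions the ratio $D/N$ stays constant as compute grows, which is what "scaling at the same pace" means in practice.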

Abstract

We introduce Quokka, the first systematic scaling law for diffusion language models (DLMs), covering both compute-constrained and data-constrained regimes and studying the key modeling and optimization design choices. Quokka complements Chinchilla and covers a wider scope. We hope the results offer short-term practical guidance for DLM training and long-term inspiration for the whole AI community.

Paper Structure

This paper contains 41 sections, 27 equations, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Overlaid predictions from Chinchilla and Quokka (compute-constrained). We overlay the predictions from our Approaches 1 and 2 with those of Hoffmann et al. (2022). Although both scale at the same pace, DLMs are 2–5$\times$ more data-hungry than AR models at the same FLOPs, so the compute-optimal allocation favors smaller models and larger corpora. We mark the position of LLaDA (Nie et al., 2025) in the same space and find that it is severely over-trained, with a model about 2$\times$ smaller and about 2$\times$ more data than the Quokka efficient frontier prescribes. We also show the positions of open-source models and find that most are over-trained relative to the Chinchilla efficient frontier, except for some models from the Llama family. Note that the token statistics are based on the numbers in the models' reports, which might not be strictly unique tokens. More discussion is given in the Discussions section.
  • Figure 2: IsoFLOP curves illustrating the final training loss for a fixed compute budget. For each curve, we vary the model size and adjust the number of training tokens to maintain constant total training FLOPs. The left panel reveals a distinct performance valley, indicating an optimal trade-off between model size and data for a given compute budget. Leveraging the minima of these curves, we extrapolate the scaling law for the optimal number of parameters and training tokens to larger compute regimes (center and right). The green point highlights our projection for an optimally-scaled model trained with the LLaDA compute budget.
  • Figure 3: Parametric fit of the loss function $L(N, D)$. Left: Iso-loss contours of our fitted model. The blue line indicates the efficient frontier, that is, the trajectory of minimal compute (FLOPs) required to achieve a given loss value, which is linear in log-log space. Right: Several isoFLOP cross-sections of the loss surface, corresponding to the dashed lines in the left panel. The measured data points are also plotted for comparison.
  • Figure 4: Final-step validation losses for models of varying sizes trained with different unique data budgets and epochs. We consistently observe a U-shaped relationship between model size and final validation loss for a fixed data budget, with a minority of runs exhibiting double descent. Larger model sizes tend to accelerate the onset of overfitting (the right side of the "U"), while increasing the number of unique tokens delays it. The minimum achievable loss improves as the amount of unique data increases. These empirical findings provide the motivation for our data-constrained scaling law.
  • Figure 5: The loss contours predicted by the fitted data-constrained loss $\hat{L}(N, U_D, e)$. We show the $N$-$e$ contours for different unique data budgets $U_D$. We observe a local optimum within each observation scope, and the optimal $N$ and $e$ consistently grow with $U_D$.
  • ...and 18 more figures
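
The isoFLOP procedure described for Figure 2 relies on locating the minimum of each loss curve. The caption does not say how the minima are found; a common choice, assumed here, is to fit a parabola to the loss as a function of $\log_{10} N$ and take its vertex. The sketch below uses synthetic loss values and hypothetical names (`parabola_vertex`, `synthetic_loss`), not data from the paper.

```python
def parabola_vertex(p1, p2, p3):
    """Return the x-coordinate of the vertex of the parabola through three points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    return x2 - 0.5 * num / den


def synthetic_loss(log_n):
    """Made-up isoFLOP curve: a valley centered at log10(N) = 8.5."""
    return 2.5 + 0.05 * (log_n - 8.5) ** 2


# Three hypothetical runs at one compute budget, indexed by log10(model size).
points = [(x, synthetic_loss(x)) for x in (8.0, 8.6, 9.2)]

log_n_opt = parabola_vertex(*points)
n_opt = 10 ** log_n_opt
print(f"estimated optimal log10(N): {log_n_opt:.3f}")
print(f"estimated optimal N: {n_opt:.3e}")
```

Repeating this for several compute budgets gives one $(C, N_{opt})$ pair per curve, and a straight-line fit to those pairs in log-log space yields the scaling exponent. With real runs, a least-squares fit over more than three points would be more robust to noise.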