Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

TL;DR

The paper shows that diffusion-based language models, when trained on repeated data, outperform autoregressive models in data-constrained settings despite higher single-epoch compute. By deriving data-reuse scaling laws and a critical compute frontier, it demonstrates that diffusion can leverage repeated data far more effectively ($R_D^*$ is significantly larger for diffusion) and yields better downstream performance. The authors attribute this to diffusion’s exposure to diverse token orderings, which acts as implicit data augmentation beyond AR’s fixed left-to-right factorization. These findings suggest shifting emphasis toward diffusion strategies in data-scarce regimes and hint at promising AR-diffusion hybrids for balanced compute-data tradeoffs.

Abstract

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings where training involves repeated passes over limited data and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. Finally, we explain why diffusion models excel in this regime: their randomized masking objective implicitly trains over a rich distribution of token orderings, acting as an implicit data augmentation that AR's fixed left-to-right factorization lacks. Our results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
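The contrast between the two training signals can be illustrated with a small sketch. This is not the authors' implementation: the helper names, the `[MASK]` placeholder, and the uniformly sampled mask ratio are assumptions chosen for illustration, and the loss computation is omitted. It only shows that an AR model sees the same prefix-to-next-token pairs on every pass over a sequence, while a masked objective produces a different corrupted view each time.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def ar_training_pairs(tokens):
    """Left-to-right factorization: each token is predicted from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_view(tokens, rng):
    """Draw one corrupted view of the sequence.

    Assumption: a mask ratio is sampled uniformly in [0, 1), then each
    position is masked independently with that probability.
    """
    ratio = rng.random()
    is_masked = [rng.random() < ratio for _ in tokens]
    corrupted = [MASK if m else t for t, m in zip(tokens, is_masked)]
    # Positions the model would be asked to reconstruct.
    targets = {i: t for i, (t, m) in enumerate(zip(tokens, is_masked)) if m}
    return corrupted, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# AR: the set of (context, target) pairs is identical in every epoch.
ar_pairs = ar_training_pairs(tokens)

# Masked objective: every pass over the same sequence yields a new view.
views = [masked_view(tokens, rng) for _ in range(5)]
for corrupted, targets in views:
    print(corrupted, sorted(targets))
```

Each repeated pass over the same sequence therefore presents a different prediction problem under the masked objective, which is the sense in which the abstract describes it as implicit data augmentation.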

Paper Structure

This paper contains 30 sections, 13 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Pareto frontier of validation loss versus training FLOPs for autoregressive (AR) and masked diffusion models under data-constrained settings. Each point represents a model trained until convergence; we report the best validation loss achieved among all models using less than or equal to the FLOPs shown on the x-axis. AR models initially outperform diffusion models, particularly near the Chinchilla-optimal compute point [hoffmann2022chinchilla] (indicated on the plot). However, as training continues beyond this regime with repeated data, AR models quickly saturate and begin to overfit. In contrast, diffusion models continue to improve with more compute and exhibit no signs of overfitting.
  • Figure 2: Validation loss contours over epochs and model sizes for autoregressive (left) and diffusion (right) models, trained on 100M unique tokens. Each plot shows validation loss as a function of training epochs (x-axis) and model parameters (y-axis). The colored star marks the compute-optimal point for single-epoch training, as predicted by prior scaling laws [hoffmann2022chinchilla, nie2019adversarial], and the black star indicates the lowest validation loss achieved through extended multi-epoch training. In the single-epoch regime, diffusion models perform worse than AR models (10.65 vs. 7.07). However, when trained longer, diffusion models achieve a substantially lower final loss (3.55 vs. 3.71). This corresponds to a 67% reduction in loss for diffusion models compared to just 48% for AR models, highlighting their superior ability to leverage repeated data.
  • Figure 3: Decay rate of data value under repetition: left shows diffusion, middle shows AR, and right shows the average decay rate for both. Points are empirical results (darker color = higher FLOPs, lighter color = lower FLOPs; each line = fixed compute). The fitted curves (shown as lines) closely match the empirical points, indicating that our scaling laws are representative. The decay rate of the value of repeated data is lower for diffusion, reflecting its greater robustness to repetition.
  • Figure 4: Training curves for different epoch counts, all using the same total compute. Each curve shows a different tradeoff between unique data and repetition. For AR models, validation loss rises with more epochs (overfitting), while for diffusion models the curves are nearly unchanged, showing much greater robustness to data repetition.
  • Figure 5: Predicted validation loss for AR (left) and diffusion (right) models under compute-optimal settings, extrapolated to larger compute budgets. Dotted lines show the hypothetical case where repeated data is worth as much as new data. For AR, this holds up to $\approx$4 epochs; for diffusion, up to $\approx$100 epochs, showing diffusion’s greater robustness to data repetition. Note that loss values between AR and diffusion are not directly comparable, as they are extrapolated from scaling laws with different data-entropy terms ($E_0$). In Section \ref{sec:why}, we ignore this factor during comparison.
  • ...and 4 more figures
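
Two quantities in the captions above are simple enough to sketch in code: the "best loss at or below a given FLOPs budget" frontier described for Figure 1, and the relative loss reductions quoted for Figure 2. The function names and the `example_runs` values below are hypothetical and only illustrate the computation; the four loss values passed to `relative_reduction` are the ones stated in the Figure 2 caption.

```python
def best_loss_frontier(runs):
    """Running minimum of validation loss over runs sorted by FLOPs.

    `runs` is an iterable of (flops, val_loss) pairs. For each run, the
    result holds the best loss among all runs using at most that many FLOPs.
    """
    frontier = []
    best = float("inf")
    for flops, loss in sorted(runs):
        best = min(best, loss)
        frontier.append((flops, best))
    return frontier


def relative_reduction(start_loss, end_loss):
    """Fractional drop from start_loss to end_loss."""
    return (start_loss - end_loss) / start_loss


# Made-up runs, for illustration only (not data from the paper).
example_runs = [(1e18, 4.2), (2e18, 3.9), (4e18, 4.0), (8e18, 3.8)]
frontier = best_loss_frontier(example_runs)

# Loss values quoted in the Figure 2 caption:
# single-epoch compute-optimal loss -> best multi-epoch loss.
diffusion_drop = relative_reduction(10.65, 3.55)
ar_drop = relative_reduction(7.07, 3.71)
print(f"diffusion: {diffusion_drop:.1%}, AR: {ar_drop:.1%}")
```

In the made-up example, the third run has a higher loss than the second, so the frontier keeps the earlier best value at that budget; this is why a frontier plotted this way can flatten but never rise.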