Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh
Abstract
Under strictly controlled pre-training settings, we observe a Crossover: when
unique data is limited, diffusion language models (DLMs) consistently surpass
autoregressive (AR) models by training for more epochs. The crossover shifts
later with more or higher-quality data, earlier with larger models, and
persists across dense and sparse architectures. We attribute the gains to three
compounding factors: (1) any-order modeling, (2) super-dense compute from
iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation;
input or parameter noise improves AR under data constraint but cannot close the
gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B
unique Python tokens overtakes an AR coder trained with strictly matched
settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag
and > 33% on MMLU using only 1B tokens, without any special tricks, just by
repeating standard pre-training data. We also show that rising validation
cross-entropy does not imply degraded downstream performance in this regime.