What Makes Diffusion Language Models Super Data Learners?

Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, Bryan Dai

TL;DR

Token scarcity in LLM pretraining motivates investigating diffusion language models (DLMs) under limited data budgets. The study uses extensive ablations to reveal that random input masking, i.e., token dropout, is the dominant source of DLM data efficiency, while dropout and weight decay provide additional gains in multi-epoch training. A key finding is that autoregressive models equipped with stochastic regularization can match or exceed the data efficiency of DLMs, unifying the mechanisms behind multi-epoch data reuse. The results offer practical strategies to improve data efficiency in LLMs under the token crisis, with broad implications for future model design.

Abstract

Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
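
To make the idea of random input masking concrete, here is a minimal sketch of how it could be applied to next-token-prediction training data. It is an illustration under stated assumptions, not the authors' implementation: the helper names `mask_input_tokens` and `make_training_pair`, the `MASK_ID` placeholder, and the masking probability are all hypothetical, and the excerpt does not specify the exact corruption scheme or ratio the paper uses.

```python
import random

MASK_ID = 0  # hypothetical id reserved for a mask token (assumption)


def mask_input_tokens(token_ids, mask_id, mask_prob, rng):
    """Replace each token with mask_id independently with probability mask_prob."""
    if not 0.0 <= mask_prob <= 1.0:
        raise ValueError("mask_prob must be in [0, 1]")
    return [mask_id if rng.random() < mask_prob else t for t in token_ids]


def make_training_pair(token_ids, mask_id, mask_prob, rng):
    """Build (inputs, targets) for next-token prediction.

    Only the inputs are corrupted; the targets stay clean, so the model
    still learns to predict the true next token from a noisy context.
    """
    inputs = mask_input_tokens(token_ids[:-1], mask_id, mask_prob, rng)
    targets = list(token_ids[1:])
    return inputs, targets


if __name__ == "__main__":
    rng = random.Random(1234)
    sequence = [11, 12, 13, 14, 15, 16, 17, 18]
    # A fresh mask is drawn on every pass over the same sequence, so repeated
    # epochs present the model with different corrupted views of the data.
    for epoch in range(3):
        inputs, targets = make_training_pair(sequence, MASK_ID, 0.3, rng)
        print(epoch, inputs, targets)
```

The design point is that the corruption is resampled each epoch: the same sequence yields a different input view every time it is reused, which is the sense in which masking acts as a stochastic regularizer in multi-epoch training.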

Paper Structure

This paper contains 31 sections, 15 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Validation loss of the 3 methods (left panel) and their accuracies on the downstream benchmarks HellaSwag and PIQA (two right panels).
  • Figure 2: Validation loss of the 5 methods (left panel) and their accuracies on the downstream benchmarks HellaSwag and PIQA (two right panels).
  • Figure 3: Validation loss of the 4 methods (left panel) and their accuracies on the downstream benchmarks HellaSwag and PIQA (two right panels).
  • Figure 4: Validation loss of the 4 methods (left panel) and their accuracies on the downstream benchmarks HellaSwag and PIQA (two right panels).
  • Figure 5: Validation loss of the 4 methods (left panel) and their accuracies on the downstream benchmarks HellaSwag and PIQA (two right panels).