AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

TL;DR

AutoScale tackles the problem that domain data mixes optimized at small scales do not reliably transfer to large-scale LLM pre-training. It introduces a two-stage approach: Direct Data Optimization builds a scalable surrogate mapping from domain weights to validation loss at manageable scales, and Optimal Mix Projection uses a theoretical scale-aware law to extrapolate the optimal mix to larger budgets without retraining. The method is validated on GPT-2 Large and BERT, showing faster convergence and improved downstream performance, with notable insights that diverse domains become more valuable at larger scales. This work provides a practical, scalable path for scale-dependent data curation in LLM pre-training and opens avenues for broader application of scale-aware data mixing strategies.

Abstract

Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model's loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further retraining. Empirically, AutoScale accelerates convergence and improves downstream performance. For instance, when pre-training GPT-2 Large, it achieves a 28% faster perplexity reduction than baselines and up to a 38% speed-up over unweighted training, while yielding best-average results on various downstream tasks. Overall, our findings illustrate how domain importance shifts with training scale, underscoring the need for scale-dependent data curation in LLM training. Our code is open-sourced.
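To make the first stage concrete, here is a minimal sketch of "fit a parametric loss model, then optimize the allocation". It is not the authors' implementation. It assumes, purely for illustration, a power-law surrogate of the form $\sum_i \beta_i N_i^{-\gamma_i}$, where $N_i$ is the amount of data drawn from domain $i$. The parameter values, the budgets, and the function names `surrogate_loss` and `optimal_allocation` are all hypothetical.

```python
import math

def surrogate_loss(n, beta, gamma):
    """Assumed surrogate: sum_i beta_i * n_i^(-gamma_i). Illustrative form only."""
    return sum(b * x ** (-g) for x, b, g in zip(n, beta, gamma))

def optimal_allocation(beta, gamma, budget, iters=200):
    """Minimize the assumed surrogate subject to sum(n) == budget.

    Stationarity gives beta_i * gamma_i * n_i^(-gamma_i - 1) = lam for all i,
    so n_i = (beta_i * gamma_i / lam)^(1 / (gamma_i + 1)). The total is
    decreasing in lam, so we bisect on log(lam) until it matches the budget.
    """
    def alloc(log_lam):
        lam = math.exp(log_lam)
        return [(b * g / lam) ** (1.0 / (g + 1.0)) for b, g in zip(beta, gamma)]

    lo, hi = -100.0, 100.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > budget:
            lo = mid  # allocation too large: raise the multiplier
        else:
            hi = mid  # allocation too small: lower the multiplier
    return alloc(0.5 * (lo + hi))

# Hypothetical fitted parameters for three unnamed domains.
beta = [1.0, 0.6, 0.3]
gamma = [0.2, 0.4, 0.7]

# Two hypothetical budgets in arbitrary units.
n_small = optimal_allocation(beta, gamma, 0.3)
n_large = optimal_allocation(beta, gamma, 1.2)
w_small = [x / sum(n_small) for x in n_small]
w_large = [x / sum(n_large) for x in n_large]

print("weights at small budget:", [round(w, 3) for w in w_small])
print("weights at large budget:", [round(w, 3) for w in w_large])
```

Under this assumed surrogate the optimal domain weights differ between the two budgets: the domain with the smallest exponent gains share as the budget grows. This is one simple way to see why a mixture tuned at a small scale need not stay optimal at a larger one.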

Paper Structure

This paper contains 47 sections, 2 theorems, 33 equations, 11 figures, 10 tables, 2 algorithms.

Key Result

Theorem B.1

Consider the following optimization problem […]. For any two compute budgets $N^{(1)} \neq N^{(2)}$, let $\mathbf{N}^*(N^{(1)})$ and $\mathbf{N}^*(N^{(2)})$ be their respective minimizers. For any third data composition $\mathbf{N}(N^{(3)})$, if there exists some constant $k\in\mathbb{R}^+$ such that […], then $\mathbf{N}(N^{(3)})$ is the minimizer for data budget $N^{(3)}=\sum_{i=1}^m N^{(3)}_i$, given […]

Figures (11)

  • Figure 1: Domain weights that excel at one scale may underperform at another. Weights $w_1$ and $w_2$ are obtained by running DDO (as introduced in Section \ref{sec:ddo_text}) at 0.3B and 1.2B, respectively.
  • Figure 2: Optimizing domain weights with the DDO algorithm for pre-training encoder-only LMs (BERT). DDO substantially reduces validation loss. After reweighting, the loss on every training domain has decreased or remained unchanged. Out-of-domain loss on non-training domains has also decreased considerably. Performance improves on all GLUE tasks (eval metric: cola: Matt. corr., stsb: Pearson corr., rest: acc.) and on SQuAD (acc.).
  • Figure 3: Training 774M decoder-only LMs (GPT-2 Large) for 10B tokens (96k steps). AutoScale-predicted domain weights decrease test perplexity at least $28\%$ faster than any baseline, with up to a $38\%$ speed-up, and achieve the best overall task performance.
  • Figure 4: Domain importance evolves with training data scales. (GPT-2 Large)
  • Figure 5: Illustration: optimal data composition scales in exponential-style functions with training data quantity.
  • ...and 6 more figures
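
The idea behind Figure 5, that optimal compositions follow an exponential-style trend across scales, can be illustrated with a small sketch. It assumes a power-law surrogate $\sum_i \beta_i N_i^{-\gamma_i}$, which is an assumption made here for illustration and is not taken from the text above. Under that assumption the per-domain optimum satisfies $N_i \propto \lambda^{-1/(\gamma_i+1)}$ for a shared multiplier $\lambda$, so two known optima determine a third by the elementwise geometric step $N^{(3)}_i = (N^{(2)}_i)^2 / N^{(1)}_i$. The names `optimal_allocation` and `project`, and all parameter values, are hypothetical.

```python
import math

def optimal_allocation(beta, gamma, budget, iters=200):
    """Minimize sum_i beta_i * n_i^(-gamma_i) subject to sum(n) == budget.

    Uses the closed-form stationarity condition per domain and bisects on
    the log of the Lagrange multiplier to match the budget.
    """
    def alloc(log_lam):
        lam = math.exp(log_lam)
        return [(b * g / lam) ** (1.0 / (g + 1.0)) for b, g in zip(beta, gamma)]

    lo, hi = -100.0, 100.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > budget:
            lo = mid  # allocation too large: raise the multiplier
        else:
            hi = mid  # allocation too small: lower the multiplier
    return alloc(0.5 * (lo + hi))

def project(n1, n2):
    """Elementwise geometric step from two optima: n3_i = n2_i^2 / n1_i."""
    return [b * b / a for a, b in zip(n1, n2)]

# Hypothetical surrogate parameters for three unnamed domains.
beta = [1.0, 0.6, 0.3]
gamma = [0.2, 0.4, 0.7]

# Optima at two small, hypothetical budgets (arbitrary units).
n1 = optimal_allocation(beta, gamma, 0.3)
n2 = optimal_allocation(beta, gamma, 0.6)

# Project to a larger scale without solving the optimization again.
n3 = project(n1, n2)
budget3 = sum(n3)  # the budget is implied by the projection step

# Check against a direct solve at the implied budget.
n3_direct = optimal_allocation(beta, gamma, budget3)
max_gap = max(abs(a - b) for a, b in zip(n3, n3_direct))

print("implied budget:", round(budget3, 4))
print("max gap vs. direct solve:", max_gap)
```

In this sketch the larger budget is a consequence of the projection step rather than a free input. Repeating the step produces a geometric sequence of per-domain amounts, which is the exponential-style behavior in this toy setting.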

Theorems & Definitions (6)

  • Theorem B.1: Scaling Law for Optimal Data Compositions (restated)
  • proof
  • Remark B.2: An example
  • Theorem 2: Scaling Latent Skills
  • proof
  • Remark 2: what happens when $\mathbf{A}$ is not invertible.