DataDecide: How to Predict Best Pretraining Data with Small Experiments

Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge

TL;DR

DataDecide tackles the challenge of cost-effective data decisions for pretraining large language models by building an open, broad suite of 1050 models across 25 data recipes and 14 scales. It compares single-scale ranking against multi-scale scaling-law extrapolations to predict large-scale outcomes, using a diverse set of downstream tasks (OLMES) for evaluation. The study finds that simple small-scale ranking is a strong predictor of large-scale winners (about 80% accuracy) and that existing scaling-law baselines offer no clear compute advantages over this baseline; continuous likelihood proxies further enhance predictability for several tasks. These insights yield practical guidance on how to allocate compute for data decisions and how to choose evaluation metrics to improve reliability in data-driven pretraining decisions.

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
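The "~80% of comparisons correct" figure is a pairwise decision accuracy: for every pair of pretraining corpora, check whether the corpus that scores higher at the small scale also scores higher at the target scale. The sketch below is a minimal illustration of that idea, not the authors' released code; the function name `decision_accuracy`, the corpus labels, and all scores are made up for the example, and ties are simply compared as-is.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs whose small-scale winner matches the target-scale winner.

    Both arguments map a corpus name to a single benchmark score
    (hypothetical inputs; e.g. a mean over random seeds).
    """
    pairs = list(combinations(sorted(small_scores), 2))
    agree = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        target_prefers_a = target_scores[a] > target_scores[b]
        agree += small_prefers_a == target_prefers_a
    return agree / len(pairs)


# Made-up scores for four hypothetical corpora at a small and a target scale.
small = {"A": 0.30, "B": 0.35, "C": 0.32, "D": 0.40}
target = {"A": 0.50, "B": 0.58, "C": 0.60, "D": 0.65}

print(decision_accuracy(small, target))

# With 25 corpora there are 25 * 24 / 2 = 300 pairwise decisions.
print(len(list(combinations(range(25), 2))))
```

In this toy example only the B-versus-C pair flips between scales, so five of the six decisions agree.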

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Which pretraining data to use? Ideally, compare performance of large models with fixed configurations averaged over random seeds (left). In practice, cheaper, smaller-scale experiments are used (center). Here DataDecide measures accuracy of pairwise decisions between 25 pretraining corpora to find efficient prediction methods (right).
  • Figure 2: Accuracy in pairwise decisions on best data when evaluating on the 10 OLMES tasks with Accuracy (shown aggregated in the accuracy-versus-compute figure). Specific tasks have very distinct ranges of sensitivity, with some like ARC Easy being predictable at small scales and others like HellaSwag requiring substantially more compute to predict.
  • Figure 3: Decision accuracy over 8 baseline scaling law variants. At best, these approaches reach only the same compute-to-decision-accuracy frontier as ranking single-scale experiments. DataDecide can be used to iterate on future scaling law prediction methods.
  • Figure 4: Per-task decision accuracy using character-normalized proxy metrics for Accuracy targets. 5 tasks benefit at smaller scales from using raw likelihood of answers (Correct Prob and Total Prob), as opposed to discrete Accuracy or continuous metrics that penalize probability on incorrect answers (Norm Correct Prob, Margin).
  • Figure 5: Why do some tasks or metrics get better or worse decision accuracy? At 150M with Correct Prob, tasks like HellaSwag succeed with low run-to-run variance, and tasks like SocialIQA widely spread the performance assigned to different pretraining data.
  • ...and 1 more figure
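The metric names in the Figure 4 caption can be made concrete with a small per-question sketch. The formulas below are assumptions inferred from the caption's wording (raw likelihood versus metrics that penalize probability on incorrect answers), not definitions taken from this page. The function `proxy_metrics` and the example probabilities are hypothetical, and the inputs are assumed to be already character-normalized answer probabilities.

```python
def proxy_metrics(probs, correct_idx):
    """Per-question metrics from the probabilities a model assigns to each answer choice.

    probs: one (assumed character-normalized) probability per answer choice.
    correct_idx: index of the correct choice.
    All formulas here are illustrative assumptions.
    """
    correct = probs[correct_idx]
    best_incorrect = max(p for i, p in enumerate(probs) if i != correct_idx)
    total = sum(probs)
    return {
        # Discrete: 1 only if the correct choice is ranked first.
        "accuracy": float(correct > best_incorrect),
        # Raw likelihood metrics: ignore how mass is split among wrong answers.
        "correct_prob": correct,
        "total_prob": total,
        # Metrics that penalize probability placed on incorrect answers.
        "norm_correct_prob": correct / total,
        "margin": correct - best_incorrect,
    }


# Made-up probabilities for a four-choice question whose correct answer is index 2.
metrics = proxy_metrics([0.10, 0.30, 0.20, 0.15], correct_idx=2)
print(metrics)
```

Here Accuracy is 0 because a wrong choice is ranked first, while Correct Prob still records a graded signal (0.20). That graded signal is one plausible reason continuous proxies can separate corpora at scales where discrete Accuracy cannot.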