DataDecide: How to Predict Best Pretraining Data with Small Experiments

Ian Magnusson; Nguyen Tai; Ben Bogin; David Heineman; Jena D. Hwang; Luca Soldaini; Akshita Bhagia; Jiacheng Liu; Dirk Groeneveld; Oyvind Tafjord; Noah A. Smith; Pang Wei Koh; Jesse Dodge

TL;DR

DataDecide tackles the challenge of cost-effective data decisions for pretraining large language models by building an open, broad suite of 1050 models across 25 data recipes and 14 scales. It compares single-scale ranking against multi-scale scaling-law extrapolations to predict large-scale outcomes, using a diverse set of downstream tasks (OLMES) for evaluation. The study finds that simple small-scale ranking is a strong predictor of large-scale winners (about 80% accuracy) and that existing scaling-law baselines offer no clear compute advantages over this baseline; continuous likelihood proxies further enhance predictability for several tasks. These insights yield practical guidance on how to allocate compute for data decisions and how to choose evaluation metrics to improve reliability in data-driven pretraining decisions.

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
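To make the "~80% of comparisons correct" figure concrete, the sketch below shows one plausible way to score a single-scale ranking: for every pair of data recipes, check whether the recipe that wins at the small scale also wins at the target scale. This is an illustrative reading of the abstract, not the authors' implementation; the function name `decision_accuracy`, the recipe names, and all scores are hypothetical.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of recipe pairs whose small-scale winner is also the target-scale winner.

    Both arguments map a recipe name to a benchmark score (higher is better).
    Assumption: pairs tied at the target scale are skipped, since they have no winner.
    """
    correct = 0
    total = 0
    for a, b in combinations(sorted(small_scores), 2):
        small_diff = small_scores[a] - small_scores[b]
        target_diff = target_scores[a] - target_scores[b]
        if target_diff == 0:
            continue  # no winner at the target scale, so nothing to predict
        total += 1
        # The prediction is correct when both differences have the same sign.
        if small_diff * target_diff > 0:
            correct += 1
    return correct / total if total else float("nan")


# Hypothetical scores for four recipes (made-up numbers, not results from the paper).
small = {"recipe_a": 0.31, "recipe_b": 0.29, "recipe_c": 0.35, "recipe_d": 0.33}
target = {"recipe_a": 0.42, "recipe_b": 0.44, "recipe_c": 0.50, "recipe_d": 0.47}

accuracy = decision_accuracy(small, target)
print(f"decision accuracy: {accuracy:.3f}")
```

In this toy example only the pair (`recipe_a`, `recipe_b`) flips between scales, so five of the six pairwise decisions are predicted correctly. The same function could be applied to a continuous likelihood metric in place of accuracy, provided the metric is oriented so that higher is better.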
