Loss-to-Loss Prediction: Scaling Laws for All Datasets

David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade

TL;DR

The paper introduces loss-to-loss prediction, a framework for translating scaling-law fits between data distributions to probe how pre-training and downstream distributions affect loss. It models cross-distribution relationships with a shifted power-law form and a unified parameterization, and derives train-to-train, train-to-test, and test-to-test translations. These enable extrapolation beyond the compute budgets used for fitting and offer invariance of the compute-optimal model size under distribution shifts. Empirically, the approach works across six pre-training datasets and multiple downstream tasks, showing that translating fits across datasets can, in some settings, yield more accurate predictions than extrapolating a scaling law fit independently on a single dataset, and that downstream loss is a stable proxy for transfer performance. The work provides both theoretical insights and practical tools for data selection, transfer learning, and forecasting large-model performance with limited new runs, while outlining limitations related to irreducible entropy estimation and task diversity.

Abstract

While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
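To make the "shifted power law" relationship concrete, the sketch below fits one plausible parameterization, $L_y \approx K\,(L_x - E_x)^{\kappa} + E_y$, to paired losses. It is an illustration under stated assumptions, not the authors' fitting procedure: the symbols $K$, $\kappa$, $E_x$, $E_y$, the function names, and all numbers are hypothetical, and the offsets $E_x$ and $E_y$ (standing in for irreducible loss) are held fixed so that the fit reduces to a straight line in log-log space. In practice the offsets would also have to be estimated.

```python
import numpy as np

def fit_shifted_power_law(loss_x, loss_y, e_x, e_y):
    """Fit loss_y ~= K * (loss_x - e_x)**kappa + e_y with e_x and e_y held fixed.

    With the offsets fixed, log(loss_y - e_y) is linear in log(loss_x - e_x),
    so an ordinary least-squares line recovers kappa (slope) and log K (intercept).
    """
    x = np.log(np.asarray(loss_x, dtype=float) - e_x)
    y = np.log(np.asarray(loss_y, dtype=float) - e_y)
    kappa, log_k = np.polyfit(x, y, 1)
    return float(np.exp(log_k)), float(kappa)

def predict(loss_x, k, kappa, e_x, e_y):
    """Translate a loss on distribution x into a predicted loss on distribution y."""
    return k * (np.asarray(loss_x, dtype=float) - e_x) ** kappa + e_y

# Synthetic, made-up data: offsets and coefficients are assumptions for illustration.
E_X, E_Y = 1.8, 2.1
TRUE_K, TRUE_KAPPA = 1.3, 0.9
loss_x = np.array([3.6, 3.3, 3.0, 2.8, 2.6])          # losses of smaller models
loss_y = TRUE_K * (loss_x - E_X) ** TRUE_KAPPA + E_Y  # paired losses, noise-free

k_hat, kappa_hat = fit_shifted_power_law(loss_x, loss_y, E_X, E_Y)

# Extrapolate to a lower loss than any point used in the fit.
extrapolated = float(predict(2.3, k_hat, kappa_hat, E_X, E_Y))
print(k_hat, kappa_hat, extrapolated)
```

The same function covers all three settings described in the abstract; only the meaning of the paired losses changes (two train losses, a train loss and a test loss, or two test losses).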

Paper Structure

This paper contains 34 sections, 25 equations, 20 figures, 9 tables.

Figures (20)

  • Figure 1: (Left) Train-to-train prediction from FineWeb-edu to all 6 training sets. Each datapoint represents a pair of models that are "joined" on model size $N$ and dataset size $D$. Dashed lines represent extrapolation and stars represent 3.3B models trained with 20x compute of the largest dot. These large models are not used to fit the curves. (Center) Test-to-test prediction of HellaSwag cross entropy loss between models trained on FineWeb-edu and models trained on the other datasets. Again each datapoint represents two models joined on model and dataset size. The downstream loss is the cross entropy loss of the correct answer to the multiple choice problem when phrased as a cloze task. (Right) Train-to-test prediction from FineWeb-edu to four downstream tasks. Each datapoint represents a single model and its "transfer" performance on the validation data.
  • Figure 2: Train-to-train fits. Each point on the plot represents the final loss of two models: $\hat{f}_0^{N,D}$ which is trained on dataset 0 and $\hat{f}_1^{N,D}$ which is trained on dataset 1. The models are paired when they use the same number of parameters $N$ and tokens $D$. Starred points indicate a large model trained for the purpose of testing the extrapolation of the curves, which are only fit on the dotted points.
  • Figure 3: Train-to-test fits. Each datapoint represents a single model trained on the dataset in the subplot title and then evaluated on a different dataset as indicated by the color.
  • Figure 4: Train-to-test transfer for downstream tasks. On the test set we evaluate the CE loss of the correct multiple choice answer as a cloze task.
  • Figure 5: Test-to-test predictions for downstream tasks. Each subplot illustrates a different downstream task. The x-axis always reports the test loss for models trained on FineWeb-edu, and the y-axis shows test loss for all 6 of the different training distributions. Each point represents two models, joined when they share the same model size and training dataset size.
  • ...and 15 more figures
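
Several captions above describe datapoints as pairs of models "joined" on model size $N$ and dataset size $D$. A minimal sketch of that pairing step follows, assuming each training run is stored as a final loss keyed by its `(N, D)` tuple. The dictionary layout, the function name `join_runs`, and every number are hypothetical and chosen only for illustration.

```python
def join_runs(runs_a, runs_b):
    """Pair the final losses of runs that share the same (N, D) key.

    runs_a and runs_b map (num_parameters, num_tokens) -> final loss for models
    trained on two different datasets. Runs without a counterpart are dropped.
    """
    shared = sorted(runs_a.keys() & runs_b.keys())
    return [(key, runs_a[key], runs_b[key]) for key in shared]

# Made-up final losses for two hypothetical training sets.
runs_dataset_0 = {
    (20_000_000, 400_000_000): 3.61,
    (150_000_000, 3_000_000_000): 3.02,
    (600_000_000, 12_000_000_000): 2.71,
}
runs_dataset_1 = {
    (20_000_000, 400_000_000): 3.95,
    (150_000_000, 3_000_000_000): 3.30,
    (1_000_000_000, 20_000_000_000): 2.80,  # no counterpart in dataset 0
}

pairs = join_runs(runs_dataset_0, runs_dataset_1)
for (n, d), loss_0, loss_1 in pairs:
    print(f"N={n:>13,} D={d:>15,} loss_0={loss_0:.2f} loss_1={loss_1:.2f}")
```

Each resulting pair corresponds to one point in a train-to-train or test-to-test plot: the loss on dataset 0 on the x-axis and the loss on dataset 1 on the y-axis.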