Small-to-Large Generalization: Data Influences Models Consistently Across Scale

Alaa Khaddaj, Logan Engstrom, Aleksander Madry

TL;DR

The paper tackles the challenge of understanding how training data distributions affect large-scale model behavior without incurring prohibitive compute costs. By systematically comparing losses from large reference models and proxies of varying size across diverse data distributions, it shows that data influence is broadly consistent across compute scale, though correlation strength depends on proxy size and task. It also demonstrates the utility of proxy models in two downstream tasks (data attribution in vision and dataset selection for language models) using TRAK-based datamodels and the DsDm framework, while noting limitations at very small proxy scales. Overall, the work provides empirical guidance for using small proxies to study data influence on large models, enabling cost-efficient analysis and practical applications in data management. These findings inform when proxy models are reliable and how to balance proxy size against accuracy in real-world data-centric ML workflows.

Abstract

Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.

Paper Structure

This paper contains 91 sections, 27 equations, 23 figures, 6 tables, 3 algorithms.

Figures (23)

  • Figure 1: Proxy-model test loss highly correlates with large-model test loss across choice of training data distribution, even across a large gap in scale. Above, we plot the losses of a small-scale proxy (57M parameters) against those of the reference model (760M parameters). Here, the small-scale model trains with 175× less compute than the reference model. Each column represents model loss on a different test distribution, ranging from LM benchmarks (SQuAD/HellaSwag) to pretraining data distributions (the Pile).
  • Figure 2: Correlation between large- and small-scale model predictions is consistently high, even across large gaps (orders of magnitude) in training compute scale. We plot small- to large-scale correlation against small-scale proxy model compute. There is also large variation across choice of test set: correlation is consistently high on four of six tasks, while losses on SQuAD and TriviaQA correlate less.
  • Figure 3: Proxy models can be highly predictive of large-scale model predictions even when predicting as well as randomly on a given test set. We plot small- to large-scale loss correlation against small-scale proxy model accuracy on the given task, normalized to show improvement over outputting a random guess (in absolute accuracy). On a number of test sets, proxy models perform no better than random guessing, but still highly correlate with the reference model (which always achieves significantly better than random guessing).
  • Figure 4: Proxy model predictions can highly correlate with those of the reference model on individual test samples. We visualize loss on individual samples for models at each scale across varying training datasets. The proxy model here has 57M parameters and trains with around 175× less compute than the 760M-parameter reference model. See Figure 5 for a distributional plot showing the correlation across all samples on each test set.
  • Figure 5: The correlation between large- and small-scale model losses on individual samples is highly dependent on the test distribution. We show a histogram of the correlation between large model and proxy model predictions on individual test samples for the test distribution in each column. We plot the coefficient of determination ($R^2$) between the losses of the small and large models on all examples in the downstream task.
  • ...and 18 more figures
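
The per-sample comparison described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes that each model's losses are stored as a nested list indexed by training dataset and then by test sample, and that $R^2$ is computed as the squared Pearson correlation, which equals the coefficient of determination of a simple linear fit. The function names `pearson_r` and `per_sample_r2` and the synthetic losses are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def per_sample_r2(proxy_losses, reference_losses):
    """Squared Pearson r for each test sample.

    Both inputs are indexed as losses[dataset_index][sample_index]
    (an assumed layout). For every test sample, the proxy and reference
    losses are correlated across the training datasets.
    """
    num_samples = len(proxy_losses[0])
    r2_values = []
    for j in range(num_samples):
        proxy_col = [row[j] for row in proxy_losses]
        ref_col = [row[j] for row in reference_losses]
        r2_values.append(pearson_r(proxy_col, ref_col) ** 2)
    return r2_values


if __name__ == "__main__":
    # Synthetic stand-in data: 8 training datasets, 5 test samples.
    random.seed(0)
    proxy = [[random.uniform(2.0, 5.0) for _ in range(5)] for _ in range(8)]
    # Hypothetical reference losses: a noisy linear function of proxy losses.
    reference = [[0.8 * loss + random.gauss(0.0, 0.2) for loss in row]
                 for row in proxy]
    for j, r2 in enumerate(per_sample_r2(proxy, reference)):
        print(f"test sample {j}: R^2 = {r2:.3f}")
```

A histogram of the returned values over all samples in a test set would correspond to one column of the distributional plot. Averaging each dataset's losses over the test set before calling `pearson_r` gives the aggregate, test-set-level correlation instead.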