Small-to-Large Generalization: Data Influences Models Consistently Across Scale
Alaa Khaddaj, Logan Engstrom, Aleksander Madry
TL;DR
The paper tackles the challenge of understanding how training data distributions affect large-scale model behavior without incurring prohibitive compute costs. By systematically comparing the losses of large reference models with those of smaller proxy models of varying sizes across diverse data distributions, it shows that data influence is broadly consistent across compute scale, though the strength of the correlation depends on proxy size and task. It also demonstrates the utility of proxy models in two downstream tasks, data attribution in vision and dataset selection for language models, using TRAK-based datamodels and the DsDm framework, while noting limitations at very small proxy scales. Overall, the work provides empirical guidance on using small proxies to study data influence on large models, which enables cost-efficient analysis and practical applications in data management. The findings indicate when proxy models are reliable and how to trade off proxy size against accuracy in real-world data-centric ML workflows.
Abstract
Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
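As a rough illustration of the comparison described above, the sketch below computes the Pearson correlation between the test losses of a small proxy model and a large reference model, with one loss pair per candidate training distribution. It is a minimal sketch, not the authors' evaluation code: the `pearson` helper, the variable names, and all loss values are hypothetical and chosen only to show the shape of the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical test losses (illustrative numbers, not results from the paper).
# Index i corresponds to the i-th candidate training data distribution.
proxy_losses = [3.10, 2.95, 3.40, 2.80, 3.05]      # small proxy model
reference_losses = [2.20, 2.08, 2.45, 1.95, 2.15]  # large reference model

r = pearson(proxy_losses, reference_losses)
print(f"proxy-reference loss correlation: {r:.3f}")
```

A correlation near 1 on a given task would suggest that the proxy ranks training distributions much as the reference model does, even though its absolute losses are higher; a low correlation would suggest the proxy is too small to stand in for the larger model on that task.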
