Scaling laws for learning with real and surrogate data

Ayush Jain; Andrea Montanari; Eren Sasoglu

TL;DR

The paper addresses the challenge of improving learning when target data are scarce by incorporating surrogate data through a weighted ERM framework with an adjustable surrogate weight $\boldsymbol{\beta}$ and regularizer $\boldsymbol{\theta}$. It develops a unifying scaling law that predicts how test error depends on the original and surrogate sample sizes $(n,m)$ and the weight $\boldsymbol{\beta}$ across several theoretical models (Gaussian sequence model, non-parametric regression in Sobolev classes, low-dimensional asymptotics, and high-dimensional ridge regression), and validates these predictions on diverse empirical tasks (NLP sentiment analysis, image classification, and genomic survival analysis). A key theoretical insight is that surrogate data act as a shrinkage (regularization) term, analogous to Stein's paradox, so that optimal weighting can yield improvements even when the surrogate data are not closely aligned with the target distribution. The practical impact is a principled method to decide how many surrogate samples to collect and how to weight them during training, enabling cost-effective data augmentation and improved generalization on the original distribution.

Abstract

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as 'surrogate data'. We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data are unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
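To make the weighted ERM idea concrete, the following is a minimal sketch for ridge regression, not the authors' implementation. It assumes the objective is a convex combination of the two empirical risks, $(1-\alpha)\,\hat R_n^{\text{orig}}(\theta) + \alpha\,\hat R_m^{\text{surr}}(\theta) + \lambda\|\theta\|^2$, with $\alpha$ the surrogate weight; this form, the isotropic Gaussian covariates, the sample sizes, and all function names are assumptions made for illustration. The surrogate parameter is placed at an angle $\gamma$ from the target parameter, loosely in the spirit of the simulated ridge setting in Figure 5.

```python
import numpy as np

def weighted_ridge(X, y, Xs, ys, alpha, lam):
    """Closed-form minimizer of the (assumed) weighted ERM objective
    (1-alpha)*mean((y - X t)^2) + alpha*mean((ys - Xs t)^2) + lam*||t||^2."""
    n, d = X.shape
    m = Xs.shape[0]
    A = (1 - alpha) * X.T @ X / n + alpha * Xs.T @ Xs / m + lam * np.eye(d)
    b = (1 - alpha) * X.T @ y / n + alpha * Xs.T @ ys / m
    return np.linalg.solve(A, b)

def simulate(n=30, m=500, d=20, gamma=np.pi / 6, sigma=1.0,
             lam=2.0 ** -10, seed=0):
    """Hypothetical toy experiment: excess risk on the original
    distribution as a function of the surrogate weight alpha."""
    rng = np.random.default_rng(seed)
    # Unit-norm target parameter and a surrogate parameter at angle gamma.
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    u = rng.standard_normal(d)
    u -= (u @ theta) * theta
    u /= np.linalg.norm(u)
    theta_s = np.cos(gamma) * theta + np.sin(gamma) * u
    # Original (scarce) and surrogate (plentiful) samples.
    X = rng.standard_normal((n, d))
    y = X @ theta + sigma * rng.standard_normal(n)
    Xs = rng.standard_normal((m, d))
    ys = Xs @ theta_s + sigma * rng.standard_normal(m)
    alphas = np.linspace(0.0, 1.0, 21)
    # With isotropic covariates the excess test risk is ||theta_hat - theta||^2.
    risks = np.array([
        np.sum((weighted_ridge(X, y, Xs, ys, a, lam) - theta) ** 2)
        for a in alphas
    ])
    return alphas, risks

alphas, risks = simulate()
best = int(np.argmin(risks))
print(f"alpha=0 (original data only): excess risk {risks[0]:.3f}")
print(f"alpha=1 (surrogate data only): excess risk {risks[-1]:.3f}")
print(f"best alpha on grid: {alphas[best]:.2f}, excess risk {risks[best]:.3f}")
```

The sketch scores each $\alpha$ with the true excess risk, which is only available in simulation. In practice $\alpha$ would be chosen on held-out data from the original distribution.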

Paper Structure (36 sections, 7 theorems, 89 equations, 30 figures)


Introduction and overview
Motivation and formulation
Summary of results
Related work
Regularization, Gaussian mean estimation, Stein paradox
Theoretical results
Gaussian sequence model
Non-parametric regression in Sobolev classes
Low-dimensional asymptotics
High-dimensional linear regression
Empirical results
Binary classification with Gaussian mixture data
Linear regression with Gaussian mixture data
Sentiment analysis in movie reviews
Image classification with CIFAR10 and CIFAR100
...and 21 more sections

Key Result

Theorem 1

Let $\omega_1\le \omega_2\le \cdots$ be the ordered eigenvalues of ${\boldsymbol \Omega}$, and denote by ${\boldsymbol v}_i$ the corresponding eigenvectors. Further denote by ${\boldsymbol \theta}_{*,>k}$, ${\boldsymbol \theta}^s_{*,>k}$ the projections of ${\boldsymbol \theta}_*$, ...

Figures (30)

Figure 1: IMDB and Rotten Tomatoes data and neural networks. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. \ref{eq:FirstScaling}.
Figure 2: Performance of unweighted vs weighted ERM approach for the setting in Figure \ref{fig:rottenAlphaNN}.
Figure 3: CIFAR10 and CIFAR100 data. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. \ref{eq:FirstScaling}.
Figure 4: Lasso-based Cox regression on TCGA PanCancer dataset. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. \ref{eq:FirstScaling}.
Figure 5: Ridge regression on simulated data. Here $d=500$, $n=1000$, $\sigma^2=\sigma_s^2=1$, $\|{\boldsymbol \theta}_*\|=\|{\boldsymbol \theta}_{*,s}\|=1$, regularization parameter $\lambda=2^{-10}$, and $m$ varies by column. Top row: $\gamma=\pi/2$; bottom row: $\gamma=\pi/6$.
...and 25 more figures

Theorems & Definitions (15)

Theorem 1
Theorem 2
Remark 3.1
Theorem 3
Remark 3.2: Optimizing $\alpha$ over the validation set
Remark 3.3: Relation to scaling laws
Proposition B.1
Remark B.1
Remark B.2
Remark B.3
...and 5 more
