Scaling Laws for the Value of Individual Data Points in Machine Learning

Ian Covert, Wenlong Ji, Tatsunori Hashimoto, James Zou

TL;DR

The paper studies how the contribution of an individual data point changes as the training dataset grows, moving beyond purely aggregate scaling laws by introducing individualized data scaling laws. It defines the marginal contribution of a point $z$ to a dataset $\mathcal{D}$ as $\Delta(z, \mathcal{D})$ and posits that the expected marginal contribution over datasets of size $k$ follows $\psi_k(z) \approx \frac{c(z)}{k^{\alpha(z)}}$, where $c(z)$ and $\alpha(z)$ are point-specific parameters. The authors provide empirical evidence across linear and nonlinear models along with theoretical insights ($\alpha(z)$ typically in $[1, 1.5]$, and $\alpha(z) = 2$ in linear regression). They develop two efficient estimators, a maximum-likelihood estimator and an amortized neural estimator, to learn per-point scaling parameters from limited, noisy observations. They further demonstrate applications to data valuation and data subset selection, showing that scaling-law-informed scores can approximate distributional Shapley valuations and guide point additions in a size-dependent manner. Overall, this work initiates a data-centric view of how data value scales with dataset size and offers practical tools for estimating and using these scaling behaviors in real-world datasets.

Abstract

Recent works have shown that machine learning models improve at a predictable rate with the total amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help design a model's training dataset, but they typically take an aggregate view of the data by only considering the dataset's size. We introduce a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point's contribution to a model's performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets while others are relatively more useful as a part of large datasets. We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally, we demonstrate applications of the individualized scaling laws to data valuation and data subset selection. Overall, our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
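The log-linear form described above means that $\log |\psi_k(z)|$ is approximately a straight line in $\log k$ with slope $-\alpha(z)$ and intercept $\log c(z)$. The following is a minimal sketch of that idea using ordinary least squares in log-space on synthetic data. It is not the paper's maximum likelihood or amortized estimator; the function name `fit_scaling_law`, the noise model, and the values of `true_c` and `true_alpha` are illustrative assumptions.

```python
import numpy as np


def fit_scaling_law(k, psi):
    """Fit log|psi| = log c - alpha * log k by ordinary least squares.

    Returns (c_hat, alpha_hat, r2), where r2 is the R^2 of the linear
    trend in log-space. This is a simple stand-in, not the paper's estimator.
    """
    log_k = np.log(np.asarray(k, dtype=float))
    log_psi = np.log(np.abs(np.asarray(psi, dtype=float)))

    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(log_k, log_psi, deg=1)

    pred = slope * log_k + intercept
    ss_res = np.sum((log_psi - pred) ** 2)
    ss_tot = np.sum((log_psi - log_psi.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return float(np.exp(intercept)), float(-slope), float(r2)


# Synthetic expected contributions for one hypothetical data point,
# with multiplicative noise (an assumption made for this sketch).
rng = np.random.default_rng(0)
k = np.array([10, 20, 50, 100, 200, 500, 1000])
true_c, true_alpha = 0.5, 1.2
psi = true_c / k**true_alpha * np.exp(rng.normal(0.0, 0.05, size=k.size))

c_hat, alpha_hat, r2 = fit_scaling_law(k, psi)
print(f"c = {c_hat:.3f}, alpha = {alpha_hat:.3f}, R^2 = {r2:.4f}")
```

With real data, each `psi` value would itself be a noisy average of marginal contributions at a given dataset size, which is the setting the paper's estimators are designed to handle with few samples per point.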

Paper Structure (22 sections, 5 theorems, 50 equations, 20 figures, 8 tables)

Key Result

Theorem 1.1

If we denote the noise in $z$ as $\epsilon = y - x^\top\beta^*$, we have the following expectation with respect to the labels conditioned on the preceding dataset $\mathbf{X}_\mathcal{D}$:

Figures (20)

  • Figure 1: Individualized scaling laws for logistic regression trained on the IMDB dataset. Top: Marginal contribution vs. the dataset size in log-scale for several data points with a range of scaling exponents $\alpha(z)$. Left: Histogram of $R^2$ scores for linear trend lines fit to each data point in the log-scale. Right: Plot of the $R^2$ score from our scaling law predictions at each cardinality, measured across data points. We achieve an overall $R^2 = 0.987$ for the predictions across all points and dataset sizes.
  • Figure 2: Likelihood-based scaling law estimator. The estimator is fit for a single example from the IMDB dataset, and the scaling parameters are fit using $m = 100$ samples.
  • Figure 3: Histogram of $R^2$ score when fitting the scaling law. Similar to \ref{fig:validation}, we find that linear trends in log-space achieve high $R^2$ scores for most data points, which supports the parametric form in \ref{eq:scaling-law}.
  • Figure 4: Histogram of estimated $\alpha(z)$. The estimated values have a mode between 1 and 1.5 and exhibit significant heterogeneity. We exclude points with $R^2 < 0.8$ so that only reliable $\alpha(z)$ estimates are shown.
  • Figure 5: Marginal contribution for points with different $\alpha(z)$. Similar to \ref{fig:validation}, we plot the expected contribution $\log |\psi_k(z)|$ against the dataset size $\log k$. Lines with different $\alpha(z)$ cross one another, indicating that the ranking of valuable points depends on the dataset size $k$.
  • ...and 15 more figures
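
The crossing behavior noted for Figure 5 can be worked out directly from the parametric form $\psi_k(z) \approx c(z) / k^{\alpha(z)}$. As a sketch, assume two points $z_1, z_2$ with positive $c(z_1), c(z_2)$ and $\alpha(z_1) > \alpha(z_2)$. Setting the two expected contributions equal and solving for $k$ gives the crossover size $k^*$:

$$
\frac{c(z_1)}{k^{\alpha(z_1)}} = \frac{c(z_2)}{k^{\alpha(z_2)}}
\quad\Longrightarrow\quad
k^* = \left( \frac{c(z_1)}{c(z_2)} \right)^{\frac{1}{\alpha(z_1) - \alpha(z_2)}}
$$

For example, with the hypothetical values $c(z_1) = 1$, $\alpha(z_1) = 1.5$, $c(z_2) = 0.1$, $\alpha(z_2) = 1$, we get $k^* = 10^{1/0.5} = 100$. Under these assumed values, $z_1$ contributes more for $k < 100$ and $z_2$ contributes more for $k > 100$, so the ranking of the two points flips as the dataset grows.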

Theorems & Definitions (8)

  • Theorem 1.1
  • Theorem 1.2
  • proof
  • Lemma 1.3: Asymptotic normality of M-estimators
  • Lemma 1.4: Corollary 7.1 from kuchibhotla2018deterministic
  • Theorem 1.7: Formal version of \ref{thm:m-estimator}
  • proof: Proof of \ref{thm:m-estimator}
  • Remark 1.8