Data Selection for ERMs

Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff

TL;DR

This work reframes core learning questions through a data-centric lens: given a fixed learning rule, how small can the size $n$ of a selected training subset be while still yielding performance close to training on the full population? The authors establish tight data-selection results across mean estimation, linear classification, linear regression, and stochastic convex optimization (SCO), revealing that carefully chosen small subsets can dramatically outperform random sampling, with rates governed by problem structure such as VC dimension and star number. They provide new bounds, a taxonomy of error rates, and techniques (including Carathéodory-style sparsification, Steinitz-type arguments, and connections to epsilon-nets) to characterize when a small subset suffices and when it does not. The paper also connects data selection to coresets and sample compression, and outlines directions for future work, including regression with unweighted selection and additive guarantees in SCO. Overall, the results clarify when and how a fixed ERM can achieve near-optimal performance from a compact data subset, weighted or unweighted, with both theoretical and practical implications for data-efficient learning.

Abstract

Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.

Paper Structure

This paper contains 45 sections, 21 theorems, 92 equations, 2 figures.

Key Result

Theorem 1

For every $n \geq 1$, where $D$ ranges over all finite multisets of $\mathbb{R}$. Above, the ratio $\frac{L^\star_D(n)}{L^\star_D}$ is defined to be $1$ when $L_D^\star(n)= L_D^\star=0$. If only $L_D^\star=0$, the ratio is defined as $\infty$.
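The ratio $\frac{L^\star_D(n)}{L^\star_D}$ and its zero-loss conventions can be made concrete with a small brute-force sketch. The code below is an illustration under stated assumptions, not the paper's construction: it assumes the loss is the average squared distance over $D$, that the ERM is the arithmetic mean of the selected points, and that selection means choosing a sub-multiset of $D$ of size at most $n$ without weights. All function names are hypothetical.

```python
from itertools import combinations


def squared_loss(data, h):
    """Average squared loss of the predictor h over the population `data`."""
    return sum((x - h) ** 2 for x in data) / len(data)


def mean(points):
    """ERM for the squared loss: the arithmetic mean of the given points."""
    return sum(points) / len(points)


def full_loss(data):
    """L*_D: population loss of the ERM trained on all of D (assumed definition)."""
    return squared_loss(data, mean(data))


def best_selected_loss(data, n):
    """L*_D(n): smallest population loss when the ERM is trained on at most
    n points selected from D. Brute force over all sub-multisets (assumption:
    unweighted selection, each element of D used at most once)."""
    best = float("inf")
    for k in range(1, min(n, len(data)) + 1):
        for subset in combinations(data, k):
            best = min(best, squared_loss(data, mean(subset)))
    return best


def selection_ratio(data, n, tol=1e-12):
    """L*_D(n) / L*_D with the conventions stated in the theorem:
    1 when both losses are zero, infinity when only L*_D is zero."""
    full = full_loss(data)
    selected = best_selected_loss(data, n)
    if full <= tol:
        return 1.0 if selected <= tol else float("inf")
    return selected / full
```

For example, with $D = \{0, 1, 3\}$ and $n = 1$ the best single point is $1$, which gives a ratio of $15/14$ under these assumptions; with $n = 3$ the whole population can be selected and the ratio is $1$. The enumeration is exponential in $n$, so this is only suitable for checking small cases by hand.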

Figures (2)

  • Figure 1: An example of a non-continuous ERM $A$ satisfying $L^\star_D(n=d; A) = 0$, illustrated for dimensions $d = 1$ and $d = 2$. The separating hyperplane is chosen to be the affine span of $d$ input points; the order of these points encodes which of the two halfspaces is labeled '$+$' and which is labeled '$-$'. If the $d$ points on the hyperplane include both '$+$' and '$-$' labels, the hyperplane is labeled recursively by dividing it into two halves, assigning one half '$+$' and the other half '$-$' in a consistent manner. Note that the resulting classifier forms a generalized halfspace in which both halves are convex sets, although neither needs to be open or closed. See the 2D example (right picture) for an illustration.
  • Figure 2: A 2D illustration for Example \ref{ex:strict_convexity}. The three vectors $(\mathbf{v}_1,\mathbf{v}_2,\mathbf{v}_3)$ define convex functions $f_{\mathbf{v}_1}$, $f_{\mathbf{v}_2}$, $f_{\mathbf{v}_3}$. The red hexagon marks the common zero set (minimizers) when all three functions are combined. Removing $f_{\mathbf{v}_3}$ enlarges the zero set to the blue region (a rhombus), showing how the feasible set for an ERM grows once strict convexity is violated.

Theorems & Definitions (32)

  • Theorem 1: Mean Estimation
  • Proposition 2: Carathéodory's Theorem for Convex Functions
  • Definition 3
  • Theorem 4: Linear Classification
  • Theorem 5: Binary Classification
  • Example 1: Linear Regression
  • Definition 6
  • Definition 7
  • Theorem 8: Linear Regression
  • Example 2
  • ...and 22 more