Data Selection for ERMs
Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff
TL;DR
This work reframes core learning questions through a data-centric lens: given a fixed learning rule, how small can a training subset of size $n$ be while still yielding performance close to training on the full population? The authors establish tight data-selection results across mean estimation, linear classification, linear regression, and stochastic convex optimization (SCO), showing that carefully chosen small subsets can dramatically outperform random sampling, with rates governed by structural parameters of the problem such as the VC dimension and the star number. They provide new bounds, a taxonomy of error rates, and techniques (including Carathéodory-style sparsification, Steinitz-type arguments, and connections to epsilon-nets) that characterize when a small subset suffices and when it does not. The paper also connects data selection to coresets and sample compression, and outlines directions for future work, including regression with unweighted selection and additive guarantees in SCO. Overall, the results clarify when and how a fixed ERM can achieve near-optimal performance from a compact subset of the data, weighted or unweighted, with both theoretical and practical implications for data-efficient learning.
Abstract
Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.
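To make the data-selection question concrete, here is a minimal Python sketch for the mean-estimation case, where the ERM for squared loss is simply the average of the selected points. The greedy rule below is an illustrative heuristic chosen for this sketch and is not the selection procedure analyzed in the paper; the function names `greedy_select_for_mean` and `mean_error`, the synthetic Gaussian population, and the budget `n = 20` are all assumptions made for the example.

```python
import numpy as np


def greedy_select_for_mean(points, n):
    """Greedily pick n distinct rows whose unweighted mean tracks the population mean.

    Illustrative heuristic only (an assumption of this sketch, not the paper's method).
    """
    points = np.asarray(points, dtype=float)
    N = points.shape[0]
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    target = points.mean(axis=0)            # output of the ERM on the full population
    available = np.ones(N, dtype=bool)
    running_sum = np.zeros(points.shape[1])
    selected = []
    for k in range(1, n + 1):
        # Mean of the subset if each candidate were added as the k-th point.
        candidate_means = (running_sum + points) / k
        errors = np.linalg.norm(candidate_means - target, axis=1)
        errors[~available] = np.inf         # select without repetition
        best = int(np.argmin(errors))
        selected.append(best)
        available[best] = False
        running_sum += points[best]
    return selected


def mean_error(points, indices):
    """Euclidean distance between the subset mean and the population mean."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(points[indices].mean(axis=0) - points.mean(axis=0)))


rng = np.random.default_rng(0)
population = rng.normal(size=(500, 5))      # synthetic population with N = 500
n = 20                                      # data-selection budget

greedy_idx = greedy_select_for_mean(population, n)
random_idx = rng.choice(len(population), size=n, replace=False)
print("greedy selection error:", mean_error(population, greedy_idx))
print("random sampling error:", mean_error(population, random_idx))
```

Comparing the two printed errors for different budgets `n` gives an informal feel for the gap between selected and randomly sampled subsets; the sketch carries no guarantee, and the optimal rates are the subject of the paper itself.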
