Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder
TL;DR
The paper tackles data-efficient learning by selecting small, representative subsets via a clustering-based sensitivity sampling framework. It combines a full-data $k$-means clustering with sub-sample sensitivity sampling under a Hölder-continuous loss assumption to produce a weighted coreset whose average loss approximates the full dataset within a factor of $1\pm\varepsilon$ plus an additive term tied to the clustering cost. The approach is proven to be robust to outliers through a $(k,z)$-clustering objective and extended to regression tasks, with practical algorithms that operate in a small number of adaptive rounds. Empirical results demonstrate gains in fine-tuning foundation models and image-classification tasks, while achieving competitive performance for linear regression with significantly reduced computation. Overall, the method offers theoretically grounded, scalable data selection that improves training efficiency for large models and complex datasets.
Abstract
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
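To make the idea of combining $k$-means clustering with sensitivity sampling concrete, the following is a minimal sketch of a generic clustering-based sensitivity sampler. It is not the paper's exact algorithm: it ignores the model loss and the Hölder constant $\lambda$, and it uses a common textbook sensitivity score (a point's share of the $k$-means cost plus a uniform share within its cluster). The function names `kmeans` and `sensitivity_sample`, the plain Lloyd iterations, and the parameter `m` (sample size) are assumptions made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd iterations (illustrative; any k-means solver would do)."""
    rng = rng or random.Random(0)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Recompute assignments so they match the final centers.
    assign = [nearest(p, centers) for p in points]
    return centers, assign


def sensitivity_sample(points, k, m, seed=0):
    """Sample m indices with importance weights (hypothetical helper).

    Score of a point = its share of the k-means cost
                     + 1 / (size of its cluster).
    Weights are inverse-probability weights, so that
    sum(w_i * f(x_i)) is an unbiased estimate of sum(f(x)) over all points.
    """
    rng = random.Random(seed)
    centers, assign = kmeans(points, k, rng=rng)
    costs = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    total = sum(costs)
    sizes = [assign.count(j) for j in range(k)]
    scores = [
        (c / total if total > 0 else 0.0) + 1.0 / sizes[a]
        for c, a in zip(costs, assign)
    ]
    norm = sum(scores)
    probs = [s / norm for s in scores]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights
```

With the returned indices and weights, the average of a per-point quantity `f` over the full dataset can be estimated as `sum(w * f(points[i]) for i, w in zip(idx, weights)) / len(points)`. The cluster-size term in the score ensures that points in small, tight clusters still have a reasonable chance of being selected.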
