Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis; Vincent Cohen-Addad; Monika Henzinger; Sammy Jerome; Vahab Mirrokni; David Saulpic; David Woodruff; Michael Wunder

TL;DR

The paper tackles data-efficient learning by selecting small, representative subsets via a clustering-based sensitivity sampling framework. It combines a full-data $k$-means clustering with sub-sample sensitivity sampling under a Hölder-continuous loss assumption to produce a weighted coreset whose average loss approximates the full dataset within a factor of $1\pm\varepsilon$ plus an additive term tied to the clustering cost. The approach is proven to be robust to outliers through a $(k,z)$-clustering objective and extended to regression tasks, with practical algorithms that operate in a small number of adaptive rounds. Empirical results demonstrate gains in fine-tuning foundation models and image-classification tasks, while achieving competitive performance for linear regression with significantly reduced computation. Overall, the method offers theoretically grounded, scalable data selection that improves training efficiency for large models and complex datasets.
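The guarantee summarized above can be written out as a pair of displayed formulas. This is only a sketch under assumed notation: $E$ denotes the embedding, $S$ the selected subset, $w$ its weights, $n = |\mathcal{D}|$, $\lambda$ the Hölder constant, and $\Phi_k$ the clustering cost. The exact constants and the normalization of $\Phi_k$ are not stated in this summary and are assumptions here.

```latex
% Hölder continuity of the loss with respect to the embedding E
% (exponent z and constant \lambda; notation assumed for illustration)
\[
  |\ell(e) - \ell(e')| \;\le\; \lambda \, \|E(e) - E(e')\|^{z}
  \qquad \text{for all } e, e' \in \mathcal{D}.
\]

% Shape of the coreset guarantee: multiplicative (1 \pm \varepsilon) error
% plus an additive term proportional to the clustering cost \Phi_k
\[
  \left| \frac{1}{n} \sum_{e \in \mathcal{D}} \ell(e)
       - \frac{1}{n} \sum_{s \in S} w(s)\, \ell(s) \right|
  \;\le\; \varepsilon \left( \frac{1}{n} \sum_{e \in \mathcal{D}} \ell(e)
       + \lambda \, \Phi_k \right).
\]
```

Read this way, a tighter clustering of the embeddings (smaller $\Phi_k$) directly shrinks the additive error, which is why the clustering cost appears in the bound.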

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
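To make the combination of $k$-means clustering and sensitivity sampling concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's exact algorithm: the function names `kmeans` and `sensitivity_sample`, the parameters `loss_fn`, `lam`, and `m`, and the specific sampling score (Hölder constant times squared distance to the center, plus the loss queried at one representative per cluster) are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def sensitivity_sample(X, loss_fn, k, m, lam=1.0, seed=0):
    """Pick m indices with importance weights (hypothetical helper).

    loss_fn(i) returns the loss of data point i; it is queried only once
    per cluster, at the point closest to that cluster's center.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    centers, labels = kmeans(X, k, seed=seed)
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)

    # One loss query per non-empty cluster, at its most central point.
    center_loss = np.zeros(k)
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) > 0:
            rep = idx[d2[idx].argmin()]
            center_loss[j] = loss_fn(rep)

    # Assumed score: distance term plus proxy loss; epsilon avoids all zeros.
    scores = lam * d2 + center_loss[labels] + 1e-12
    probs = scores / scores.sum()

    picked = rng.choice(n, size=m, replace=True, p=probs)
    # Inverse-probability weights, scaled so that
    # sum(weights * loss[picked]) estimates the average loss over X.
    weights = 1.0 / (n * m * probs[picked])
    return picked, weights
```

With `k` clusters and `m` samples, the sketch touches the loss function `k` times during selection and returns `m` weighted points, mirroring the $k + 1/\varepsilon^2$ budget in the abstract when `m` is on the order of $1/\varepsilon^2$.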

Paper Structure (45 sections, 7 theorems, 21 equations, 9 figures, 2 tables, 3 algorithms)

Introduction
Our Approach and Contribution
Problem Formulation
Our Model
Assumptions on $\ell$
Limits to the general case
Assumption on the embeddings
Hölder Continuity assumption
Clustering Preliminaries
Algorithmic Results
Algorithm and Lower Bound for the Non-Adaptive Case
Adaptive Algorithms
1-Round Algorithm
$r$-Round Algorithm
Computing the clustering $\mathcal{C}$ and the parameter $\Lambda$
...and 30 more sections

Key Result

Theorem 2

Let $\varepsilon, z > 0$ and $\Lambda \in \mathbb{R}^k$. Let $\mathcal{D}$ be a dataset and $\ell$ a loss function that is $(z, \Lambda)$-well-behaved with respect to an embedding $E$ and a clustering $(C_1, \dots, C_k)$ into $k$ clusters. Then there exists an algorithm that makes $k$ queries to $\ell$ [...] with constant probability, where $\Phi^\Lambda_{\mathcal{C}, z}(\mathcal{D}) = \sum_{i=1}^k \Lambda \dots$

Figures (9)

Figure 1: Distribution of loss to random point vs center of corresponding cluster for the WMT T2T EnDe translation dataset (Bojar et al., 2014) using BERT embeddings (Devlin et al., 2018).
Figure 2: Experimental results on the WMT T2T EnDe translation task dataset. We report the accuracy (left) and BLEU score (right) of the different methods used: our method (Sensitivity) compared to Diversity (similar to Sener and Savarese, 2018), Uniform cleaned (Random-Deduped), and Uniform (Random). Each method is required to produce a sample of roughly $1\%$ of the whole dataset.
Figure 3: Experimental results for selecting $k=2000$ data points and different datasets. For each algorithm, we show the accuracy on the validation dataset.
Figure 4: Plots of experimental results for different datasets. For each algorithm, we plot the accuracy on the validation dataset for different values of $k$ (number of samples). We also provide a runtime comparison on CIFAR10. We independently run each data point $100$ times, and present the mean with bands of one standard deviation.
Figure 5: Experimental results for selecting $k=2000$ data points and different datasets. For each algorithm, we show the accuracy on the validation dataset, averaged over 100 runs.
...and 4 more figures

Theorems & Definitions (15)

Theorem 2
Definition 3: Data Selection (Sener and Savarese, 2018)
Definition 4: $r$-Adaptive Active learning under well-behaved norm
Theorem 5
Remark 6
Theorem 7
Lemma 8
Definition 9
Theorem 11
Theorem 12: Bernstein's inequality
...and 5 more
