Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

TL;DR

The paper tackles data-efficient learning by selecting small, representative subsets via a clustering-based sensitivity sampling framework. It combines a full-data $k$-means clustering with sub-sample sensitivity sampling under a Hölder-continuous loss assumption to produce a weighted coreset whose average loss approximates the full dataset within a factor of $1\pm\varepsilon$ plus an additive term tied to the clustering cost. The approach is proven to be robust to outliers through a $(k,z)$-clustering objective and extended to regression tasks, with practical algorithms that operate in a small number of adaptive rounds. Empirical results demonstrate gains in fine-tuning foundation models and image-classification tasks, while achieving competitive performance for linear regression with significantly reduced computation. Overall, the method offers theoretically grounded, scalable data selection that improves training efficiency for large models and complex datasets.

Abstract

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
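The sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a $k$-means clustering of the embeddings is already available, that the loss has been queried once per cluster (at its center), and that each point is scored by its center's loss plus $\lambda$ times its distance to the center raised to the power $z$. The function name `sensitivity_sample` and the parameters `center_loss`, `lam`, `z`, and `m` are hypothetical names chosen for this illustration.

```python
import math
import random


def sensitivity_sample(embeddings, assignment, centers, center_loss,
                       m, lam=1.0, z=2, seed=0):
    """Draw m weighted samples; returns a list of (index, weight) pairs.

    Assumed scoring rule (illustrative): score(e) = loss at e's cluster
    center + lam * ||E(e) - center||^z. Points are sampled with
    probability proportional to their score and weighted by 1 / (m * p),
    which makes the weighted sum an unbiased estimate of the total score.
    """
    rng = random.Random(seed)
    n = len(embeddings)

    # Proxy score for each point, using only the k loss queries at centers.
    scores = []
    for x, c in zip(embeddings, assignment):
        dist = math.dist(x, centers[c])
        scores.append(center_loss[c] + lam * dist ** z)

    total = sum(scores)
    if total > 0:
        probs = [s / total for s in scores]
    else:
        # Degenerate case (all scores zero): fall back to uniform sampling.
        probs = [1.0 / n] * n

    # Sample with replacement, proportionally to the scores.
    picks = rng.choices(range(n), weights=probs, k=m)
    return [(i, 1.0 / (m * probs[i])) for i in picks]


# Toy usage: three points, two clusters, each point sitting on its center.
embeddings = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
assignment = [0, 0, 1]
centers = [(0.0, 0.0), (5.0, 5.0)]
center_loss = [1.0, 3.0]  # loss queried once per cluster center

sample = sensitivity_sample(embeddings, assignment, centers, center_loss, m=4)
# Weighted estimate of the total loss, using the center loss as each point's loss.
estimate = sum(w * center_loss[assignment[i]] for i, w in sample)
```

In this toy case every point coincides with its center, so the proxy score equals the point's loss and the weighted estimate recovers the total loss up to floating-point rounding. With real data, the distance term covers the gap between a point's loss and its center's loss, in the spirit of the Hölder-continuity assumption.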

Paper Structure

This paper contains 45 sections, 7 theorems, 21 equations, 9 figures, 2 tables, 3 algorithms.

Key Result

Theorem 2

Let $\varepsilon, z > 0$ and $\Lambda \in \mathbb{R}^k$. Let $\mathcal{D}$ be a dataset and $\ell$ a loss function that is $(z, \Lambda)$-well-behaved with respect to an embedding $E$ and a clustering $(C_1, \dots, C_k)$ into $k$ clusters. Then, there exists an algorithm that makes $k$ queries to $\ell$ … with constant probability, where $\Phi^\Lambda_{\mathcal{C}, z}(\mathcal{D}) = \sum_{i=1}^k \Lambda \dots$

Figures (9)

  • Figure 1: Distribution of loss to random point vs center of corresponding cluster for the WMT T2T EnDe translation dataset (Bojar et al., 2014) using BERT embeddings (Devlin et al., 2018).
  • Figure 2: Experimental results on the WMT T2T EnDe translation task dataset. We report the accuracy (left) and BLEU score (right) of the different methods used: our method (Sensitivity) compared to Diversity (similar to Sener and Savarese, 2018), Uniform cleaned (Random-Deduped), and Uniform (Random). Each method is required to produce a sample of roughly $1\%$ of the whole dataset.
  • Figure 3: Experimental results for selecting $k=2000$ data points and different datasets. For each algorithm, we show the accuracy on the validation dataset.
  • Figure 4: Plots of experimental results for different datasets. For each algorithm, we plot the accuracy on the validation dataset for different values of $k$ (number of samples). We also provide a runtime comparison on CIFAR10. We independently run each data point $100$ times, and present the mean with bands of one standard deviation.
  • Figure 5: Experimental results for selecting $k=2000$ data points and different datasets. For each algorithm, we show the accuracy on the validation dataset, averaged over 100 runs.
  • ...and 4 more figures

Theorems & Definitions (15)

  • Theorem 2
  • Definition 3: Data Selection (Sener and Savarese, 2018)
  • Definition 4: $r$-Adaptive Active learning under well-behaved norm
  • Theorem 5
  • Remark 6
  • Theorem 7
  • Lemma 8
  • Definition 9
  • Theorem 11
  • Theorem 12: Bernstein's inequality
  • ...and 5 more