SAVA: Scalable Learning-Agnostic Data Valuation

Samuel Kessler, Tam Le, Vu Nguyen

TL;DR

This work addresses the challenge of valuing data for training large models when training sets contain noisy artifacts by casting data valuation as a transport-based discrepancy between a noisy training distribution $\mu_t$ and a clean validation distribution $\mu_v$. It introduces SAVA, a scalable variant of LAVA that performs multiple small OT computations on data batches within a hierarchical OT framework, reducing memory usage from $O(N^2)$ to batch-scale costs while preserving valuation quality. The authors provide refined theoretical results for entropic regularization in OT gradients and demonstrate that SAVA scales to millions of data points with competitive performance on corruption-detection and data-pruning tasks, notably on CIFAR-10 with various corruptions and the large Clothing1M dataset. The approach enables practical OT-based data valuation for real-world large-scale datasets, offering efficiency gains and robust data selection without model-specific dependencies. Overall, SAVA advances data valuation by delivering scalable, model-agnostic data pruning that preserves valuation fidelity on massive web-scraped datasets.
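
To make the memory claim concrete, the arithmetic below compares one dense cost matrix over the full dataset with one over a single pair of batches. It is a rough sketch only: the dataset size, the batch size of 1024, float32 storage, and the assumption that the dense cost matrix is the dominant memory term are all illustrative choices, not figures from the paper.

```python
def cost_matrix_bytes(n_rows, n_cols, bytes_per_entry=4):
    """Bytes needed for a dense n_rows x n_cols cost matrix (float32 by default)."""
    return n_rows * n_cols * bytes_per_entry

# Hypothetical sizes, chosen only for illustration.
full = cost_matrix_bytes(1_000_000, 1_000_000)  # N = 10^6 points on both sides
batch = cost_matrix_bytes(1024, 1024)           # one pair of batches of 1024 points

print(f"full: {full / 1e12:.1f} TB, per batch pair: {batch / 2**20:.1f} MiB")
# prints "full: 4.0 TB, per batch pair: 4.0 MiB"
```

Only one batch-pair matrix has to be resident at a time, so peak memory is set by the batch size rather than by $N$.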

Abstract

Selecting data for training machine learning models is crucial since large, web-scraped, real datasets contain noisy artifacts that affect the quality and relevance of individual data points. These noisy artifacts will impact model performance. We formulate this problem as a data valuation task, assigning a value to data points in the training set according to how similar or dissimilar they are to a clean and curated validation set. Recently, LAVA demonstrated the use of optimal transport (OT) between a large noisy training dataset and a clean validation set to value training data efficiently, without depending on model performance. However, the LAVA algorithm requires the entire dataset as input, which limits its application to larger datasets. Inspired by the scalability of stochastic (gradient) approaches, which carry out computations on batches of data points instead of the entire dataset, we analogously propose SAVA, a scalable variant of LAVA that performs its computation on batches of data points. Intuitively, SAVA follows the same scheme as LAVA, which leverages hierarchically defined OT for data valuation. However, while LAVA processes the whole dataset, SAVA divides the dataset into batches of data points and carries out the OT problem computation on those batches. Moreover, our theoretical derivations on the trade-off of using entropic regularization for OT problems include refinements of prior work. We perform extensive experiments to demonstrate that SAVA can scale to large datasets with millions of data points and does not trade off data valuation performance.
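
The batch-wise scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It makes several assumptions that are not stated in this summary: a squared Euclidean ground cost on features only (the paper uses a feature-label distance), uniform weights on points and batches, a hand-written log-domain Sinkhorn solver, and a per-point score formed as the outer-plan-weighted sum of calibrated dual potentials. The names `sinkhorn`, `calibrated`, and `batchwise_values` are hypothetical.

```python
import numpy as np

def _logsumexp(m, axis):
    top = m.max(axis=axis, keepdims=True)
    return np.squeeze(top, axis=axis) + np.log(np.exp(m - top).sum(axis=axis))

def sinkhorn(cost, a, b, eps=1.0, n_iters=200):
    """Log-domain Sinkhorn. Returns (plan, row potential f, transport cost)."""
    f, g = np.zeros(len(a)), np.zeros(len(b))
    for _ in range(n_iters):
        f = -eps * _logsumexp((g - cost) / eps + np.log(b), axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + np.log(a)[:, None], axis=0)
    plan = a[:, None] * b * np.exp((f[:, None] + g - cost) / eps)
    return plan, f, float((plan * cost).sum())

def calibrated(f):
    """Each potential minus the mean of the other potentials in the batch."""
    return f - (f.sum() - f) / (len(f) - 1)

def batchwise_values(train_batches, val_batches, eps=1.0):
    """Assumed scoring rule: outer-plan-weighted sum of per-batch calibrated potentials."""
    kt, kv = len(train_batches), len(val_batches)
    dist = np.zeros((kt, kv))
    grads = [[None] * kv for _ in range(kt)]
    # Inner level: one small OT problem per (training batch, validation batch) pair.
    for i, bi in enumerate(train_batches):
        for j, bj in enumerate(val_batches):
            cost = ((bi[:, None, :] - bj[None, :, :]) ** 2).sum(-1)
            a = np.full(len(bi), 1.0 / len(bi))
            b = np.full(len(bj), 1.0 / len(bj))
            _, f, dist[i, j] = sinkhorn(cost, a, b, eps)
            grads[i][j] = calibrated(f)
    # Outer level: OT between batches, using the batch-to-batch distances as cost.
    outer, _, _ = sinkhorn(dist, np.full(kt, 1.0 / kt), np.full(kv, 1.0 / kv), eps)
    return np.concatenate([
        sum(outer[i, j] * grads[i][j] for j in range(kv)) for i in range(kt)
    ])

# Toy usage: 16 training points in 2 batches, one of them shifted far away.
rng = np.random.default_rng(0)
train = rng.normal(size=(16, 2))
train[5] = [6.0, 6.0]  # a single hypothetical corrupted point
val = rng.normal(size=(16, 2))
scores = batchwise_values(np.split(train, 2), np.split(val, 2))
print(int(np.argmax(scores)))  # index of the highest-valued (most suspicious) point
```

Each inner problem only touches one pair of batches, so no cost matrix over the full dataset is ever formed.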

Paper Structure (46 sections, 6 theorems, 32 equations, 12 figures, 2 tables, 1 algorithm)

This paper contains 46 sections, 6 theorems, 32 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

The calibrated gradient for a data point $z_k$ in the batch $B_i$ of $\mathbb{D}_t$ is computed from the calibrated gradients of OT for measures on batches; the two displayed equations are not reproduced here.
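
One plausible reading of this lemma, pieced together from the Figure 1 caption (a sum over validation batches, the batch-level plan $\bar{\pi}^*$, and the per-batch OT distances), is sketched below. It is an assumption rather than the paper's statement: the weighting by $\bar{\pi}^*_{ij}$, the dual potential $f^*$ of the batch-level OT problem, and the LAVA-style calibration over the other points in the batch are all filled in for illustration.

```latex
% Assumed form, reconstructed from the Figure 1 caption; not copied from the paper.
s_k \;=\; \sum_{j} \bar{\pi}^*_{ij}(\bar{\mu}_t, \bar{\mu}_v)\,
      \frac{\partial\, \mathrm{OT}(\mu_{B_i}, \mu_{B'_j})}{\partial \mu_{B_i}(z_k)},
\qquad
\frac{\partial\, \mathrm{OT}(\mu_{B_i}, \mu_{B'_j})}{\partial \mu_{B_i}(z_k)}
  \;=\; f^*_k \;-\; \frac{1}{|B_i| - 1} \sum_{l \neq k} f^*_l .
```

Under this reading, $i$ is the index of the training batch containing $z_k$, and $j$ ranges over the validation batches.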

Figures (12)

  • Figure 1: Overview of the proposed SAVA method. On the left-hand side, SAVA values data points in a noisy training dataset by comparing them to a clean validation dataset. SAVA performs scalable data valuation by solving multiple cheap and small OT problems on batches of data points (on the right-hand side). Notation in orange denotes OT distances and plans over training and validation batches, while notation in green denotes OT distances over data points in a batch. $\text{OT}(\bar{\mu}_t, \bar{\mu}_v)$ denotes the OT distance between training and validation batches and $\bar{\pi}^*(\bar{\mu}_t, \bar{\mu}_v)$ is the associated OT plan in Eq. (\ref{eq:hot_plan}). $\text{OT}(\mu_{B_i}, \mu_{B'_{j}})$ is the OT distance between the batch $B_i$ from the training set and the batch $B'_j$ in the validation set, where we use the feature-label distance in Eq. (\ref{eq:label_to_label}) as the ground cost for labeled data points in these batches. $s_k$ is SAVA's final valuation score for the labeled training data point $z_k$ in Eq. (\ref{eq:HOT_data_point_tr}). The hatched box denotes the summation over the validation batches to value the data point $z_k$. We provide a visualization of these artifacts generated by SAVA in Figure \ref{fig:sava_alg_artifacts}.
  • Figure 2: SAVA can value the full CIFAR-10 dataset with various corruptions, while LAVA runs out of memory (OOM). We sort training examples by the highest OT gradients in Eq. (\ref{eq_lava_calibrated_gradient}) and Eq. (\ref{eq:HOT_data_point_tr}) for LAVA and SAVA respectively, and use the fraction of corrupted data recovered for a prefix of size $N / 4$ as the detection rate (where $N$ is the training set size). The green star symbol ($\bigstar$) denotes the point at which LAVA is unable to continue valuing training data due to GPU out-of-memory (OOM) errors.
  • Figure 3: SAVA can scale to a large web-scraped dataset. We use SAVA and other baselines to value data points and then prune a certain percentage of the noisy training set. The resulting dataset is used for training a classifier.
  • Figure 4: Examples of the data corruptions used in our experimental setup: CIFAR-10 images with noisy-feature, trojan-square, and poison-frog corruptions, respectively.
  • Figure 5: Data value rankings for various methods for the $10\%$ poison frogs corruption. The number of corrupt data points in the prefix determines the detection rate. The black dashed line represents the $N / 4$ prefix, which is used for calculating the detection rates in Figures \ref{fig:cifar10_corruptions_detection} and \ref{fig:cifar10_corruptions_prune}.
  • ...and 7 more figures
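
The detection rate described in the Figure 2 caption (sort by highest score, take a prefix of size $N / 4$, report the fraction of corrupted points recovered) can be computed as in the sketch below. The function name `detection_rate`, its arguments, and the toy scores are hypothetical; the handling of ties and of a dataset with no corrupted points is an assumption.

```python
import numpy as np

def detection_rate(scores, is_corrupt, prefix_frac=0.25):
    """Fraction of corrupted points found among the top `prefix_frac` highest scores."""
    scores = np.asarray(scores, dtype=float)
    is_corrupt = np.asarray(is_corrupt, dtype=bool)
    n_prefix = int(len(scores) * prefix_frac)
    # Highest scores first; a stable sort keeps the original order among ties.
    order = np.argsort(-scores, kind="stable")
    n_corrupt = int(is_corrupt.sum())
    if n_corrupt == 0:
        return float("nan")  # assumption: undefined when nothing is corrupted
    return float(is_corrupt[order[:n_prefix]].sum() / n_corrupt)

# Toy example with N = 8, so the prefix holds the 2 highest-scored points.
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.4, 0.0]
is_corrupt = [True, False, False, False, True, False, False, False]
print(detection_rate(scores, is_corrupt))  # prints 0.5
```

In the toy example the prefix contains points 0 and 2, and only point 0 is corrupted, so one of the two corrupted points is recovered.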

Theorems & Definitions (10)

  • Lemma 1
  • Theorem 2: Refined Theorem 2 of LAVA (Just et al., 2023)
  • proof
  • Lemma 3: HOT with entropic regularized OT over batches
  • proof
  • Theorem 4: restated Theorem 2 of LAVA (Just et al., 2023)
  • Theorem 5: restated Theorem 2 in the main paper
  • proof
  • Lemma 6: restated Lemma 3 in the main paper
  • proof