Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning

Yongchan Kwon, James Zou

TL;DR

Beta Shapley generalizes Data Shapley by relaxing the efficiency axiom, yielding a semivalue-based framework that puts more weight on small-cardinality marginal contributions, which are less noisy. Its Beta-weighted scheme has a closed form, includes Data Shapley as a special case, and admits efficient Monte Carlo estimation. The approach improves performance on noisy-label detection, learning with subsamples, and data-point addition/removal, often outperforming state-of-the-art methods. The result is a principled, scalable route to robust data valuation, with practical implications for data curation and sample-efficient learning.

Abstract

Data Shapley has recently been proposed as a principled framework to quantify the contribution of individual data points in machine learning. It can effectively identify helpful or harmful data points for a learning algorithm. In this paper, we propose Beta Shapley, which is a substantial generalization of Data Shapley. Beta Shapley arises naturally by relaxing the efficiency axiom of the Shapley value, which is not critical for machine learning settings. Beta Shapley unifies several popular data valuation methods and includes Data Shapley as a special case. Moreover, we prove that Beta Shapley has several desirable statistical properties and propose efficient algorithms to estimate it. We demonstrate that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks, such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal has the largest positive or negative impact on the model.
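To make the estimation idea concrete, here is a minimal Monte Carlo sketch for a generic semivalue. It is an illustration under stated assumptions, not the paper's own algorithm: the function name `mc_semivalue`, its parameters, and the toy additive utility are all hypothetical. The sketch assumes the caller supplies a probability for each cardinality $j = 1, \dots, n$; it samples a cardinality from that distribution, draws a uniformly random subset of $j - 1$ other points, and averages the resulting marginal contributions.

```python
import random


def mc_semivalue(utility, data, target, card_probs, n_samples=1000, seed=0):
    """Monte Carlo estimate of a semivalue for `target` (hypothetical helper).

    utility    : function mapping a list of points to a float
    data       : list of all points, including `target`
    card_probs : card_probs[j - 1] is the probability of cardinality j
                 (assumed to sum to 1 over j = 1..len(data))
    """
    rng = random.Random(seed)
    others = [z for z in data if z != target]
    cardinalities = list(range(1, len(data) + 1))
    total = 0.0
    for _ in range(n_samples):
        # Pick the cardinality j, then a uniform subset of j - 1 other points.
        j = rng.choices(cardinalities, weights=card_probs, k=1)[0]
        subset = rng.sample(others, j - 1)
        # Marginal contribution of `target` to this subset.
        total += utility(subset + [target]) - utility(subset)
    return total / n_samples


# Toy check with an additive utility: every marginal contribution of a point
# equals its own worth, so the estimate should match that worth.
worth = {"a": 1.0, "b": -2.0, "c": 0.5, "d": 3.0}


def additive_utility(subset):
    return sum(worth[z] for z in subset)


points = list(worth)
uniform = [1.0 / len(points)] * len(points)  # uniform weights over cardinalities
estimate_b = mc_semivalue(additive_utility, points, "b", uniform)
print(estimate_b)
```

With uniform cardinality probabilities this corresponds to the Data Shapley special case; a weighting that favors small cardinalities can be passed in through `card_probs` without changing the sampler.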

Paper Structure

This paper contains 35 sections, 5 theorems, 28 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the cardinality $j = o(n^{1/2})$ and assume that $\lim_{j \to \infty}\zeta_{j}/(j\zeta_{1})$ is bounded. Then, $(j^2 \zeta_{1}/n) ^{-1} \mathrm{Var}(\Delta_j (z^*; U, \mathfrak{D})) \to 1$ as $n$ increases.
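
Read as an approximation for large $n$, the limit says that the variance of the marginal contribution behaves like $j^2 \zeta_{1} / n$. The display below only restates the theorem in that form; $\zeta_{1}$ and $\zeta_{j}$ are defined in the full paper and are not reproduced in this summary.

$$
\mathrm{Var}(\Delta_j (z^*; U, \mathfrak{D})) = \frac{j^2 \zeta_{1}}{n}\,(1 + o(1)) \quad \text{as } n \to \infty .
$$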

Figures (14)

  • Figure 1: The signal-to-noise ratio of $\Delta_j (z^*; U, \mathfrak{D})$ as a function of the cardinality $j$ when $n=500$ in (left) regression and (right) classification settings. The data are generated from a generalized linear model. The signal-to-noise ratio generally decreases as the cardinality $j$ increases, so marginal contributions at large cardinalities are more likely to be perturbed by noise.
  • Figure 2: Illustrations of the marginal contribution $\Delta_j (z^*; U, \mathcal{D})$ as a function of the cardinality $j$ on the four datasets. Colors distinguish clean data points (yellow) from noisy data points (blue). When the cardinality $j$ is large, the marginal contributions of the two groups become similar, so it is difficult to determine whether a data point is noisy from $\Delta_j (z^*; U, \mathcal{D})$. Additional results on other datasets are provided in the Appendix.
  • Figure 3: Illustration of the normalized weight $\tilde{w}_{\alpha,\beta}^{(n)}(j)$ for various pairs of $(\alpha,\beta)$ when $n=200$. Each color indicates a different hyperparameter pair $(\alpha,\beta)$.
  • Figure 4: A boxplot of the F1-score of the data valuation methods on the CIFAR100 test dataset. The red dot indicates the mean F1-score. Beta Shapley variants that focus on small cardinalities detect mislabeled data points better than the baseline methods.
  • Figure 5: Examples of mislabeled images in the CIFAR100 test dataset. The corrected label suggested by Northcutt et al. (2021) is provided for comparison. Beta(16,1) values for the mislabeled samples are negative, meaning that this type of labeling error can harm the model.
  • ...and 9 more figures
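
The normalized weights plotted in Figure 3 can be sketched numerically. The formula is not stated in this summary, so the code below rests on an assumption about its form, namely $\tilde{w}_{\alpha,\beta}^{(n)}(j) = \binom{n-1}{j-1}\,\mathrm{Beta}(j+\beta-1,\, n-j+\alpha)/\mathrm{Beta}(\alpha,\beta)$; the function names are hypothetical. Under that assumption, $(\alpha,\beta)=(1,1)$ gives uniform weights $1/n$, consistent with Data Shapley being a special case, and a large $\alpha$ with $\beta=1$ shifts weight toward small cardinalities.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma for stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def normalized_beta_weights(n, alpha, beta):
    """Assumed normalized weight for each cardinality j = 1..n.

    Assumption (not stated in this summary):
        w(j) = C(n-1, j-1) * Beta(j + beta - 1, n - j + alpha) / Beta(alpha, beta)
    """
    weights = []
    for j in range(1, n + 1):
        # log C(n-1, j-1) = log Gamma(n) - log Gamma(j) - log Gamma(n - j + 1)
        log_binom = lgamma(n) - lgamma(j) - lgamma(n - j + 1)
        log_w = (log_binom
                 + log_beta(j + beta - 1, n - j + alpha)
                 - log_beta(alpha, beta))
        weights.append(exp(log_w))
    return weights


n = 200  # same n as in the Figure 3 caption
uniform_like = normalized_beta_weights(n, 1, 1)     # Data Shapley case
small_card = normalized_beta_weights(n, 16, 1)      # emphasizes small j

print(uniform_like[:3])
print(small_card[:3], sum(small_card))
```

Working in log space avoids overflow in the binomial coefficient for moderately large $n$; the resulting list can be passed directly as cardinality probabilities to a sampling-based estimator.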

Theorems & Definitions (13)

  • Definition 1: Marginal contribution
  • Theorem 1: Asymptotic distribution of the marginal contribution
  • Remark 1
  • Definition 2: Semivalue
  • Theorem 2: Representation of semivalues
  • Proposition 3
  • Theorem 4: Informal
  • Theorem 5: Formal version of Theorem \ref{thm:prob_shap_asymptotic}
  • Proof of Equation \ref{eqn:general_weights}
  • Proof of Theorem \ref{thm:u_stat_for_large_cardinality}
  • ...and 3 more