CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning

Huaiguang Cai

TL;DR

CHG Shapley tackles data valuation and subset selection with a gradient- and hardness-informed utility, enabling efficient Shapley value computation. It defines the CHG score to approximate the impact of a training subset and derives a closed-form Shapley value for each data point, reducing the computational cost to that of a single training run. The method supports real-time data selection on large datasets and is evaluated in standard, noisy-label, and class-imbalanced settings. The work contributes a practical, data-centric tool for trustworthy ML, with a quadratic improvement in computational complexity over existing marginal contribution-based methods.

Abstract

Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model performance. However, the resource-intensive and time-consuming nature of repeated model retraining poses challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (compound of Hardness and Gradient) utility function, which approximates the utility of each data subset on model performance in every training epoch. By deriving the closed-form Shapley value for each data point using the CHG utility function, we reduce the computational complexity to that of a single model retraining, achieving a quadratic improvement over existing marginal contribution-based methods. We further leverage CHG Shapley for real-time data selection, conducting experiments across three settings: standard datasets, label noise datasets, and class imbalance datasets. These experiments demonstrate its effectiveness in identifying high-value and noisy data. By enabling efficient data valuation, CHG Shapley promotes trustworthy model training through a novel data-centric perspective. Our code is available at https://github.com/caihuaiguang/CHG-Shapley-for-Data-Valuation and https://github.com/caihuaiguang/CHG-Shapley-for-Data-Selection.

Paper Structure

This paper contains 23 sections, 3 theorems, 14 equations, 3 figures, 6 tables, and 2 algorithms.

Key Result

Lemma 1

Let $f$ be any differentiable function with $L$-Lipschitz continuous gradient, $\theta_x = \theta - \eta x$, and $\eta = 1/L$ (a commonly used assignment, for example in proving the convergence of gradient descent for non-convex functions to local minima [Nesterov2018LecturesOC]), $\eta$ …

Proof: By the definition of $L$-Lipschitz continuity, we have $f(\theta_x) \le f(\theta) + \langle \ldots$
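For orientation, the setup of Lemma 1 matches the textbook descent lemma for functions with $L$-Lipschitz continuous gradient. The following is a sketch of that standard argument under the lemma's stated assumptions ($\theta_x = \theta - \eta x$, $\eta = 1/L$). It is not a reproduction of the paper's statement or proof, which is cut off above and may be phrased differently.

```latex
\begin{align*}
f(\theta_x)
  &\le f(\theta) + \langle \nabla f(\theta),\, \theta_x - \theta \rangle
     + \frac{L}{2}\,\|\theta_x - \theta\|^2 \\
  &= f(\theta) - \eta\,\langle \nabla f(\theta),\, x \rangle
     + \frac{L\eta^2}{2}\,\|x\|^2 \\
  &= f(\theta) - \frac{\eta}{2}\left( 2\,\langle \nabla f(\theta),\, x \rangle - \|x\|^2 \right)
     \qquad \text{using } \eta = 1/L .
\end{align*}
```

Read this way, the decrease $f(\theta) - f(\theta_x)$ is bounded below by a quantity that depends only on the update direction $x$ and the gradient $\nabla f(\theta)$, which is the kind of gradient-based quantity a utility function can be built on.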

Figures (3)

  • Figure 1: Shapley value's calculation.
  • Figure 2: Noisy feature detection experiment on CIFAR10-embeddings dataset.
  • Figure 3: Point removal experiment on CIFAR10-embeddings dataset.

Theorems & Definitions (4)

  • Definition 1 [shapley1953value]
  • Lemma 1 [Nesterov2018LecturesOC]
  • Theorem 1
  • Theorem 2
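
The Shapley value referenced in Definition 1 can be illustrated with a minimal sketch that enumerates every subset, using the standard formula $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,[U(S \cup \{i\}) - U(S)]$. The names `shapley_values` and `toy_utility` are hypothetical, and the utility is a made-up stand-in for model performance, not the CHG utility from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, utility):
    """Exact Shapley values for points 0..n-1, given utility(frozenset) -> float."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the subset S that excludes point i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of point i when added to S
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def toy_utility(subset):
    # Hypothetical utility: points 0 and 1 are redundant (either one yields 1.0),
    # while point 2 independently adds 0.5.
    score = 0.0
    if 0 in subset or 1 in subset:
        score += 1.0
    if 2 in subset:
        score += 0.5
    return score


values = shapley_values(3, toy_utility)
print(values)
```

In this toy example the two redundant points split their shared credit, and the values sum to the utility of the full set. Each point requires $2^{n-1}$ utility evaluations here; when every evaluation is a model retraining, as in marginal contribution-based methods, this cost is what a closed-form Shapley value avoids.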