Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Jiachen T. Wang; Tianji Yang; James Zou; Yongchan Kwon; Ruoxi Jia

TL;DR

This work interrogates the reliability of Data Shapley for data selection. It shows that, without structural assumptions on the utility function v, Shapley-based selection can be no better than random guessing, because the map from v to its Shapley values φ(v) is not injective. It identifies a broad class of monotonically transformed modular (MTM) functions as a sufficient condition for Shapley-optimal data selection and proposes a practical heuristic that estimates how well v can be approximated by an MTM function, using the MTM-fitting residual to predict effectiveness. The authors connect MTM approximation quality to a ρ-consistency index and validate the theory with experiments showing that Data Shapley performs well on heterogeneous data but poorly on homogeneous or clean data, with the MTM residual serving as a practical predictor. The results advance understanding of when data valuation via Shapley is meaningful and offer a principled way to anticipate its usefulness in real-world data selection tasks.
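The non-injectivity point can be made concrete with a toy example. The sketch below is not taken from the paper: the three-point dataset, the weights, and the `bump` term are hypothetical choices made only for illustration. It builds two utility functions with identical Shapley values whose best size-2 subsets differ, so no rule that sees only φ(v) can pick the right subset for both.

```python
from fractions import Fraction
from itertools import combinations, permutations


def shapley(v, n):
    """Exact Shapley values: average marginal contribution over all n! orderings."""
    phi = [Fraction(0)] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seen = frozenset()
        for i in order:
            phi[i] += Fraction(v(seen | {i}) - v(seen))
            seen = seen | {i}
    return [p / len(orders) for p in phi]


def best_subset(v, n, k):
    """Brute-force search for the size-k subset with the highest utility."""
    return max(combinations(range(n), k), key=lambda S: v(frozenset(S)))


# Hypothetical toy setup: three data points with additive scores 3, 2, 1.
weights = (3, 2, 1)


def v1(S):
    # Modular (additive) utility: its Shapley values are exactly the weights.
    return sum(weights[i] for i in S)


# A perturbation whose Shapley values are all zero (chosen by hand for n = 3):
# it rewards point 0 alone and the pair {1, 2} by the same amount.
bump = {frozenset({0}): 3, frozenset({1, 2}): 3}


def v2(S):
    return v1(S) + bump.get(frozenset(S), 0)


phi1 = shapley(v1, 3)
phi2 = shapley(v2, 3)
print("Shapley values under v1:", phi1)
print("Shapley values under v2:", phi2)
print("Best size-2 subset under v1:", best_subset(v1, 3, 2))
print("Best size-2 subset under v2:", best_subset(v2, 3, 2))
```

Both utilities yield the same Shapley values, so a top-2 selection by Shapley score returns points 0 and 1 in both cases, yet under `v2` the pair of points 1 and 2 has strictly higher utility.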

Abstract

Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.
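As a rough illustration of the positive result, the sketch below assumes that a monotonically transformed modular function has the form v(S) = f(Σ_{i∈S} w_i) with f increasing; that reading follows the name, and the paper's Definition 5 remains the authoritative statement. The weights and the choice f(x) = x³ are hypothetical. The code checks by brute force that, for this toy utility, the top-k points by Shapley value form an optimal size-k subset for every k.

```python
from fractions import Fraction
from itertools import combinations, permutations


def shapley(v, n):
    """Exact Shapley values: average marginal contribution over all n! orderings."""
    phi = [Fraction(0)] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seen = frozenset()
        for i in order:
            phi[i] += Fraction(v(seen | {i}) - v(seen))
            seen = seen | {i}
    return [p / len(orders) for p in phi]


# Hypothetical per-point scores; distinct values avoid ties in the ranking.
weights = (4, -1, 2, 7, 1)


def mtm_utility(S):
    # Assumed MTM form: a modular sum passed through an increasing map (x ** 3).
    return sum(weights[i] for i in S) ** 3


def shapley_top_k(v, n, k):
    """Indices of the k points with the largest Shapley values."""
    phi = shapley(v, n)
    return set(sorted(range(n), key=lambda i: phi[i], reverse=True)[:k])


def brute_force_best(v, n, k):
    """Size-k subset with the highest utility, found by exhaustive search."""
    return set(max(combinations(range(n), k), key=lambda S: v(frozenset(S))))


n = len(weights)
matches = all(
    shapley_top_k(mtm_utility, n, k) == brute_force_best(mtm_utility, n, k)
    for k in range(1, n)
)
print("Shapley top-k is optimal for every k:", matches)
```

The exhaustive search is only feasible for tiny n; it is used here purely to verify the selection, not as a practical valuation method.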

Paper Structure (27 sections, 12 theorems, 27 equations, 10 figures, 2 tables)

Introduction
Background
Why Might Data Shapley Fail in Data Selection Tasks?
A Hypothesis Testing Framework for Comparing Dataset Utilities
Analysis
When does Data Shapley Select Good Datasets?
Illustrating Example: Heterogeneous-Quality Datasets
A Class of "Shapley-effective" Utility Functions
A Heuristic for Predicting Data Shapley's Optimality for General Utility Functions
When Is an MTM Function a Good Approximation?
Experiments
When does Data Shapley work well or poorly for data selection?
MTM Fitting Residual vs Data Selection Performance
Conclusion & Limitations
Extended Related Works
...and 12 more sections

Key Result

Theorem 2

For the utility comparison hypothesis testing problem formulated in (eq:hypo), any Shapley value-based hypothesis test $\mathbf{h}$ is constrained to:

Figures (10)

Figure 1: Validation accuracy curves as a function of the top $p\%$ most valuable data points added. The higher, the better. 'Random (average)' and 'Random (maximum)' sample different size-$k$ subsets uniformly at random and report their average and maximum utility, respectively. Data Shapley's error bar indicates the standard deviation across 5 independent runs, where the randomness comes from the permutation sampling of Data Shapley scores.
Figure 2: We investigate the correlation between data selection performance (measured by the normalized utility difference) and the normalized fitting residual of the MTM function. For each dataset, we look at size-$k$ data selection performance with $k \in \{0.1n, 0.3n, 0.5n, 0.7n\}$. Each point represents the results on a dataset (with different noise-flipping ratios).
Figure 3: Additional results when using logistic regression classifiers. Validation accuracy curves as a function of the most valuable data points added. The higher, the better. 'Random (average)' and 'Random (maximum)' sample different size-$k$ subsets uniformly at random and report their average and maximum utility, respectively. Data Shapley's error bar indicates the standard deviation across 5 independent runs, where the randomness comes from the permutation sampling of Data Shapley scores.
Figure 4: Additional results when using MLP classifiers. Validation accuracy curves as a function of the most valuable data points added. The higher, the better. 'Random (average)' and 'Random (maximum)' sample different size-$k$ subsets uniformly at random and report their average and maximum utility, respectively. Data Shapley's error bar indicates the standard deviation across 5 independent runs, where the randomness comes from the permutation sampling of Data Shapley scores.
Figure 5: Results on additional datasets for the correlation between $\bar{\mathcal{R}}_{v}$ and data selection performance. We investigate the correlation between data selection performance and the normalized fitting residual of the MTM function. For each dataset, we look at size-$k$ data selection performance with $k \in \{0.1n, 0.3n, 0.5n, 0.7n\}$. Each point represents the results on a dataset (with different noise-flipping ratios).
...and 5 more figures

Theorems & Definitions (30)

Definition 1: Shapley value (Shapley, 1953)
Remark 1
Remark 2: All the information available to $\mathbf{h}$ is $\phi(v)$
Theorem 2
Theorem 3
Remark 3: Example of utility functions with identical Shapley values
Remark 4
Theorem 4
Definition 5: Monotonically Transformed Modular Function (MTM)
Remark 5
...and 20 more
