Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits
Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia
TL;DR
This work examines how reliable Data Shapley is for data selection. It shows that, without structural assumptions on the utility function v, Shapley-based selection can be no better than random guessing, because the map from v to its Shapley values φ(v) is not injective. It identifies a broad class of utility functions, monotonically transformed modular (MTM) functions, as a sufficient condition for Shapley-based selection to be optimal, and proposes a practical heuristic that estimates how well v can be approximated by an MTM function, using the MTM-fitting residual to predict effectiveness. The authors connect MTM approximation quality to a ρ-consistency index and validate the theory with experiments showing that Data Shapley performs well on heterogeneous data but poorly on homogeneous or clean data, with the MTM residual serving as a practical predictor. The results clarify when Shapley-based data valuation is meaningful and offer a principled way to anticipate its usefulness in real-world data selection tasks.
Abstract
Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.
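The two claims above can be illustrated on toy utilities. The following is a minimal sketch, not the authors' construction or experimental setup. It assumes that a monotonically transformed modular function has the form v(S) = f(sum of w_i for i in S) with f increasing. The weights, the choice f = log1p, the hand-built utility table, and all function names are hypothetical and chosen only for illustration. Shapley values are computed exactly by enumerating all orderings, which is feasible only for a handful of points.

```python
from itertools import combinations, permutations
from math import log1p


def shapley_values(n, utility):
    """Exact Shapley values: average marginal contribution over all n! orderings."""
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        coalition = frozenset()
        for i in order:
            with_i = coalition | {i}
            totals[i] += utility(with_i) - utility(coalition)
            coalition = with_i
    return [t / len(orderings) for t in totals]


def top_k(scores, k):
    """Indices of the k highest-scoring points."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def best_subset(n, k, utility):
    """Brute-force optimal size-k subset under the given utility."""
    return max((frozenset(c) for c in combinations(range(n), k)), key=utility)


# Case 1: an assumed MTM utility, v(S) = log1p(sum of hypothetical weights).
weights = [0.5, 2.0, 1.0, 3.0]


def mtm_utility(subset):
    return log1p(sum(weights[i] for i in subset))


mtm_phi = shapley_values(4, mtm_utility)
mtm_pick = top_k(mtm_phi, 2)
mtm_best = best_subset(4, 2, mtm_utility)

# Case 2: two utilities on three points that share the same Shapley values.
# `modular` is additive with hypothetical weights (3, 2, 1); `perturbed` adds a
# hand-built game whose Shapley values are all zero, so the valuation cannot
# tell the two utilities apart even though their best pairs differ.
modular_table = {
    frozenset(s): float(sum((3, 2, 1)[i] for i in s))
    for k in range(4)
    for s in combinations(range(3), k)
}
perturbed_table = {
    frozenset(): 0.0,
    frozenset({0}): 6.0,
    frozenset({1}): 2.0,
    frozenset({2}): -2.0,
    frozenset({0, 1}): 2.0,
    frozenset({0, 2}): 4.0,
    frozenset({1, 2}): 6.0,
    frozenset({0, 1, 2}): 6.0,
}

modular_phi = shapley_values(3, modular_table.__getitem__)
perturbed_phi = shapley_values(3, perturbed_table.__getitem__)
shapley_pick = top_k(perturbed_phi, 2)
perturbed_best = best_subset(3, 2, perturbed_table.__getitem__)

print("MTM: Shapley pick", sorted(mtm_pick), "optimal", sorted(mtm_best))
print("Shapley values:", modular_phi, perturbed_phi)
print("Perturbed: Shapley pick", sorted(shapley_pick), "optimal", sorted(perturbed_best))
```

In the first case the Shapley ranking follows the weights, so the top-2 pick coincides with the brute-force optimum. In the second case both utilities yield identical Shapley values, yet the pair they recommend is the lowest-utility pair under the perturbed utility, which is the kind of ambiguity that motivates restricting the utility class.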
