DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation
Felipe Garrido-Lucero, Benjamin Heymann, Maxime Vono, Patrick Loiseau, Vianney Perchet
TL;DR
This work tackles the dataset valuation problem by leveraging Shapley values to quantify each data owner’s marginal contribution to a collaborative ML task. It introduces DU-Shapley, a discrete uniform Shapley proxy that reduces the number of required utility evaluations from exponential to linear in the number of data owners by exploiting a homogeneous, size-driven structure of the utility via a function $w(n)$. The authors provide both asymptotic convergence results and non-asymptotic error bounds, along with extensive empirical validation on synthetic and real-world datasets, showing that DU-Shapley outperforms or matches existing Monte Carlo Shapley estimators under budget constraints. The method enables efficient and scalable data valuation in settings with many data owners, while also highlighting limitations related to mean-field assumptions and suggesting paths toward a unified theory for Shapley estimators that exploit data structure. Overall, DU-Shapley offers a practical, theory-backed tool for incentive design and fair data sharing in collaborative ML applications.
Abstract
We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, with respect to a relevant pre-defined utility of a machine learning task, from aggregating an individual dataset with others. The Shapley value is a natural tool for dataset valuation because of its formal axiomatic justification, and it can be combined with Monte Carlo integration to overcome its computational intractability. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevance of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments.
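To make the cost gap concrete, the following Python sketch contrasts an exact Shapley computation with a proxy built in the spirit of the discrete uniform idea. It is illustrative only and rests on assumptions not stated in the text above: the utility is taken to be purely size-driven, u(S) = w(total number of points in S); the function `w` is a hypothetical diminishing-returns curve; and `du_proxy` assumes the proxy averages marginal gains over a coalition size k drawn uniformly from {0, ..., I-1}, with the other owners' datasets replaced by their average size. The function names and sample sizes are likewise made up for illustration.

```python
from itertools import combinations
from math import comb


def w(n):
    # Hypothetical size-driven utility with diminishing returns (an assumption).
    return n / (n + 100.0)


def exact_shapley(sizes, w):
    """Exact Shapley values for u(S) = w(sum of sizes in S).

    Enumerates all 2^(I-1) coalitions of the other owners for each owner,
    so the cost is exponential in the number of owners I.
    """
    I = len(sizes)
    values = []
    for i in range(I):
        others = [sizes[j] for j in range(I) if j != i]
        total = 0.0
        for k in range(I):  # k = size of the coalition drawn from the others
            weight = 1.0 / (I * comb(I - 1, k))  # standard Shapley weight
            for coalition in combinations(others, k):
                n_s = sum(coalition)
                total += weight * (w(n_s + sizes[i]) - w(n_s))
        values.append(total)
    return values


def du_proxy(sizes, w):
    """Assumed discrete-uniform proxy; needs at least two owners.

    For owner i, averages w(k * mu + n_i) - w(k * mu) over k = 0..I-1,
    where mu is the mean dataset size of the other owners. This uses
    2 * I evaluations of w per owner instead of exponentially many.
    """
    I = len(sizes)
    values = []
    for i in range(I):
        mu = (sum(sizes) - sizes[i]) / (I - 1)
        total = sum(w(k * mu + sizes[i]) - w(k * mu) for k in range(I))
        values.append(total / I)
    return values


# Illustrative dataset sizes (made up for this sketch).
sizes = [10, 40, 200, 5]
print("exact:", exact_shapley(sizes, w))
print("proxy:", du_proxy(sizes, w))
```

When all owners hold datasets of the same size, the utility of a coalition depends only on how many owners it contains, so the two functions agree up to floating-point error. With heterogeneous sizes the proxy is only an approximation, and its quality depends on the assumptions above.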
