DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation

Felipe Garrido-Lucero; Benjamin Heymann; Maxime Vono; Patrick Loiseau; Vianney Perchet

TL;DR

This work tackles the dataset valuation problem by leveraging Shapley values to quantify each data owner’s marginal contribution to a collaborative ML task. It introduces DU-Shapley, a discrete uniform Shapley proxy that reduces the number of required utility evaluations from exponential to linear in the number of data owners by exploiting a homogeneous, size-driven structure of the utility via a function $w(n)$. The authors provide both asymptotic convergence results and non-asymptotic error bounds, along with extensive empirical validation on synthetic and real-world datasets, showing that DU-Shapley outperforms or matches existing Monte Carlo Shapley estimators under budget constraints. The method enables efficient and scalable data valuation in settings with many data owners, while also highlighting limitations related to mean-field assumptions and suggesting paths toward a unified theory for Shapley estimators that exploit data structure. Overall, DU-Shapley offers a practical, theory-backed tool for incentive design and fair data sharing in collaborative ML applications.
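As a rough illustration of the size-driven proxy described above, the following Python sketch averages marginal gains of $w$ over coalition sizes $k = 0, \dots, I-1$, replacing each coalition of $k$ owners by $k$ datasets of the average size of the other owners. This specific form is an assumption, since the summary only states that the proxy is an expectation under a discrete uniform distribution requiring a linear number of evaluations. The names `du_shapley` and `w_example`, and the saturating utility itself, are hypothetical.

```python
from typing import Callable, Sequence


def du_shapley(sizes: Sequence[int], w: Callable[[float], float]) -> list[float]:
    """Size-based discrete-uniform proxy (assumed form, see the lead-in)."""
    num_owners = len(sizes)
    if num_owners < 2:
        raise ValueError("need at least two data owners")
    total = sum(sizes)
    values = []
    for n_i in sizes:
        # Average dataset size of the owners other than i.
        mu_others = (total - n_i) / (num_owners - 1)
        # Average the marginal gain of adding n_i points to k "average"
        # datasets, with k uniform on {0, ..., I-1}: 2*I calls to w per owner.
        gain = sum(
            w(k * mu_others + n_i) - w(k * mu_others)
            for k in range(num_owners)
        )
        values.append(gain / num_owners)
    return values


def w_example(n: float) -> float:
    # Hypothetical saturating utility, not taken from the paper.
    return n / (n + 100.0)


if __name__ == "__main__":
    print(du_shapley([50, 100, 200, 400], w_example))
```

Under this assumed form, when all owners hold datasets of equal size $n$, the sum telescopes to $(w(In) - w(0))/I$, which coincides with the exact Shapley value by symmetry and efficiency.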

Abstract

We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain, to some relevant pre-defined utility of a machine learning task, of aggregating an individual dataset to others. The Shapley value is a natural tool to perform dataset valuation due to its formal axiomatic justification, which can be combined with Monte Carlo integration to overcome the computational tractability challenges. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit the knowledge about the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevancy of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments.
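To make the computational cost concrete, here is a minimal Python sketch of the exact Shapley value and of a generic permutation-sampling Monte Carlo estimator. It assumes, for illustration only, that the utility of a coalition depends on the coalition through its total number of data points, $u(\mathcal{S}) = w(n_{\mathcal{S}})$. The function names and the example utility are hypothetical, and the Monte Carlo routine is the standard permutation estimator rather than any specific baseline from the paper.

```python
import itertools
import math
import random
from typing import Callable, Sequence


def exact_shapley(sizes: Sequence[int], w: Callable[[float], float]) -> list[float]:
    """Exact Shapley values for u(S) = w(n_S); enumerates all coalitions."""
    num_owners = len(sizes)
    values = [0.0] * num_owners
    for i in range(num_owners):
        others = [j for j in range(num_owners) if j != i]
        for k in range(num_owners):
            # Probability weight of each coalition of size k not containing i.
            weight = (
                math.factorial(k)
                * math.factorial(num_owners - k - 1)
                / math.factorial(num_owners)
            )
            for coalition in itertools.combinations(others, k):
                n_s = sum(sizes[j] for j in coalition)
                values[i] += weight * (w(n_s + sizes[i]) - w(n_s))
    return values


def monte_carlo_shapley(
    sizes: Sequence[int],
    w: Callable[[float], float],
    num_permutations: int,
    seed: int = 0,
) -> list[float]:
    """Permutation-sampling estimate: average marginal gains over random orders."""
    rng = random.Random(seed)
    num_owners = len(sizes)
    values = [0.0] * num_owners
    order = list(range(num_owners))
    for _ in range(num_permutations):
        rng.shuffle(order)
        n_s = 0
        previous = w(0)
        for i in order:
            n_s += sizes[i]
            current = w(n_s)
            values[i] += current - previous
            previous = current
    return [v / num_permutations for v in values]


if __name__ == "__main__":
    w = lambda n: n / (n + 100.0)  # hypothetical utility
    print(exact_shapley([50, 100, 200], w))
    print(monte_carlo_shapley([50, 100, 200], w, num_permutations=200))
```

The exact routine visits $2^{I-1}$ coalitions per owner, whereas each sampled permutation costs $I$ utility evaluations; both satisfy efficiency, so the values sum to $w(n_{\mathcal{I}}) - w(0)$.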

Paper Structure (26 sections, 11 theorems, 75 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Preliminaries
Problem Formulation
Shapley Value
Discrete Uniform Shapley Value
Algorithm
Homogeneous Setting
Non-Asymptotic Theoretical Guarantees
Numerical Experiments
Synthetic Data
Real-World Data
Conclusion
Impact Statement
Notations and conventions.
Additional details regarding the running example 1
...and 11 more sections

Key Result

Proposition 1

Define $C = \mathbb{E}_{x \sim p_X^{\mathrm{test}}}[x x^\top]$ and assume that $p_X = \mathrm{N}(0_d, \Sigma)$ where $\Sigma \in \mathbb{R}^{d \times d}$ is a positive definite matrix. Then, whenever $n_{\mathcal{S}} > d + 1$, where $u$ is defined in eq:utility_lin_reg and $\sigma_{\varepsilon}^2 = \mathbb{E}_{\varepsilon \sim p_\varepsilon}[\varepsilon^2 ]$. For the specific choice $p_X^{\mathrm

Figures (3)

Figure 1: Illustration of Theorem \ref{theorem:convergence_uniform} -- (left) $I = 10$, (right) $I=500$. We choose the player of index $i=500$ to be the $i$-th one, considering $10^5$ samples for each random variable and a number of data points per player drawn from $\mathrm{U}([100])$. The random variable $\bar{n}_{\mathcal{S}_K^{(i)}}$ stands for $n_{\mathcal{S}_K^{(i)}}$ normalised by the total number of data points as in Theorem \ref{theorem:convergence_uniform}.
Figure 2: Monte Carlo's expected error for limited sampling budget ($T = I$) versus DU-Shapley's expected bias for a value function $w(n) = 1 - \frac{10^{k(\mathcal{I})}}{10^{k(\mathcal{I})} + n}$ where $k(\mathcal{I}) = \lfloor \log(n_{\mathcal{I}}) \rfloor - 1$. For each value of $I$, we drew $100$ times the data points of each player from $\mathrm{U}([n_{\mathrm{max}}])$, with (left) $n_{\mathrm{max}} = 10^2$ and (right) $n_{\mathrm{max}} = 10^4$.
Figure 3: Worst-case comparison between the proposed methodology (constant number of utility function evaluations equal to $I$, illustrated by the vertical black line), and MC-based approximations on synthetic datasets. From left to right, $I = 5$ and $I = 20$. (top) scenario with small heterogeneity, $\eta = 1$ and (bottom) scenario with high heterogeneity, $\eta=5$.
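The value function in the caption of Figure 2 can be written out as a short Python sketch. The caption does not state the base of the logarithm; base 10 is assumed here because the scale is a power of ten. The helper name `make_figure2_w` is hypothetical.

```python
import math
from typing import Callable


def make_figure2_w(total_points: int) -> Callable[[float], float]:
    """Build w(n) = 1 - 10^k / (10^k + n) with k = floor(log10(n_total)) - 1.

    Assumption: the caption's log is taken in base 10.
    """
    if total_points < 1:
        raise ValueError("total_points must be a positive integer")
    k = math.floor(math.log10(total_points)) - 1
    scale = 10.0 ** k

    def w(n: float) -> float:
        # Increasing in n, with w(0) = 0 and w(n) approaching 1 for large n.
        return 1.0 - scale / (scale + n)

    return w


if __name__ == "__main__":
    w = make_figure2_w(5000)  # k = 2, so the scale is 100
    print(w(0), w(100), w(5000))
```

With this choice the scale $10^{k(\mathcal{I})}$ stays one to two orders of magnitude below the total number of data points, so $w$ is close to saturation when evaluated on the grand coalition.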

Theorems & Definitions (24)

Proposition 1
Theorem 1
Definition 1: DU-Shapley in the homogeneous setting
Theorem 2
Proposition S1
proof
Remark S1
Lemma S1
proof
Remark S2
...and 14 more
