Uncertainty Quantification of Data Shapley via Statistical Inference

Mengmeng Wu, Zhihong Liu, Xiang Li, Ruoxi Jia, Xiangyu Chang

TL;DR

This work reframes Data Shapley as an infinite-order U-statistic (IOUS) to capture how shifts in the data distribution affect data valuations, enabling uncertainty quantification through confidence intervals. It establishes asymptotic normality for an estimator based on an incomplete IOUS under deletion-stability conditions and introduces two practical variance-estimation algorithms, Double Monte Carlo (DMC) and Pick-and-Freeze (PF), to support inference. The authors validate the theory with experiments on real datasets, demonstrating convergence to normality, higher interval coverage as the data volume grows, and a data-trading case where confidence intervals improve valuation credibility. Overall, the paper delivers a statistically principled framework for robust data valuation in dynamic data environments, with guidance for practitioners in data markets on when to use each algorithm.

Abstract

As data plays an increasingly pivotal role in decision-making, the emergence of data markets underscores the growing importance of data valuation. Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation. However, a limitation of Data Shapley is its assumption of a fixed dataset, contrasting with the dynamic nature of real-world applications where data constantly evolves and expands. This paper establishes the relationship between Data Shapley and infinite-order U-statistics and addresses this limitation by quantifying the uncertainty of Data Shapley with changes in data distribution from the perspective of U-statistics. We make statistical inferences on data valuation to obtain confidence intervals for the estimations. We construct two different algorithms to estimate this uncertainty and provide recommendations for their applicable situations. We also conduct a series of experiments on various datasets to verify asymptotic normality and propose a practical trading scenario enabled by this method.

Paper Structure

This paper contains 22 sections, 4 theorems, 56 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Theorem 5.1

Let $Z_1,Z_2,\dots \overset{i.i.d.}{\sim}\mathcal{D}$, and let $\hat{\nu}_s(z;n,m_s)=\frac{1}{m_s}\sum_{1\leq i_1< i_2 < \dots < i_s \leq n} h_s(Z_{i_1},\dots,Z_{i_s};z)$ be an incomplete IOUS with kernel $h_s$ that satisfies the stated assumption. Let $\nu_s(z;n)=\mathbb{E}h_s(Z_{i_1},Z_{i_2},\dots,Z_{i_s};z)$ ...
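The estimator in the theorem can be illustrated with a minimal Python sketch of a generic incomplete U-statistic: instead of summing the kernel over every size-$s$ subset, it averages the kernel over $m_s$ randomly drawn subsets. Everything named here is an assumption made for illustration and does not come from the paper: the kernel is taken to be a plain marginal contribution $U(S\cup\{z\})-U(S)$, and `toy_utility`, `marginal_kernel`, and `incomplete_u_statistic` are hypothetical helpers. The paper's actual kernel $h_s$ may be defined differently.

```python
import random


def toy_utility(points):
    """Hypothetical utility: negative squared error of the sample mean against 1.0."""
    if not points:
        return -1.0
    return -((sum(points) / len(points)) - 1.0) ** 2


def marginal_kernel(subset, z, utility):
    """Assumed kernel: gain in utility from adding the point z to the subset."""
    return utility(subset + [z]) - utility(subset)


def incomplete_u_statistic(data, z, s, m, utility, rng):
    """Average the kernel over m random size-s subsets drawn without replacement."""
    total = 0.0
    for _ in range(m):
        subset = rng.sample(data, s)  # one random size-s subset of the data
        total += marginal_kernel(subset, z, utility)
    return total / m


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]
    for z in (1.0, -5.0):
        value = incomplete_u_statistic(data, z, s=10, m=300, utility=toy_utility, rng=rng)
        print(f"estimated value of z={z}: {value:.4f}")
```

In this toy setup a point near the target mean receives a positive estimated value and a far outlier receives a negative one. Because the subsets are random, repeated runs give different estimates, which is the variability that the paper's confidence intervals are meant to quantify.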

Figures (11)

  • Figure 1: This illustration represents the data valuation problem within the framework of cooperative game theory.
  • Figure 2: Estimation of values for 50 samples in the FashionMNIST dataset using the Logistic Regression learning algorithm, with the prediction accuracy as the utility function.
  • Figure 3: The behavior of estimated Data Shapley values. The first panel presents the empirical coverage for 100 randomly selected points, showing the proportion of sample data values falling within their intervals when the data volume $n$ is set to 100, 300, and 500. The second panel displays histograms of the estimated values for a randomly selected sample, with experiments repeated 50 times; the blue line represents the fitted curve, while the black line indicates a normal distribution curve with the same mean and variance. The third panel shows the Q-Q plots of 9 samples.
  • Figure 4: The estimated values and confidence intervals for the top 50 points on the FashionMNIST data, with the data volume set to 100.
  • Figure 5: Comparison of the DMC and PF algorithms on the Covertype dataset. The left panel shows the runtime comparison, with $T=100$ for both algorithms and varying inner loop counts for DMC. The running time of DMC increases rapidly with the number of inner loops, far exceeding that of PF (the vertical axis represents the logarithm of time). The right panel shows the estimated values of $\zeta_{1,s}$ using DMC with varying $T_i$. As $T_i$ increases, the estimated values become more stable and accurate.
  • ...and 6 more figures
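
The runtime gap described in the Figure 5 caption can be illustrated with a minimal Python sketch of the two generic techniques that the algorithm names refer to. It assumes that $\zeta_{1,s}$ has its usual U-statistic meaning, $\mathrm{Var}\big(\mathbb{E}[h_s(Z_1,\dots,Z_s)\mid Z_1]\big)$, and it uses a toy kernel (the sample mean of standard normal draws) for which this quantity equals $1/s^2$. The function names and the toy kernel are hypothetical; this is not the paper's Algorithm 1 or 2.

```python
import random
from statistics import mean, variance


def kernel(points):
    """Toy kernel (an assumption): the sample mean of the points."""
    return sum(points) / len(points)


def draw_rest(s, rng):
    """Draw the s - 1 points that accompany the frozen first point."""
    return [rng.gauss(0.0, 1.0) for _ in range(s - 1)]


def zeta1_double_monte_carlo(s, outer, inner, rng):
    """Nested estimate: the inner loop approximates E[h | Z_1], the outer loop takes its variance."""
    conditional_means = []
    for _ in range(outer):
        z1 = rng.gauss(0.0, 1.0)
        inner_values = [kernel([z1] + draw_rest(s, rng)) for _ in range(inner)]
        conditional_means.append(mean(inner_values))
    # Slightly biased upward by inner-loop noise of order 1 / inner.
    return variance(conditional_means)


def zeta1_pick_and_freeze(s, pairs, rng):
    """Pick-and-freeze estimate: covariance of two kernel values that share Z_1 only."""
    first, second = [], []
    for _ in range(pairs):
        z1 = rng.gauss(0.0, 1.0)  # the frozen coordinate
        first.append(kernel([z1] + draw_rest(s, rng)))
        second.append(kernel([z1] + draw_rest(s, rng)))  # the rest is resampled
    mean_first, mean_second = mean(first), mean(second)
    cross = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    return cross / (pairs - 1)


if __name__ == "__main__":
    rng = random.Random(1)
    s = 5
    print("true value:", 1 / s**2)
    print("double Monte Carlo:", zeta1_double_monte_carlo(s, outer=500, inner=100, rng=rng))
    print("pick-and-freeze:", zeta1_pick_and_freeze(s, pairs=5000, rng=rng))
```

In this sketch the nested estimator needs `outer * inner` kernel evaluations while pick-and-freeze needs `2 * pairs`, which is consistent with the caption's observation that the DMC runtime grows quickly with the number of inner loops.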

Theorems & Definitions (13)

  • Definition 3.1: Data Shapley
  • Definition 3.2: U-statistics (Hoeffding, 1948)
  • Definition 3.3: Infinite-Order U-statistics
  • Definition 3.4: Incomplete Infinite-Order U-statistics
  • Remark 4.1
  • Definition 5.1
  • Remark 5.1
  • Theorem 5.1
  • Remark 5.2
  • Theorem 5.2
  • ...and 3 more
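
For orientation, the first definitions in the list above correspond to standard objects, recalled here in their common textbook form. The notation ($D$, $U$, $\phi$, $U_n$, $h$) is an assumption and may differ from the notation in the paper's Definitions 3.1 to 3.4. For a dataset $D$ of size $n$ and a utility function $U$, the Data Shapley value of a point $z_i \in D$ is usually written as

$$\phi(z_i) = \frac{1}{n}\sum_{S \subseteq D\setminus\{z_i\}} \binom{n-1}{|S|}^{-1}\big[U(S\cup\{z_i\}) - U(S)\big],$$

and a U-statistic of order $s$ with symmetric kernel $h$ is

$$U_n = \binom{n}{s}^{-1}\sum_{1\leq i_1 < \dots < i_s \leq n} h(Z_{i_1},\dots,Z_{i_s}).$$

The statistic is called infinite-order when the order $s = s_n$ grows with $n$, and incomplete when the average runs over only a selected collection of the $\binom{n}{s}$ subsets rather than all of them.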