Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values

R. Teal Witter, Yurong Liu, Christopher Musco

TL;DR

Regression MSR efficiently estimates probabilistic values, including Shapley, Beta Shapley, and weighted Banzhaf values, by combining Monte Carlo sampling with regression-based variance reduction. It learns a function $f$ that approximates the value function $v$ and uses it to form an unbiased estimator $\tilde{\boldsymbol{\phi}}$ with entries $\tilde{\phi}_i = \phi_i(f) + \frac{1}{|\mathcal S|} \sum_{S\in \mathcal S} [v(S) - f(S)] \frac{p_{|S|-1}\mathbf{1}[i\in S] - p_{|S|}\mathbf{1}[i\notin S]}{\mathcal D(S)}$, where $\mathcal S$ is the collection of sampled subsets, $\mathcal D(S)$ is the probability of sampling $S$, and $p_k$ is the weight the probabilistic value assigns to subsets of size $k$. The estimator retains the Maximum Sample Reuse (MSR) property, so each sampled subset contributes to the estimate for every $i$. The framework supports linear and tree-based models (Linear MSR, Tree MSR) and provides a general, unbiased approach with a probabilistic-value-specific error bound. Empirical results across eight datasets show state-of-the-art accuracy: Tree MSR delivers large gains over prior estimators for Shapley values and for broader probabilistic values, and the method is robust to noisy access to the value function. The work contributes a flexible, reproducible method that scales to realistic model explanation and data attribution tasks.
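
To make the estimator above concrete, here is a minimal Python sketch on a toy four-player game. It is an illustration under stated assumptions, not the authors' implementation: the game `v`, the additive surrogate `f`, the uniform sampling distribution, and every function name are hypothetical, and `f` is fixed in advance rather than learned from samples as in the paper.

```python
import itertools
import math
import random


def shapley_weights(n):
    """Shapley weights p_k = k!(n-1-k)!/n! for subset sizes k = 0..n-1."""
    return [1.0 / (n * math.comb(n - 1, k)) for k in range(n)]


def msr_weight(i, S, p):
    """Coefficient of v(S) in phi_i: p_{|S|-1} if i is in S, else -p_{|S|}."""
    return p[len(S) - 1] if i in S else -p[len(S)]


def exact_values(v, n, p):
    """Brute-force phi_i(v) = sum over all 2^n subsets of v(S) * msr_weight."""
    phi = [0.0] * n
    for r in range(n + 1):
        for combo in itertools.combinations(range(n), r):
            S = frozenset(combo)
            for i in range(n):
                phi[i] += v(S) * msr_weight(i, S, p)
    return phi


def regression_msr(v, f, phi_f, samples, density, n, p):
    """phi_i(f) plus the importance-weighted average of the residual v - f."""
    est = list(phi_f)
    for S in samples:
        residual = (v(S) - f(S)) / (density(S) * len(samples))
        for i in range(n):
            est[i] += residual * msr_weight(i, S, p)
    return est


# Hypothetical toy game: additive effects plus one pairwise interaction.
N = 4
A = [1.0, 2.0, -0.5, 0.3]


def v(S):
    return sum(A[i] for i in S) + (1.5 if 0 in S and 1 in S else 0.0)


def f(S):
    """Assumed surrogate: the additive part of v only."""
    return sum(A[i] for i in S)


P = shapley_weights(N)
# For an additive f, phi_i(f) equals its coefficient a_i, because the
# weights satisfy sum_k C(n-1, k) * p_k = 1.
PHI_F = list(A)


def uniform_density(S):
    """Assumed sampling distribution: every subset has probability 2^-n."""
    return 2.0 ** -N


rng = random.Random(0)
SAMPLES = [frozenset(i for i in range(N) if rng.random() < 0.5)
           for _ in range(2000)]
ESTIMATE = regression_msr(v, f, PHI_F, SAMPLES, uniform_density, N, P)
EXACT = exact_values(v, N, P)
```

In this toy game the residual $v(S) - f(S)$ is nonzero only on subsets that contain both interacting players, so the correction term has less to average out than a reweighting of $v(S)$ itself would. That is the intuition behind the $[v(S) - f(S)]^2$ variance dependence described in the Figure 1 caption.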

Abstract

With origins in game theory, probabilistic values like Shapley values, Banzhaf values, and semi-values have emerged as a central tool in explainable AI. They are used for feature attribution, data attribution, data valuation, and more. Since all of these values require exponential time to compute exactly, research has focused on efficient approximation methods using two techniques: Monte Carlo sampling and linear regression formulations. In this work, we present a new way of combining both of these techniques. Our approach is more flexible than prior algorithms, allowing for linear regression to be replaced with any function family whose probabilistic values can be computed efficiently. This allows us to harness the accuracy of tree-based models like XGBoost, while still producing unbiased estimates. From experiments across eight datasets, we find that our methods give state-of-the-art performance for estimating probabilistic values. For Shapley values, the error of our methods can be $6.5\times$ lower than Permutation SHAP (the most popular Monte Carlo method), $3.8\times$ lower than Kernel SHAP (the most popular linear regression method), and $2.6\times$ lower than Leverage SHAP (the prior state-of-the-art Shapley value estimator). For more general probabilistic values, we can obtain error $215\times$ lower than the best estimator from prior work.

Paper Structure

This paper contains 15 sections, 2 theorems, 23 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 2.1

The estimates produced by Algorithm \ref{alg:ours} are unbiased estimates of the probabilistic values. Further, let $\epsilon, \delta > 0$, and let $f_{\max}$ be the learned function $f^{(\ell)}$ with the largest generalization error over $\ell \in [k]$. When run with $m = O\left(\frac{n}{\epsilon\delta}\right)$ samples,

Figures (8)

  • Figure 1: Predicted versus true (normalized) Shapley values for three unbiased estimators given a fixed number of black-box evaluations of the value function, $v$. Each point represents one feature's estimated vs true Shapley value on one dataset. The Monte Carlo estimator uses each sample to estimate only one Shapley value, but has variance that depends on the difference in values between neighboring sets, i.e., $[v(S \cup \{i\}) - v(S)]^2$. The Maximum Sample Reuse (MSR) estimator reuses samples, but has larger variance that depends on the magnitude of the values, i.e., $[v(S)]^2$. Our Regression MSR estimators reuse samples and have smaller variance that depends on how well a learned function $f$ fits the value function $v$, i.e., $[v(S) - f(S)]^2$. Even taking $f$ to be linear gives excellent performance (we call this method Linear MSR). Taking $f$ to be a decision-tree model (Tree MSR) can produce even better estimates for large sample sizes, as shown in Figure \ref{fig:shapley_sample_size}.
  • Figure 2: Average $\ell_2$-error between estimated and true Shapley values as a function of sample size $m$ (number of evaluations of $v$) for various datasets. The lines report the mean error over 100 runs, and $m=10n, 20n, 40n, 80n, 160n, 320n, 640n$. Linear MSR consistently performs comparably to the prior state-of-the-art Leverage SHAP. Meanwhile, the performance of Tree MSR depends on how well the tree-based model approximates the value function; with more samples, it can even outperform Leverage SHAP by several orders of magnitude.
  • Figure 3: Average error between the estimated and true probabilistic values as a function of sample size. Each subplot shows results for a different probabilistic value with the error averaged over all large datasets ($n \geq 30$), for which we used the tree-based method described in Appendix \ref{appendix:tree_prob}. The lines report the mean error over 10 runs. Tree MSR gives the best performance, often by several orders of magnitude when $m$ is large.
  • Figure 4: Generalization error between value function $v$ and learned model $f$ by sample size, averaged over all datasets. When the base model is linear, the learned linear model quickly fits it to machine precision. When the base model is a random forest or a neural network, the error of the linear model plateaus while the random forest and XGBoost learned models continue to improve. This phenomenon is reflected in Figures \ref{fig:prob_complexity_big_n} and \ref{fig:prob_complexity_small_n}; the performance of Tree MSR continues to improve with the number of samples while Linear MSR plateaus.
  • Figure 5: Probabilistic values by subset size for $n=10$. Beta Shapley values B($\alpha, \beta$) generalize Shapley values for $\alpha, \beta \in [1, \infty)$; increasing both $\alpha$ and $\beta$ flattens Beta Shapley values while increasing just $\alpha$ (or just $\beta$) tilts Beta Shapley values. Weighted Banzhaf values WB($p$) generalize Banzhaf values for $p \in (0,1)$; increasing (or decreasing) $p$ tilts weighted Banzhaf values.
  • ...and 3 more figures
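
The subset-size weights behind the Figure 5 caption can be tabulated with a short Python sketch. The Shapley and weighted Banzhaf formulas are the standard ones. The Beta Shapley formula follows a common parameterization in which B(1, 1) recovers Shapley values, but which of $\alpha$ and $\beta$ pairs with which exponent is an assumption here and may differ from the paper's convention. All function names are hypothetical.

```python
import math


def beta_fn(a, b):
    """Euler Beta function B(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def shapley_weights(n):
    """p_k = k!(n-1-k)!/n! for subset sizes k = 0..n-1."""
    return [1.0 / (n * math.comb(n - 1, k)) for k in range(n)]


def weighted_banzhaf_weights(n, p):
    """WB(p): p_k = p^k (1-p)^(n-1-k); p = 0.5 gives the Banzhaf value."""
    return [p ** k * (1.0 - p) ** (n - 1 - k) for k in range(n)]


def beta_shapley_weights(n, alpha, beta):
    """B(alpha, beta) under the assumed convention
    p_k = B(k + beta, n - 1 - k + alpha) / B(alpha, beta)."""
    norm = beta_fn(alpha, beta)
    return [beta_fn(k + beta, n - 1 - k + alpha) / norm for k in range(n)]


def total_mass(weights):
    """sum_k C(n-1, k) * p_k, which should equal 1 for a valid weighting."""
    n = len(weights)
    return sum(math.comb(n - 1, k) * w for k, w in enumerate(weights))


def spread(weights):
    """Ratio of largest to smallest weight; smaller means flatter."""
    return max(weights) / min(weights)


n = 10
for name, w in [
    ("Shapley", shapley_weights(n)),
    ("B(4, 4)", beta_shapley_weights(n, 4, 4)),
    ("B(4, 1)", beta_shapley_weights(n, 4, 1)),
    ("WB(0.5)", weighted_banzhaf_weights(n, 0.5)),
    ("WB(0.7)", weighted_banzhaf_weights(n, 0.7)),
]:
    print(name, [f"{x:.2e}" for x in w])
```

Under this convention, raising both parameters together reduces `spread`, matching the flattening the caption describes, while raising only one makes the weights asymmetric in the subset size.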

Theorems & Definitions (3)

  • Theorem 2.1: Regression-Adjustment Guarantee
  • Theorem A.1: Regression-Adjustment Guarantee
  • Proof: Proof of Theorem \ref{thm:error_bound}
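
The unbiasedness half of Theorem 2.1 can be sketched in a few lines from the estimator given in the TL;DR. The sketch assumes the standard definition of a probabilistic value, $\phi_i(v) = \sum_{S \subseteq [n]\setminus\{i\}} p_{|S|}\,[v(S\cup\{i\}) - v(S)]$, which this summary does not spell out. It also assumes that $f$ is independent of the samples in $\mathcal S$ (for example, because it was fit on held-out samples) and that $\mathcal D(S) > 0$ for every $S$. The shorthand $w_i(S)$ is introduced only for this sketch.

Regrouping the definition by the set on which $v$ is evaluated gives

$$\phi_i(v) = \sum_{S \subseteq [n]} v(S)\, w_i(S), \qquad w_i(S) = p_{|S|-1}\mathbf{1}[i\in S] - p_{|S|}\mathbf{1}[i\notin S].$$

For a single sample $S \sim \mathcal D$, the importance weight $1/\mathcal D(S)$ cancels the sampling probability:

$$\mathbb{E}_{S\sim\mathcal D}\left[\big(v(S) - f(S)\big)\frac{w_i(S)}{\mathcal D(S)}\right] = \sum_{S \subseteq [n]} \big(v(S) - f(S)\big)\, w_i(S) = \phi_i(v) - \phi_i(f).$$

Averaging over the samples in $\mathcal S$ and adding $\phi_i(f)$ then yields

$$\mathbb{E}\big[\tilde{\phi}_i\big] = \phi_i(f) + \phi_i(v) - \phi_i(f) = \phi_i(v).$$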