Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

Xuan Yang; Hsi-Wen Chen; Ming-Syan Chen; Jian Pei

TL;DR

We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, which establishes an information-theoretic lower bound on the number of retraining operations.

Abstract

The Shapley value provides a principled foundation for data valuation, but exact computation is #P-hard due to the exponential coalition space. Existing accelerations remain global and ignore a structural property of modern predictors: for a given test instance, only a small subset of training points influences the prediction. We formalize this model-induced locality through support sets defined by the model's computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs), showing that Shapley computation can be projected onto these supports without loss when locality is exact. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration. We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, establishing an information-theoretic lower bound on retraining operations. Guided by this result, we propose LSMR (Local Shapley via Model Reuse), an optimal subset-centric algorithm that trains each influential subset exactly once via support mapping and pivot scheduling. For larger supports, we develop LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased with exponential concentration, with runtime determined by the number of distinct sampled subsets rather than total draws. Experiments across multiple model families demonstrate substantial retraining reductions and speedups while preserving high valuation fidelity.
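The reuse idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's LSMR algorithm (support mapping and pivot scheduling are not reproduced here); it only memoizes one trained model per distinct subset while computing exact Shapley values restricted to each test point's support. The toy 1-nearest-neighbor utility and the names `POINTS`, `TESTS`, `SUPPORTS`, `train`, `score`, and `local_shapley` are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy data: training id -> (position, label); test id -> (position, label).
POINTS = {"a": (0.0, 1), "b": (1.0, 1), "c": (2.0, 0), "d": (3.0, 0)}
TESTS = {"t1": (0.4, 1), "t2": (1.6, 0)}
# Assumed support sets: the training points that can influence each test point.
SUPPORTS = {"t1": ["a", "b", "c"], "t2": ["b", "c", "d"]}


def train(subset):
    """Stand-in for the expensive retraining step: the toy model memorizes ids."""
    return sorted(subset)


def score(model, t):
    """Toy utility: 1.0 if the nearest memorized point has the test label."""
    if not model:
        return 0.0
    x, y = TESTS[t]
    nearest = min(model, key=lambda z: abs(POINTS[z][0] - x))
    return 1.0 if POINTS[nearest][1] == y else 0.0


def local_shapley(supports, train, score):
    """Exact Shapley values of each test point's game restricted to its support."""
    models = {}  # one trained model per distinct subset, shared across tests

    def model_for(subset):
        if subset not in models:
            models[subset] = train(subset)
        return models[subset]

    values = {}
    for t, support in supports.items():
        n = len(support)
        values[t] = {}
        for z in support:
            rest = [x for x in support if x != z]
            total = 0.0
            for k in range(n):
                # Standard Shapley weight for coalitions of size k among n players.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                for combo in combinations(rest, k):
                    s = frozenset(combo)
                    gain = score(model_for(s | {z}), t) - score(model_for(s), t)
                    total += weight * gain
            values[t][z] = total
    return values, len(models)


values, n_trained = local_shapley(SUPPORTS, train, score)
print(values, n_trained)
```

In this toy setup the two supports overlap in `{b, c}`, so only 12 distinct subsets are trained (8 + 8 minus the 4 shared ones), whereas evaluating every marginal contribution independently would call `train` 48 times.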

Paper Structure (62 sections, 24 theorems, 37 equations, 6 figures, 5 tables, 3 algorithms)

Introduction
Related Work
Data Valuation and Shapley-Based Methods.
Scalable Approximations to Shapley Values.
Locality-Aware Shapley Computation.
Comparison with Prior Work.
Local Shapley
Shapley Value for Data Valuation
Utility induced by retraining.
Global Shapley value.
Computational barrier and structural redundancy.
Local Shapley Value
Support sets as structural locality.
When does locality approximate the global game?
Axiomatic properties.
...and 47 more sections

Key Result

Proposition 1

Under the nonlocal-stability assumption, for any $z \in \mathcal{N}(t)$, [...] where $\phi_z(v_t)$ denotes the global Shapley value computed under the original utility $v_t$. ∎

Figures (6)

Figure 1: (a) Intra-support redundancy: the blue area denotes the support set $\mathcal{N}(t)$. For training points $z_i$ and $z_j$, many training subsets are shared in their Shapley value computations. (b) Inter-support redundancy: test points $t_i$ and $t_j$ have overlapping supports (darker blue). Within this overlap, subsets are shared across their Shapley computations. The yellow dotted subset illustrates one such shared training subset.
Figure 2: Scatter plots of Local Shapley (x-axis) versus Global Shapley (y-axis). Dashed lines indicate linear regression fits.
Figure 3: Test accuracy versus the percentage of training data added in descending Shapley order.
Figure : (a) WKNN
Figure : (a) WKNN
...and 1 more figures

Theorems & Definitions (26)

Definition 1: Local Shapley Value
Proposition 1: Approximation of Global Shapley
Proposition 2: Properties of Local Shapley
Remark 1: Architectural Sources of Locality
Lemma 1: Equivalence Under Projected Utility
Lemma 2: Subset-Centric Reformulation
Theorem 1: Optimal Reuse
Corollary 1: Amortized Reuse
Theorem 2: Unbiasedness of LSMR-A
Theorem 3: Concentration of LSMR-A
...and 16 more
