Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
Jiachen T. Wang, Prateek Mittal, Ruoxi Jia
TL;DR
This paper addresses the computational bottleneck of Data Shapley for weighted K-Nearest Neighbors (WKNN-Shapley) by taking the accuracy of hard-label KNN with discretized weights as the utility function, which removes the complexity introduced by weight normalization. It develops a quadratic-time exact WKNN-Shapley algorithm with runtime $O(W K^2 N^2)$ and a deterministic approximation with runtime $O(W K^2 N M^\star)$, where $W=2^b$ is the size of the discretized weight space and $M^\star$ is a tunable cutoff. The core innovation is a counting-based formulation that reduces the Shapley computation to sums over carefully defined counts $\texttt{G}_{i,\ell}$, which are built by dynamic programming over a table $\texttt{F}_i[m,\ell,s]$, plus a shortcut that reduces the cost further. Empirical results show substantial speedups over the $O(N^K)$ baseline and over Monte Carlo methods. WKNN-Shapley also consistently outperforms unweighted KNN-Shapley in discerning data quality on mislabeled and noisy data tasks, and the deterministic approximation frequently matches the exact method while preserving the fairness properties of the Shapley value. These advances make data valuation with WKNN practical and suggest potential extensions to other learning paradigms through problem-aware reformulations of the utility function.
Abstract
This work addresses an open problem in the data valuation literature: the efficient computation of Data Shapley for weighted $K$-nearest-neighbor algorithms (WKNN-Shapley). By taking the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley as a counting problem and introduce a quadratic-time algorithm, a notable improvement over $O(N^K)$, the best result in the existing literature. We also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
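To make the quantity being computed concrete, the sketch below evaluates Data Shapley by brute force for a hard-label weighted KNN utility with discretized weights, on a single validation point. It is not the paper's counting algorithm: it enumerates every subset and is therefore exponential in $N$, so it is only usable as a reference on tiny datasets. The 1-D features, the Gaussian distance kernel, the rounding scheme in `discretize`, the tie-breaking rule, and the zero utility of the empty set are all assumptions made for illustration, and every function name is hypothetical.

```python
from itertools import combinations
from math import comb, exp


def discretize(w, b):
    """Round a weight in [0, 1] to one of 2**b evenly spaced levels (assumed scheme, b >= 1)."""
    levels = 2 ** b - 1
    return round(w * levels) / levels


def utility(subset, x_val, y_val, K, b):
    """Hard-label weighted KNN accuracy on one validation point.

    `subset` is a list of (feature, label) pairs with scalar features.
    Assumption: the empty set has utility 0.
    """
    if not subset:
        return 0.0
    # Keep the K nearest points (or all of them if the subset is smaller).
    ranked = sorted(subset, key=lambda p: abs(p[0] - x_val))
    votes = {}
    for x, y in ranked[:K]:
        # Assumed Gaussian kernel weight, then discretized.
        w = discretize(exp(-(x - x_val) ** 2), b)
        votes[y] = votes.get(y, 0.0) + w
    # Assumption: ties go to the smallest label.
    pred = max(sorted(votes), key=votes.get)
    return 1.0 if pred == y_val else 0.0


def brute_force_shapley(train, x_val, y_val, K=3, b=3):
    """Exact Shapley values by enumerating all subsets (exponential time)."""
    n = len(train)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            # Shapley weight for subsets of size k that exclude point i.
            coef = 1.0 / (n * comb(n - 1, k))
            for S in combinations(others, k):
                base = [train[j] for j in S]
                with_i = utility(base + [train[i]], x_val, y_val, K, b)
                without_i = utility(base, x_val, y_val, K, b)
                phi += coef * (with_i - without_i)
        values.append(phi)
    return values


# Toy data: two identical points, plus points with mixed labels.
train = [(0.1, 1), (0.1, 1), (0.9, 0), (1.5, 0), (0.3, 0)]
shapley_values = brute_force_shapley(train, x_val=0.0, y_val=1, K=3, b=3)
```

Under these assumptions the values sum to the utility of the full training set (the efficiency property), and the two identical points receive the same value (symmetry). Both checks are a convenient way to validate a faster implementation against this reference on small inputs.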
