Efficient Data Shapley for Weighted Nearest Neighbor Algorithms

Jiachen T. Wang, Prateek Mittal, Ruoxi Jia

TL;DR

This paper addresses the computational bottleneck of Data Shapley for weighted $K$-nearest neighbors (WKNN-Shapley) by reframing the problem under hard-label KNN with discretized weights, which eliminates normalization-induced complexity. It develops a quadratic-time exact WKNN-Shapley algorithm with runtime $O(W K^2 N^2)$ and a deterministic approximation with runtime $O(W K^2 N M^\star)$, where $W=2^b$ is the size of the discretized weight space for $b$-bit weights and $M^\star$ is a tunable cutoff. The core innovation is a counting-based formulation that reduces the Shapley computation to sums over carefully defined counts $\texttt{G}_{i,\ell}$, which are built by dynamic programming from the quantities $\texttt{F}_i[m,\ell,s]$, plus a shortcut that yields additional efficiency. Empirical results show substantial speedups over the $O(N^K)$ baseline and Monte Carlo methods, with WKNN-Shapley consistently outperforming unweighted KNN-Shapley in discerning data quality across mislabeled and noisy data tasks; the deterministic approximation also frequently matches the exact method while preserving fairness properties. These advances enable practical data valuation with WKNN and suggest potential extensions to other learning paradigms through problem-aware utility reformulations.
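
To make the role of $W = 2^b$ concrete, the following is a minimal sketch of $b$-bit weight discretization. It assumes the weights already lie in $[0, 1]$ and uses a uniform grid with round-to-nearest; the function name `discretize_weights` and the grid choice are illustrative assumptions, not necessarily the paper's exact scheme.

```python
def discretize_weights(weights, b):
    """Map each weight in [0, 1] to the nearest of W = 2**b evenly spaced levels.

    Assumption: uniform grid with round-to-nearest; the paper's exact
    discretization rule may differ.
    Returns a list of (level_index, discretized_weight) pairs.
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    W = 2 ** b
    result = []
    for w in weights:
        w = min(max(float(w), 0.0), 1.0)      # clamp into [0, 1]
        level = int(round(w * (W - 1)))       # integer level index in 0..W-1
        result.append((level, level / (W - 1)))
    return result


raw_weights = [0.03, 0.40, 0.77, 1.00]
discretized = discretize_weights(raw_weights, b=3)   # W = 8 levels
levels = [level for level, _ in discretized]
```

Keeping the integer level indices alongside the rounded weights means that vote totals can be compared exactly, which is convenient when a computation is organized around counting.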

Abstract

This work aims to address an open problem in the data valuation literature concerning the efficient computation of Data Shapley for the weighted $K$-nearest-neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley as a counting problem and introduce a quadratic-time algorithm, a notable improvement over $O(N^K)$, the best result in the existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
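
For orientation, the quantity being computed can be written as a brute-force enumeration over all subsets. The sketch below is exponential in $N$ and is not the paper's algorithm; it only illustrates a hard-label weighted KNN utility for a single validation point. The tie rule (a tie counts as incorrect), the zero utility of the empty set, the integer weight levels, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def utility(subset, K, y_val):
    """Hard-label weighted KNN utility on one validation point.

    subset: list of (rank, weight_level, label); rank is the distance
    order to the query (smaller is nearer), weight_level is an integer.
    Assumptions: empty set has utility 0; a tie counts as incorrect.
    """
    if not subset:
        return 0
    nearest = sorted(subset)[:K]              # K nearest by rank
    votes = {}
    for _, w, y in nearest:
        votes[y] = votes.get(y, 0) + w        # weighted vote per class
    best = max(votes.values())
    winners = [y for y, v in votes.items() if v == best]
    return int(len(winners) == 1 and winners[0] == y_val)


def shapley_brute_force(data, K, y_val):
    """Exact Shapley values by enumerating every subset (exponential time)."""
    N = len(data)
    values = []
    for i, z in enumerate(data):
        rest = data[:i] + data[i + 1:]
        total = 0.0
        for size in range(N):
            coeff = 1.0 / (N * comb(N - 1, size))
            for S in combinations(rest, size):
                S = list(S)
                total += coeff * (utility(S + [z], K, y_val) - utility(S, K, y_val))
        values.append(total)
    return values


# Toy data: (rank, integer weight level, label); validation label is 1.
data = [(0, 3, 1), (1, 2, 0), (2, 2, 0), (3, 1, 1)]
values = shapley_brute_force(data, K=3, y_val=1)
```

By the efficiency property, the values sum to the utility of the full dataset minus that of the empty set, which gives a quick sanity check for any faster implementation.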

Paper Structure

This paper contains 50 sections, 21 theorems, 45 equations, 14 figures, 5 tables, 4 algorithms.

Key Result

Theorem 2

For any data point $z_i \in D$ and any subset $S \subseteq D \setminus \{z_i\}$, the marginal contribution has the expression as follows: where

Figures (14)

  • Figure 1: Illustration of the subsets targeted in the counting problem. When $K=3$, both $S_1$ and $S_2$ have a utility of 0, as each contains 2 dogs and 1 cat. Adding $z_i$ to $S_1$ or $S_2$ alters the $3$ nearest neighbors of the query image $x^{(\mathrm{val})}$, which now contain 1 dog and 2 cats, raising the utility to 1. In contrast, $S_3$'s utility remains unchanged with the addition of $z_i$ since it contains only cat images. To compute the WKNN-Shapley value of $z_i$, we count the subsets $S$ for which adding $z_i$ changes the utility, as seen with $S_1$ and $S_2$.
  • Figure 2: Runtime comparison between our exact and approximation algorithms for WKNN-Shapley in Section \ref{sec:shapley-for-binary} and those from \cite{jia2019efficient}, across varying training data sizes $N$. We set $K = 5$, and the weights are discretized to 3 bits. In Appendix \ref{appendix:eval}, we provide additional experiments with different values of $K$ and $b$. For our deterministic approximation algorithm, we set $M^\star = \sqrt{N}$ (so that the time complexity is $O(N^{1.5})$). For the Monte Carlo approximation from \cite{jia2019efficient}, we align the error bounds with ours for a fair comparison and set the failure probability of the Monte Carlo method to $\delta = 0.1$. The plot shows the average runtime over 5 independent runs.
  • Figure 3: AUROC scores of different variants of KNN-Shapley for mislabeled data detection with different $K$s. The higher the curve is, the better the method is.
  • Figure 4: Convergence of the discretization error as the number of bits grows. The $y$-axis shows the $\ell_2$ or $\ell_\infty$ norm of the difference between the Shapley values computed with $b$ bits and with $b+1$ bits; lower is better. We use the Fraud dataset from OpenML \cite{dal2015calibrating} with $K=5$.
  • Figure 5: Distributions of WKNN-Shapley values on subsets of different sizes of the Fraud dataset from OpenML \cite{dal2015calibrating} (number of discretization bits $b = 5$, $K = 5$).
  • ...and 9 more figures
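
The counting view described in the Figure 1 caption can be reproduced on a toy example. The sketch below uses unweighted majority voting with $K=3$, a query whose label is "cat", and hypothetical distances chosen so that $S_1$ and $S_3$ behave as in the caption; the data, the function name `knn_utility`, and the rule that a tie yields utility 0 are assumptions. It tallies, per subset size, the net number of subsets whose utility changes when $z_i$ is added, then combines the tallies with the standard Shapley coefficients.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def knn_utility(subset, K=3, query_label="cat"):
    """1 if a strict majority of the K nearest points match the query label.

    subset: iterable of (distance, label). Assumption: a tie gives utility 0.
    """
    nearest = sorted(subset)[:K]
    cats = sum(1 for _, label in nearest if label == query_label)
    return int(cats > len(nearest) - cats)


z_i = (1.0, "cat")                                   # point being valued
pool = [(0.5, "cat"), (2.0, "dog"), (3.0, "dog"), (4.0, "cat")]

# S1 mirrors the caption: 2 dogs and 1 cat, utility 0 before adding z_i.
S1 = [(0.5, "cat"), (2.0, "dog"), (3.0, "dog")]
# S3 mirrors the caption: cats only, so adding z_i changes nothing.
S3 = [(0.5, "cat"), (4.0, "cat")]
change_S1 = knn_utility(S1 + [z_i]) - knn_utility(S1)
change_S3 = knn_utility(S3 + [z_i]) - knn_utility(S3)

# Net count of utility changes (gains minus losses), grouped by subset size.
N = len(pool) + 1
signed_counts = []
for size in range(N):
    total = 0
    for S in combinations(pool, size):
        total += knn_utility(list(S) + [z_i]) - knn_utility(S)
    signed_counts.append(total)

# Shapley value of z_i as a weighted sum of the per-size counts.
phi = sum(Fraction(c, N * comb(N - 1, s)) for s, c in enumerate(signed_counts))
```

Only the per-size counts enter the final sum, which is why replacing subset enumeration with an efficient counting procedure is enough to obtain the value.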

Theorems & Definitions (46)

  • Definition 1: Shapley value \cite{shapley1953value}
  • Remark 1
  • Remark 2
  • Theorem 2
  • Definition 3
  • Theorem 4
  • Definition 5
  • Theorem 6: Relation between $\texttt{G}_{i, \ell}$ and $\texttt{F}_i$
  • Theorem 7: simplified version
  • Theorem 8
  • ...and 36 more
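
For reference, the Shapley value named in Definition 1 is commonly written in data-valuation notation as follows. Here $D$ is the training set of size $N$ and $v$ is the utility function; this is the standard textbook form, and the paper's own statement may use different notation.

```latex
\phi_{z_i}(v) \;=\; \frac{1}{N} \sum_{k=1}^{N} \binom{N-1}{k-1}^{-1}
\sum_{\substack{S \subseteq D \setminus \{z_i\} \\ |S| = k-1}}
\bigl[\, v(S \cup \{z_i\}) - v(S) \,\bigr]
```

The inner bracket is the marginal contribution of $z_i$ to the subset $S$, the quantity that Theorem 2 characterizes.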