Geometric Data Valuation via Leverage Scores

Rodrigo Mendoza-Smith

TL;DR

The paper tackles the computational burden of data Shapley valuation by introducing ridge leverage scores as a geometric proxy that measures datapoint influence via subspace span and effective dimensionality. It establishes that leverage-based valuations satisfy key axioms (Symmetry, Efficiency, Dummy in the non-ridge case) and uses ridge extensions to avoid hard dimensional saturation, connecting to A- and D-optimal design criteria. The authors prove $\varepsilon$-close decision-quality guarantees for ridge regression when downsampling with ridge-leverage probabilities and demonstrate strong empirical performance in a gradient-free active learning setting on MNIST. The work offers a scalable, model-agnostic data-valuation framework with theoretical guarantees and practical applicability to data-efficient learning and subset selection.
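The contrast between hard dimensional saturation and the ridge extension can be illustrated numerically. The sketch below is not the paper's construction: it assumes the unregularised utility is the dimension of the span and uses the log-determinant of the regularised Gram matrix, the classical D-optimality criterion, as the ridge counterpart. The names `rank_gain`, `logdet_gain`, and `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 3, 0.5
X = rng.normal(size=(8, d))   # 8 points in 3 dimensions

rank_gain, logdet_gain = [], []
A = lam * np.eye(d)           # ridge-regularised Gram matrix of the points seen so far
for k, x in enumerate(X):
    # Marginal gain in span dimension: drops to zero once d independent points are in.
    rank_before = np.linalg.matrix_rank(X[:k]) if k > 0 else 0
    rank_after = np.linalg.matrix_rank(X[: k + 1])
    rank_gain.append(int(rank_after - rank_before))
    # Matrix determinant lemma:
    #   log det(A + x x^T) - log det(A) = log(1 + x^T A^{-1} x),
    # which is strictly positive for any nonzero x because A is positive definite.
    logdet_gain.append(float(np.log1p(x @ np.linalg.solve(A, x))))
    A += np.outer(x, x)

print(rank_gain)
print(np.round(logdet_gain, 3))
```

With generic data the span-dimension gains vanish after the first `d` points, while every log-determinant gain stays positive. The quantity inside the logarithm, $\mathbf{x}^\top \mathbf{A}^{-1} \mathbf{x}$, has the same quadratic form as a ridge leverage score.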

Abstract

Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint's structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to ridge leverage scores yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access to gradients or backward passes.
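As a minimal sketch of the central quantity, the snippet below computes ridge leverage scores under the standard definition $\ell_i(\lambda) = \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{x}_i$. The paper's exact normalisation is not shown in this summary, so this definition, the function name `ridge_leverage_scores`, and the parameter `lam` are assumptions.

```python
import numpy as np

def ridge_leverage_scores(X: np.ndarray, lam: float) -> np.ndarray:
    """Return l_i(lam) = x_i^T (X^T X + lam I)^{-1} x_i for every row x_i of X."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    # Solve A Z = X^T rather than forming the inverse explicitly.
    Z = np.linalg.solve(A, X.T)            # shape (d, n)
    return np.einsum("ij,ji->i", X, Z)     # row-wise quadratic form x_i^T A^{-1} x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores = ridge_leverage_scores(X, lam=1.0)

# The scores sum to the effective dimension tr(X^T X (X^T X + lam I)^{-1}),
# which is strictly smaller than the ambient dimension d whenever lam > 0.
eff_dim = np.trace(X.T @ X @ np.linalg.inv(X.T @ X + 1.0 * np.eye(5)))
print(scores.sum(), eff_dim)
```

Only forward quantities (the representations in `X`) are needed, which is consistent with the abstract's point that no gradients or backward passes are required.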

Paper Structure

This paper contains 8 sections, 6 theorems, 59 equations, 1 figure.

Key Result

Theorem 1

Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be a data matrix with rows $\mathbf{x}_1^\top, \dots, \mathbf{x}_n^\top$ and define [...]. Then, if $\operatorname{rank}(\mathbf{X}) = d$, $\phi_U$ satisfies the symmetry (eq:symmetry), efficiency (eq:efficiency), and dummy (eq:dummy) axioms of data Shapley for $U(S) := \operatorname{span}\left\{ \mathbf{x}_i : i \in S\right\}$ for all $S \s
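The three axioms can be checked numerically on the simplest instances. The sketch below assumes that $\phi_U$ is the classical leverage score $\mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i$ and that the utility is read as the dimension of the span; both are assumptions, since the definition is missing from the statement above. The helper `leverage_scores` is illustrative.

```python
import numpy as np

def leverage_scores(X: np.ndarray) -> np.ndarray:
    # With a thin QR factorisation X = QR of a full-column-rank X,
    # the squared row norms of Q equal the leverage scores.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[1] = X[0]     # two identical datapoints (symmetry case)
X[2] = 0.0      # a datapoint that never extends the span (dummy case)

h = leverage_scores(X)
print("sum of scores:", h.sum())          # efficiency: equals rank(X) = d
print("identical rows:", h[0], h[1])      # symmetry: equal scores
print("zero row:", h[2])                  # dummy: zero score
```

Identical rows and a zero row are only the most direct instances of the symmetry and dummy conditions; the theorem itself is stated for general subsets.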

Figures (1)

  • Figure 1: (Top) Test accuracy versus number of labeled samples for six AL strategies on MNIST. (Bottom) Final test accuracy after 40 acquisition rounds.

Theorems & Definitions (10)

  • Theorem 1: Shapley axioms
  • Proposition 2: Shapley axioms for Ridge leverage
  • Theorem 3: $\varepsilon$-close to the full-data ridge solution
  • proof
  • Theorem 4: Matrix Chernoff (Tropp, 2012)
  • Lemma 5: Bounds on scalar factors
  • proof
  • Lemma 6: Ridge contraction in the $\|\cdot\|_{\mathbf{A}}$ norm
  • proof
  • proof
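Theorem 3 concerns downsampling with ridge-leverage probabilities. The sketch below shows one common way to realise this on synthetic data: draw rows with replacement with probability proportional to their ridge leverage scores, reweight each draw by $1/(m p_i)$, and compare the subsampled ridge solution with the full-data one. The sampling scheme, the reweighting, the synthetic data, and all names (`ridge_solve`, `m`, `lam`) are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ridge_solve(X, y, lam, w=None):
    """Minimise sum_i w_i (x_i^T b - y_i)^2 + lam * ||b||^2 (w defaults to ones)."""
    if w is None:
        w = np.ones(len(y))
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw + lam * np.eye(X.shape[1]), Xw.T @ y)

rng = np.random.default_rng(3)
n, d, m, lam = 2000, 10, 500, 1.0
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ridge leverage scores x_i^T (X^T X + lam I)^{-1} x_i, normalised to probabilities.
A = X.T @ X + lam * np.eye(d)
scores = np.einsum("ij,ji->i", X, np.linalg.solve(A, X.T))
p = scores / scores.sum()

# Sample m rows with replacement; the importance weights make the
# subsampled Gram matrix an unbiased estimate of X^T X.
idx = rng.choice(n, size=m, replace=True, p=p)
weights = 1.0 / (m * p[idx])

beta_full = ridge_solve(X, y, lam)
beta_sub = ridge_solve(X[idx], y[idx], lam, weights)
rel_err = np.linalg.norm(beta_sub - beta_full) / np.linalg.norm(beta_full)
print(f"relative parameter error with {m}/{n} rows: {rel_err:.4f}")
```

On well-conditioned synthetic data like this, uniform sampling would also do well; leverage-based probabilities matter most when a few rows carry directions that the rest of the data barely covers.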