Data value estimation on private gradients

Zijian Zhou, Xinyi Xu, Daniela Rus, Bryan Kian Hsiang Low

TL;DR

This work shows that enforcing DP through i.i.d. gradient perturbations makes data-value estimates unreliable as the evaluation budget $k$ grows, because the estimation uncertainty scales with $k$. It introduces a correlated-noise framework that reuses private gradients across iterations through a matrix-based combination, reducing the estimation uncertainty to $\tilde{O}(1)$ in theory and yielding practical improvements in experiments. The method applies to gradient-based semivalues such as Shapley, Beta Shapley, and Banzhaf, and demonstrates stronger data pruning, better detection of mislabeled data, and applicability to dataset valuation and federated learning. The approach balances privacy with informative valuation by leveraging post-processing immunity and burn-in strategies, offering a scalable path for privacy-aware data attribution in distributed ML settings.
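To make the notion of a gradient-based semivalue concrete, the following is a minimal sketch of a permutation-sampling, Shapley-style estimator on a toy linear regression model. It is not the authors' implementation: the function names, the squared loss, the learning rate, and the `perturb` hook (a hypothetical place where a DP mechanism would act on each gradient) are all assumptions made for illustration. Here `k` plays the role of the evaluation budget, i.e. the number of sampled permutations.

```python
import numpy as np

def grad_sq_loss(theta, x, y):
    # Gradient of 0.5 * (x @ theta - y)^2 with respect to theta.
    return (x @ theta - y) * x

def utility(theta, X_val, y_val):
    # Negated validation loss, so that higher is better.
    return -0.5 * np.mean((X_val @ theta - y_val) ** 2)

def permutation_values(X, y, X_val, y_val, k, lr=0.05, perturb=None, seed=0):
    """Average marginal utility gain of each point over k random permutations.

    `perturb(g, rng)` is a hypothetical hook for a privacy mechanism;
    with perturb=None the raw gradient is used.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    values = np.zeros(n)
    for _ in range(k):
        theta = np.zeros(d)                    # fresh model per permutation
        prev = utility(theta, X_val, y_val)
        for j in rng.permutation(n):
            g = grad_sq_loss(theta, X[j], y[j])
            if perturb is not None:
                g = perturb(g, rng)            # e.g. clip and add noise
            theta = theta - lr * g             # one gradient step on point j
            cur = utility(theta, X_val, y_val)
            values[j] += cur - prev            # marginal contribution of j
            prev = cur
    return values / k
```

Passing a `perturb` function that draws fresh noise on every call corresponds to the i.i.d. setting discussed above; the correlated-noise scheme of the paper is not reproduced here.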

Abstract

For gradient-based machine learning (ML) methods commonly adopted in practice, such as stochastic gradient descent, the de facto differential privacy (DP) technique is to perturb the gradients with random Gaussian noise. Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP, such as data pricing, collaborative ML, and federated learning (FL). Can existing data valuation methods still be used when DP is enforced via gradient perturbations? We show that the answer is no with the default approach of injecting i.i.d. random noise into the gradients, because the estimation uncertainty of the data value estimates paradoxically scales linearly with the estimation budget, producing estimates almost like random guesses. To address this issue, we propose to instead inject carefully correlated noise to provably remove the linear scaling of estimation uncertainty w.r.t. the budget. We also empirically demonstrate that our method gives better data value estimates on various ML tasks and is applicable to use cases including dataset valuation and FL.
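As an illustration of the gradient perturbation described in the abstract, here is a minimal sketch of the usual clip-then-add-Gaussian-noise step. The names `clip_norm` and `noise_multiplier` and the helper functions are assumptions for this example, and the noise scale needed for a given privacy guarantee is not derived here.

```python
import numpy as np

def privatize_gradient(grad, clip_norm, noise_multiplier, rng):
    """Clip grad to L2 norm at most clip_norm, then add isotropic Gaussian noise."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = grad * scale
    # Noise standard deviation is proportional to the clipping norm.
    noise = rng.normal(0.0, clip_norm * noise_multiplier, size=grad.shape)
    return clipped + noise

def private_step(theta, grad, lr, clip_norm, noise_multiplier, rng):
    # One descent step using the perturbed gradient instead of the raw one.
    return theta - lr * privatize_gradient(grad, clip_norm, noise_multiplier, rng)
```

Calling `privatize_gradient` with a fresh draw from `rng` at every evaluation is the i.i.d. scheme that the abstract argues against for data valuation.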

Paper Structure

This paper contains 60 sections, 13 theorems, 121 equations, 11 figures, 8 tables, 1 algorithm.

Key Result

Proposition 5.1

(I.I.D. Noise) $\forall t \in[k]$, denote $\boldsymbol{\theta}_{\pi^t} \coloneqq \boldsymbol{\theta}^p_{\pi^t} - \alpha\tilde{g}_{\pi^t} = \boldsymbol{\theta}^p_{\pi^t} - \alpha(\hat{g}_{\pi^t} + z_t)$ where $\forall t \in [k], z_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\boldsymbol{0}, k(C\sigma) or the negated $\ell_2$-regularized cross-entropy loss on a logistic regression model where $\text

Figures (11)

  • Figure 1: Summary of theoretical results. $\sigma_g^2$ refers to the average variance of unperturbed gradients. $q \in (0, 1)$ is a hyperparameter.
  • Figure 2: (a) $n^{-1}\sum_{j \in [n]} s_j^2 / |\mu_j|$ and (b) $\mu_j$ vs. $k$ using i.i.d. noise and correlated noise. (c) Error bars of test accuracy vs. ratio of data removed with the highest $\psi$'s using different $k$ with i.i.d. noise, and (d) also with correlated noise and no DP. "Random" means random removal.
  • Figure 2: Mean (std. errors) of AUC on Covertype trained with LR (top) and MNIST trained with CNN (bottom). The best score is highlighted. Higher is better.
  • Figure 3: Plots of AUC vs. burn-in ratio $q \in [0,1)$ (with $q=0$ equivalent to $\boldsymbol{X}^*$). $V$ is (left and middle) negated test loss and (right) test accuracy. Lines represent the mean and shaded regions represent $1$ standard deviation. Higher is better.
  • Figure 4: Plot of $f$ vs. $v \in [1, 50]$.
  • ...and 6 more figures

Theorems & Definitions (26)

  • Definition 3.1: $(\epsilon,\delta)$-Differential Privacy dwork2014
  • Proposition 5.1
  • Definition 5.2: Diagonal Multivariate Sub-Gaussian Distribution
  • Proposition 5.3: Correlated Noise with $\boldsymbol{X}$, informal
  • Proposition 5.4: Correlated Noise with $\boldsymbol{Y}$, informal
  • proof
  • Lemma C.2
  • proof
  • Proposition C.3: Reproduced from \ref{prop:concentration_iid}.
  • proof: Proof of \ref{prop:concentration_iid}.
  • ...and 16 more
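
For reference, Definition 3.1 in the list above cites the standard notion of $(\epsilon,\delta)$-differential privacy. In its usual textbook form (the paper's exact notation and its choice of neighboring relation may differ, so the symbols $\mathcal{M}$, $D$, $D'$, and $S$ below are assumptions), a randomized mechanism $\mathcal{M}$ is $(\epsilon,\delta)$-DP if, for all neighboring datasets $D, D'$ and all measurable output sets $S$,

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
$$

Any function applied to the output of $\mathcal{M}$ without further access to the data satisfies the same guarantee, which is the post-processing immunity mentioned in the TL;DR.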