DeRDaVa: Deletion-Robust Data Valuation for Machine Learning

Xiao Tian, Rachael Hwee Ling Sim, Jue Fan, Bryan Kian Hsiang Low

TL;DR

This work tackles data valuation under data deletion by introducing DeRDaVa, a deletion-robust valuation framework that anticipates future deletions via a random staying set D with distribution P_D. It builds on semivalue theory by deriving an NPO-consistent extension Φ and defining the score of data source d_i as τ_i(v) = E_{D∼P_D}[ I[d_i∈D] · φ_i^{|D|}(v) ], so that the scores remain fair under anticipated deletions. To accommodate different risk preferences, it extends to Risk-DeRDaVa, which uses coalitional CVaR (C-CVaR^{∓}_α) to model worst-/best-case utilities. The paper provides Monte-Carlo and 012-MCMC approximation methods with theoretical guarantees and validates the approach on real datasets, showing that DeRDaVa favors staying, non-redundant, and high-quality data while avoiding recomputation after deletions. Overall, DeRDaVa offers a principled, deletion-aware data valuation framework aimed at regulated and privacy-conscious ML deployments.
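
To make the definition of τ_i(v) concrete, the following is a minimal sketch that evaluates the expectation exactly for a tiny game. It assumes a Shapley prior (so φ^{|D|} is the ordinary Shapley value of the game restricted to D) and independent staying probabilities; the names `shapley`, `derdava_shapley`, `utility`, and the utility numbers are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import factorial


def shapley(players, v):
    """Exact Shapley values of the game v restricted to `players`."""
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k in an n-player game.
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                S = frozenset(S)
                total += w * (v(S | {i}) - v(S))
        values[i] = total
    return values


def derdava_shapley(players, v, stay_prob):
    """tau_i = E_D[ 1[i in D] * shapley_i(v restricted to D) ].

    Assumption: each source stays independently with probability stay_prob[i].
    """
    tau = {i: 0.0 for i in players}
    for k in range(1, len(players) + 1):
        for D in combinations(players, k):
            # Probability that exactly the sources in D stay.
            p = 1.0
            for i in players:
                p *= stay_prob[i] if i in D else 1.0 - stay_prob[i]
            if p == 0.0:
                continue
            phi = shapley(list(D), v)
            for i in D:
                tau[i] += p * phi[i]
    return tau


def utility(S):
    # Hypothetical utility that depends only on coalition size,
    # so the two sources are interchangeable.
    return {0: 0.0, 1: 0.6, 2: 1.0}[len(S)]


players = ["star", "square"]
stay_prob = {"star": 1.0, "square": 0.7}
print(shapley(players, utility))                    # equal scores for both sources
print(derdava_shapley(players, utility, stay_prob)) # "star" scores higher than "square"
```

Under these assumed numbers both sources receive the same Shapley score, while the source that always stays receives the higher deletion-robust score, which mirrors the qualitative behaviour the summary attributes to DeRDaVa. The enumeration over all staying sets is exponential in the number of sources, so it is only suitable for illustration.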

Abstract

Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/risk-seeking model owners who are concerned with the worst-/best-case model utility. We also empirically demonstrate the practicality of our solutions.
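
As a rough illustration of how such scores might be approximated by sampling, the sketch below draws a staying set and then a random ordering of it, and averages marginal contributions. This is a generic permutation-sampling estimator under a Shapley prior with independent staying probabilities, not necessarily the paper's Monte-Carlo or 012-MCMC procedure; `mc_derdava_shapley`, `utility`, and all numbers are hypothetical.

```python
import random


def mc_derdava_shapley(players, v, stay_prob, num_samples=20000, seed=0):
    """Monte Carlo estimate of E_D[ 1[i in D] * shapley_i(v restricted to D) ].

    Assumptions: Shapley prior; each source stays independently with
    probability stay_prob[i].
    """
    rng = random.Random(seed)
    tau = {i: 0.0 for i in players}
    for _ in range(num_samples):
        # Sample the staying set D, then a uniformly random ordering of D.
        D = [i for i in players if rng.random() < stay_prob[i]]
        rng.shuffle(D)
        S = frozenset()
        prev = v(S)
        for i in D:
            # Marginal contribution of i to the sources preceding it.
            S = S | {i}
            cur = v(S)
            tau[i] += cur - prev
            prev = cur
    return {i: total / num_samples for i, total in tau.items()}


def utility(S):
    # Hypothetical utility depending only on coalition size.
    return {0: 0.0, 1: 0.6, 2: 1.0}[len(S)]


estimate = mc_derdava_shapley(
    ["star", "square"], utility, {"star": 1.0, "square": 0.7}
)
print(estimate)
```

Each sample costs one utility evaluation per staying source, so the cost grows with the number of samples rather than with the number of possible staying sets. A source that is deleted in a sample contributes nothing in that sample, which is how the indicator in the definition is realised.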

Paper Structure

This paper contains 46 sections, 5 theorems, 31 equations, 14 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

[NPO-extension] Every semivalue $\phi^n: G^n \to \mathbb{R}^n$ can be uniquely extended to a sequence of semivalues $\Phi = \langle \phi^k : k = 1, 2, \cdots, n \rangle$ that is NPO-consistent through the following unified NPO-extension process:

Figures (14)

  • Figure 1: Comparison of Data Shapley vs. the deletion-robustified DeRDaVa and Risk-DeRDaVa scores with Shapley prior for a game with $2$ interchangeable sources. $\bigstar$ and $\blacksquare$ stay with probability $1$ and $0.7$ respectively. $\bigstar$ and $\blacksquare$ have equal Data Shapley scores, but $\bigstar$ has higher DeRDaVa and Risk-averse DeRDaVa scores. This is because Data Shapley (Eq. (\ref{eqn:semivalue})) considers only the initial support set $\{\bigstar, \blacksquare\}$, while DeRDaVa (Eq. (\ref{eqn:urdava})) and Risk-averse DeRDaVa (Eq. (\ref{eqn:risk-urdava})) also consider the worst-case support set $\{\bigstar\}$. Further explanation is included in App. \ref{appendix:numerical-comparison}.
  • Figure 2: Model owners with different risk attitudes map the random utility function $V(S)$ evaluated at coalition $S$ to a deterministic value differently. The risk-neutral owner (\ref{img:risk-neutral}) takes the expectation (blue) over all possible utilities. A risk-averse (\ref{img:risk-averse}) / risk-seeking (\ref{img:risk-seeking}) owner takes the expectation over the lower/worst $0.6$-tail and the upper/best $0.6$-tail respectively.
  • Figure 3: DeRDaVa accounts for data deletions. (\ref{fig:sp-10-diabetes-svm-beta-16-4}) (11 data sources) and (\ref{fig:sp-20-creditcard-nb-banzhaf}) (21 data sources) show the effect of staying probability on DeRDaVa scores with Beta Shapley and Data Banzhaf priors; (\ref{fig:ds-synthetic-dataset-visualization}) and (\ref{fig:ds-synthetic-dataset-result}) show when the DeRDaVa score of a redundant data source exceeds its Banzhaf score.
  • Figure 4: Point addition and removal experiments. All experiments are run using [NB-Wd], $100$ data sources and Data Banzhaf prior.
  • Figure 5: When data sources stay with independent (\ref{fig:dd-10-phoneme-logistic-shapley}) and dependent (\ref{fig:dd-10-wind-nb-beta-16-1}) probabilities, the recomputed semivalue scores of $10$ data sources always converge to DeRDaVa scores and deviate from pre-deletion scores; (\ref{fig:rd-10-phoneme-logistic-shapley}) and (\ref{fig:rd-10-diabetes-nb-beta-16-4}) compare Risk-DeRDaVa with DeRDaVa and semivalues.
  • ...and 9 more figures
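
The tail expectations described in the Figure 2 caption can be illustrated with a small empirical sketch. It assumes equally weighted utility samples and simply averages the lowest or highest α-fraction of them; the paper's C-CVaR definition may treat probability atoms at the tail boundary differently. The function name `tail_mean` and the sample values are hypothetical.

```python
import math


def tail_mean(samples, alpha, lower=True):
    """Mean of the lowest (lower=True) or highest alpha-fraction of samples.

    Assumption: samples are equally weighted draws of the utility V(S).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples, reverse=not lower)
    # Number of samples in the tail; the small epsilon guards against
    # floating-point error in alpha * len(samples).
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    return sum(ordered[:k]) / k


utilities = [0.2, 0.4, 0.5, 0.7, 0.9]          # hypothetical draws of V(S)
risk_neutral = sum(utilities) / len(utilities)  # expectation over all draws
risk_averse = tail_mean(utilities, 0.6, lower=True)    # lower/worst 0.6-tail
risk_seeking = tail_mean(utilities, 0.6, lower=False)  # upper/best 0.6-tail
print(risk_averse, risk_neutral, risk_seeking)
```

With α = 1 both tails coincide with the risk-neutral expectation, and for smaller α the risk-averse value can only decrease while the risk-seeking value can only increase.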

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Definition 4
  • Theorem 2
  • Definition 5
  • Theorem 3
  • Theorem 4
  • Definition 6
  • ...and 1 more