DeRDaVa: Deletion-Robust Data Valuation for Machine Learning
Xiao Tian, Rachael Hwee Ling Sim, Jue Fan, Bryan Kian Hsiang Low
TL;DR
This work tackles data valuation under data deletion by introducing DeRDaVa, a deletion-robust valuation framework that anticipates future deletions via a random staying set D with distribution P_D. It builds on semivalue theory by deriving an NPO-consistent extension Φ and defining τ_i(v) = E_{D}[ I[d_i∈D] · φ_i^{|D|}(v) ], where I[·] is the indicator function and the expectation is taken over D drawn from P_D, so that the valuation remains fair despite deletions. To accommodate different risk preferences, it extends to Risk-DeRDaVa, which uses coalitional CVaR (C-CVaR^{∓}_α) to model worst/best-case utilities. The paper provides Monte-Carlo and 012-MCMC approximation methods with theoretical guarantees and validates the approach on real datasets, showing that DeRDaVa favors staying, non-redundant, and high-quality data while avoiding recomputation after deletions. Overall, DeRDaVa offers a scalable, principled, and deletion-aware data valuation framework with practical applicability for regulated and privacy-conscious ML deployments.
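To make the definition of τ_i(v) concrete, the following is a minimal sketch that evaluates the expectation exactly for a tiny game by enumerating every possible staying set D. It is not the authors' implementation. It assumes two things the summary above does not fix: the semivalue φ is the Shapley value, and each source stays independently with its own probability. The names `shapley`, `derdava`, and `stay_prob` are hypothetical and chosen only for illustration. Exact enumeration is exponential in the number of sources, which is why sampling-based approximation is needed in practice.

```python
from itertools import combinations
from math import factorial


def shapley(players, v):
    """Exact Shapley values of the game v restricted to `players`.

    Assumption: the Shapley value is used as the semivalue phi.
    """
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        values[i] = total
    return values


def derdava(players, v, stay_prob):
    """Exact tau_i(v) = E_D[ I[d_i in D] * phi_i^{|D|}(v) ] by enumeration.

    Assumption: sources stay independently, source p with
    probability stay_prob[p] (a hypothetical choice of P_D).
    """
    scores = {i: 0.0 for i in players}
    for k in range(len(players) + 1):
        for staying in combinations(players, k):
            # Probability that exactly this set of sources stays.
            prob = 1.0
            for p in players:
                prob *= stay_prob[p] if p in staying else 1.0 - stay_prob[p]
            if prob == 0.0:
                continue
            # Semivalue of each staying source in the game on D only;
            # sources outside D contribute nothing for this draw.
            phi = shapley(staying, v)
            for i in staying:
                scores[i] += prob * phi[i]
    return scores


# Toy additive utility: each source contributes a fixed amount.
worth = {"a": 1.0, "b": 2.0, "c": 3.0}
stay_prob = {"a": 1.0, "b": 0.5, "c": 0.25}


def utility(coalition):
    return sum(worth[p] for p in coalition)


scores = derdava(list(worth), utility, stay_prob)
```

In this additive toy game, a source's Shapley value equals its own worth in every subgame, so each score reduces to the staying probability times the worth. Source "c" has the highest worth but is the most likely to be deleted, so it ends up valued below "a" and "b", which illustrates how the framework favors data that is likely to stay.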
Abstract
Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher values to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/seeking model owners who are concerned with the worst-/best-case model utility. We also empirically demonstrate the practicality of our solutions.
