Challenges in Enabling Private Data Valuation

Yiwei Fu; Tianhao Wang; Varun Chandrasekaran

TL;DR

This work identifies the core algorithmic primitives across common valuation frameworks that induce prohibitive sensitivity, explaining why straightforward DP mechanisms fail. It also derives design principles for more privacy-amenable valuation procedures and empirically characterizes how privacy constraints degrade ranking fidelity across representative methods and datasets.

Abstract

Data valuation methods quantify how individual training examples contribute to a model's behavior, and are increasingly used for dataset curation, auditing, and emerging data markets. As these techniques become operational, they raise serious privacy concerns: valuation scores can reveal whether a person's data was included in training, whether it was unusually influential, or what sensitive patterns exist in proprietary datasets. This motivates the study of privacy-preserving data valuation. However, privacy is fundamentally in tension with valuation utility under differential privacy (DP). DP requires outputs to be insensitive to any single record, while valuation methods are explicitly designed to measure per-record influence. As a result, naive privatization often destroys the fine-grained distinctions needed to rank or attribute value, particularly in heterogeneous datasets where rare examples exert outsized effects. In this work, we analyze the feasibility of DP-compatible data valuation. We identify the core algorithmic primitives across common valuation frameworks that induce prohibitive sensitivity, explaining why straightforward DP mechanisms fail. We further derive design principles for more privacy-amenable valuation procedures and empirically characterize how privacy constraints degrade ranking fidelity across representative methods and datasets. Our results clarify the limits of current approaches and provide a foundation for developing valuation methods that remain useful under rigorous privacy guarantees.
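The tension described above can be illustrated with a minimal sketch. It is not taken from the paper: the scores are synthetic, the helper names `laplace_release` and `spearman` are hypothetical, and the sensitivity values are assumed inputs rather than derived bounds, so the code shows how noise calibrated to a large sensitivity erodes ranking fidelity, not a privacy proof for any valuation method.

```python
import numpy as np

def laplace_release(scores, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to each score.

    `sensitivity` is an assumed bound supplied by the caller; this sketch
    does not certify it for any real valuation method.
    """
    scale = sensitivity / epsilon
    return scores + rng.laplace(loc=0.0, scale=scale, size=scores.shape)

def spearman(a, b):
    """Spearman rank correlation for arrays without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

rng = np.random.default_rng(0)
scores = rng.standard_normal(1000)  # synthetic stand-in for valuation scores

# Same privacy budget, two hypothetical sensitivity bounds.
rho_small = spearman(scores, laplace_release(scores, 0.01, 1.0, rng))
rho_large = spearman(scores, laplace_release(scores, 100.0, 1.0, rng))

print(f"rank correlation, sensitivity 0.01: {rho_small:.3f}")
print(f"rank correlation, sensitivity 100:  {rho_large:.3f}")
```

When the assumed sensitivity is small relative to the spread of the scores, the ranking survives almost intact; when it is large, the released ranking is close to random.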

Paper Structure (24 sections, 36 equations, 5 figures, 3 tables)

Introduction
Background
Machine Learning Primer
Threat Model
Methods of Data Valuation
Influence & Curvature Approximations
Privacy driver (curvature amplification).
Weighted Marginal Contributions
Privacy driver (utility instability and coalition extrema).
Trajectory-Aware Approximations
Privacy driver (compositional exposure along the trajectory).
Data Modeling and Linearized Attribution
Privacy driver (feature-space smoothing without certification).
Lessons Learned
Privacy Challenges in Influence & Curvature Approximations
...and 9 more sections

Figures (5)

Figure 1: Private valuation is an understudied topic. The plot shows the number of accepted data valuation papers, and of privacy-focused data valuation papers, over recent years at top ML conferences (NeurIPS, ICLR, ICML). Papers are first filtered with a keyword search, then processed by Gemini to confirm their relevance to data valuation. While attention to data valuation has grown, privacy-preserving data valuation has received little focus.
Figure 2: Spectral distribution of the empirical Hessian. The distribution of eigenvalues $\mu$ for the empirical Hessian $H$ of a CNN trained on MNIST. The high concentration of eigenvalues with magnitude near zero is consistent with the property of flatness in the loss landscape at convergence. This implies that the operator norm $\|H^{-1}\|$ is difficult to bound and potentially ill-defined.
Figure 3: Distribution of influence scores. Influence scores for 100 sampled training data points on a fixed validation point. Most influence scores are centered around zero, but extreme outliers exist due to curvature amplification. The presence of these outliers impacts sensitivity estimation.
Figure 4: Ratio of estimated sensitivity to average score magnitude. Even with tight clipping, the noise required (proportional to sensitivity) overpowers the signal of the average data point. The ratio remains $>1$, indicating that the noise floor exceeds the signal for the majority of the distribution.
Figure 5: ROC curves evaluating mislabel detection performance across different privacy budgets. Results are very similar for the TracIn and the 2nd-order in-run Shapley methods. With regular training, the valuation methods are very successful at detecting mislabeled data, while DP-SGD degrades performance only slightly. Increasing the DP noise level reduces mislabel detection performance only very slightly.
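The observation in the Figure 2 caption, that near-zero Hessian eigenvalues make $\|H^{-1}\|$ hard to bound, can be made concrete with a small numerical sketch. The matrix below is a synthetic 4x4 symmetric matrix with hand-picked eigenvalues, not the paper's CNN Hessian, and the helper `inverse_operator_norm` and its `damping` parameter are assumptions introduced for illustration (damping is a common practical remedy, not a method confirmed by this excerpt).

```python
import numpy as np

def inverse_operator_norm(H, damping=0.0):
    """Operator norm of (H + damping * I)^{-1} for a symmetric matrix H.

    For a symmetric matrix this equals 1 / min |eigenvalue|.
    """
    shifted = H + damping * np.eye(H.shape[0])
    eigvals = np.linalg.eigvalsh(shifted)
    return 1.0 / np.min(np.abs(eigvals))

# Hand-picked spectrum with mass near zero, mimicking a flat loss landscape.
eigs = np.array([1e-6, 1e-3, 0.5, 2.0])
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal basis
H = Q @ np.diag(eigs) @ Q.T
H = (H + H.T) / 2  # remove floating-point asymmetry

undamped = inverse_operator_norm(H)
damped = inverse_operator_norm(H, damping=0.01)

print(f"undamped norm of H^-1: {undamped:.3e}")
print(f"damped norm (0.01):    {damped:.3e}")
```

A single eigenvalue near zero dominates the inverse norm, which is the curvature amplification that produces the outliers in Figure 3. The damping term caps the norm at roughly the reciprocal of the damping value, at the cost of changing the quantity being computed.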

Theorems & Definitions (1)

Definition 1: Differential Privacy (Dwork and Roth, 2014)
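For reference, the standard $(\varepsilon, \delta)$ statement of differential privacy from Dwork and Roth (2014) is sketched below. The paper's own Definition 1 is not reproduced in this excerpt and may use different notation; the symbols $\mathcal{M}$, $D$, $D'$, and $S$ are chosen here for illustration.

```latex
A randomized mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential
privacy if, for all pairs of neighboring datasets $D$ and $D'$ that differ in a
single record, and for all measurable sets of outputs $S$,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
\]
```

Smaller $\varepsilon$ and $\delta$ force the output distributions on $D$ and $D'$ closer together, which is exactly the insensitivity to any single record that conflicts with measuring per-record influence.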
