On Differentially Private U Statistics
Kamalika Chaudhuri, Po-Ling Loh, Shourya Pandey, Purnamrita Sarkar
TL;DR
This work studies how to privately estimate $\theta = \mathbb{E}[h(X_1,\dots,X_k)]$ from i.i.d. data using U-statistics under central differential privacy. It shows that off-the-shelf private mean-estimation methods can give suboptimal private error, and it introduces a thresholding approach that uses local Hájek projections to reweight different subsets of the data. The resulting estimator achieves nearly optimal private error for non-degenerate sub-Gaussian kernels, supported by matching lower bounds, and the authors give evidence of near-optimality for degenerate kernels. A subsampling variant reduces the runtime to $O(n^2)$ without sacrificing the privacy guarantees. Applications include private hypothesis testing (e.g., uniformity testing) and sparse-graph statistics (e.g., triangle densities), settings where U-statistics arise naturally.
Abstract
We consider the problem of privately estimating a parameter $\mathbb{E}[h(X_1,\dots,X_k)]$, where $X_1, X_2, \dots, X_k$ are i.i.d. data from some distribution and $h$ is a permutation-invariant function. Without privacy constraints, standard estimators are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and can be shown to be minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even $\Theta(1/n)$ rather than $O(1/n^2)$ in degenerate settings. To remedy this, we propose a new thresholding-based approach using local Hájek projections to reweight different subsets of the data. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
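For concreteness, the non-private U-statistic mentioned above averages the kernel over all size-$k$ subsets of the $n$ samples: $U_n = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \dots < i_k \le n} h(X_{i_1},\dots,X_{i_k})$. The sketch below is only an illustration of that standard definition and is not the private estimator proposed in the paper. The function names `u_statistic` and `variance_kernel` and the sample data are hypothetical choices made for this example. With the kernel $h(x,y) = (x-y)^2/2$ and $k = 2$, the U-statistic coincides with the unbiased sample variance, which gives a simple check.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(data, kernel, k):
    """Average a permutation-invariant kernel over all size-k subsets of data.

    Hypothetical helper for illustration; it enumerates all C(n, k) subsets,
    so it is only practical for small n and k.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(data)")
    total = sum(kernel(*subset) for subset in combinations(data, k))
    return total / comb(n, k)


def variance_kernel(x, y):
    # Symmetric kernel of order k = 2 with E[h(X_1, X_2)] = Var(X).
    return 0.5 * (x - y) ** 2


# Made-up sample used only to exercise the function.
sample = [2.0, 4.0, 4.0, 5.0, 7.0]
estimate = u_statistic(sample, variance_kernel, k=2)
reference = variance(sample)  # unbiased sample variance (n - 1 denominator)
print(estimate, reference)
```

The brute-force enumeration costs on the order of $\binom{n}{k}$ kernel evaluations, which is one reason cheaper variants, such as the subsampling approach the summary mentions, are of practical interest.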
