On Differentially Private U Statistics

Kamalika Chaudhuri; Po-Ling Loh; Shourya Pandey; Purnamrita Sarkar

TL;DR

This work studies private estimation of $\theta = \mathbb{E}[h(X_1,\dots,X_k)]$ from i.i.d. data using U-statistics under central differential privacy. It shows that off-the-shelf private mean-estimation methods can be suboptimal and introduces a thresholding approach based on local Hájek projections that reweights different subsets of the data. The resulting estimator achieves nearly optimal private error for non-degenerate sub-Gaussian kernels, supported by a matching lower bound, and there is strong evidence of near-optimality for degenerate kernels. A subsampling variant reduces the runtime to $O(n^2)$ without sacrificing privacy guarantees. Applications include private hypothesis testing and sparse-graph statistics (e.g., uniformity testing, triangle densities), settings where U-statistics arise naturally. Overall, the paper combines a Hájek-projection-based reweighting scheme with robust private mean estimation machinery to obtain near-optimal privacy-utility trade-offs.

Abstract

We consider the problem of privately estimating a parameter $\mathbb{E}[h(X_1,\dots,X_k)]$, where $X_1, X_2, \dots, X_k$ are i.i.d. data from some distribution and $h$ is a permutation-invariant function. Without privacy constraints, standard estimators are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and can be shown to be minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even $\Theta(1/n)$ rather than $O(1/n^2)$ in degenerate settings. To remedy this, we propose a new thresholding-based approach using local Hájek projections to reweight different subsets of the data. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
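To make the objects in the abstract concrete, here is a minimal non-private sketch in Python. It computes a U-statistic by averaging a kernel over all size-$k$ subsets, and it computes one per-index average that we take as an illustration of a local Hájek projection: the mean of the kernel over the subsets containing a given index. That definition is our assumption, since this page does not spell it out, and the function names `u_statistic`, `local_projections`, and `variance_kernel` are hypothetical. No noise is added, so nothing here is differentially private.

```python
import statistics
from itertools import combinations
from math import comb


def u_statistic(data, kernel, k):
    """Average of a symmetric kernel over all size-k subsets of the data."""
    n = len(data)
    total = sum(
        kernel(*(data[i] for i in idx)) for idx in combinations(range(n), k)
    )
    return total / comb(n, k)


def local_projections(data, kernel, k):
    """For each index i, average the kernel over the size-k subsets containing i.

    Assumed illustration of a local Hajek projection, not the paper's code.
    """
    n = len(data)
    sums = [0.0] * n
    for idx in combinations(range(n), k):
        value = kernel(*(data[i] for i in idx))
        for i in idx:
            sums[i] += value
    subsets_per_index = comb(n - 1, k - 1)
    return [s / subsets_per_index for s in sums]


def variance_kernel(x, y):
    """Kernel h(x, y) = (x - y)^2 / 2, whose U-statistic is the sample variance."""
    return 0.5 * (x - y) ** 2


sample = [2.0, 4.0, 4.0, 5.0, 7.0]
u_n = u_statistic(sample, variance_kernel, k=2)
projections = local_projections(sample, variance_kernel, k=2)

print(u_n, statistics.variance(sample))
print(projections)
```

With this definition, the average of the per-index values equals the U-statistic itself, because each subset is counted once for each of its $k$ members. Enumerating all subsets costs $\binom{n}{k}$ kernel evaluations, which is why a cheaper subsampled variant is attractive.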

Paper Structure (38 sections, 37 theorems, 181 equations, 2 figures, 1 table, 7 algorithms)

Introduction
Background and problem setup
U-Statistics
Differential privacy
Private mean estimation
Lower bounds and application of off-the-shelf tools
Adaptations of the CoinPress algorithm for private estimation
Setting.
Lower bound for non-degenerate kernels
Main results
Key intuition
Proposed algorithm
Application to different types of kernels
Lower bound
Subsampling estimator
...and 23 more sections

Key Result

Lemma 1

Let $\mathcal{A}_i : \mathcal{X}^n \times \prod_{j=1}^{i-1} \mathcal{Y}_j \to \mathcal{Y}_i$ for $i \in [k]$ be $k$ randomized algorithms such that for any $i \in [k]$ and any $(y_1, y_2, \dots, y_{i-1}) \in \prod_{j=1}^{i-1} \mathcal{Y}_j$, the algorithm $\mathcal{A}_i(\cdot, y_1, y_2, \dots, y_{i-1})$ is $\epsilon_i$-differentially private. Then the algorithm that maps a dataset $D \in \mathcal{X}^n$ to $(y_1, y_2, \dots, y_k)$, where $y_i = \mathcal{A}_i(D, y_1, \dots, y_{i-1})$ for all $i \in [k]$, is $\sum_{i=1}^k \epsilon_i$-differentially private.
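As an illustration of basic composition, the following Python sketch releases two statistics in sequence, where the second query depends on the first noisy output, and reports the summed privacy budget. It uses the standard Laplace mechanism on clipped data. The function names, the choice of statistics, and the clipping bounds are our assumptions for illustration and are not taken from the paper.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a real-valued query."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def release_mean_then_spread(data, lower, upper, eps_1, eps_2, rng):
    """Release a noisy mean, then a noisy mean absolute deviation around it.

    Hypothetical example: each step is eps_i-DP given the earlier output,
    so by basic composition the pair is (eps_1 + eps_2)-DP.
    """
    n = len(data)
    clipped = [min(max(x, lower), upper) for x in data]
    # Replacing one record moves either average by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n

    y1 = laplace_mechanism(sum(clipped) / n, sensitivity, eps_1, rng)

    # Post-processing of y1: clamp it so that |x - center| stays in [0, upper - lower].
    center = min(max(y1, lower), upper)
    spread = sum(abs(x - center) for x in clipped) / n
    y2 = laplace_mechanism(spread, sensitivity, eps_2, rng)

    return (y1, y2), eps_1 + eps_2


rng = random.Random(0)
outputs, total_epsilon = release_mean_then_spread(
    [0.2, 0.4, 0.9, 0.5], lower=0.0, upper=1.0, eps_1=0.5, eps_2=0.25, rng=rng
)
print(outputs, total_epsilon)
```

The second query is adaptive because its center is computed from `y1`, which is the situation the lemma covers: the guarantee for each step holds for every fixed value of the earlier outputs, and the budgets add.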

Figures (2)

Figure 1: Conditional probability of a triangle given that two vertices are at distance $r_n$.
Figure A.1: Weighting scheme in Eq. \ref{eq:weight}

Theorems & Definitions (67)

Lemma 1: Basic composition
Lemma 2: Parallel composition
Lemma 3: Global sensitivity mechanism (Dwork et al., 2006)
Lemma 4: Smoothed sensitivity mechanism (Nissim et al., 2007)
Definition 1
Proposition 1
Remark 1
Definition 2: All-tuples family
Definition 3: Subsampled family
Proposition 2
...and 57 more
