Private Mean Estimation with Person-Level Differential Privacy
Sushant Agarwal, Gautam Kamath, Mahbod Majid, Argyris Mouzakis, Rose Silver, Jonathan Ullman
TL;DR
This paper studies private mean estimation under person-level differential privacy, where each person holds multiple samples and all of a person's samples may change at once. It builds on the clip-and-noise framework, extended to high dimensions via Threaded Clip-and-Noise, and gives nearly tight upper and lower bounds on the number of people needed to estimate the mean within distance α under DP, as a function of the dimension d, the number of samples per person m, the privacy parameter ε, and the moment bound k. The core contributions are nearly matching univariate and multivariate bounds, a new high-dimensional tail bound for averages of bounded-moment random vectors, and a coarse-to-fine strategy that uses private histograms and iterative refinement to reach near-optimal sample complexity under approximate DP; the corresponding pure-DP algorithms are computationally inefficient. The results are relevant to federated and privacy-preserving data analysis in which individuals contribute multiple data points, since they show how the privacy budget and the tail behavior of the data jointly determine how many people are required. The lower bounds hold even for approximate DP, so the efficient approximate-DP procedures for heavy-tailed data are close to the best possible.
Abstract
We study person-level differentially private (DP) mean estimation in the case where each person holds multiple samples. DP here requires the usual notion of distributional stability when $\textit{all}$ of a person's datapoints can be modified. Informally, if $n$ people each have $m$ samples from an unknown $d$-dimensional distribution with bounded $k$-th moments, we show that \[n = \tilde{\Theta}\left(\frac{d}{\alpha^2 m} + \frac{d}{\alpha m^{1/2} \varepsilon} + \frac{d}{\alpha^{k/(k-1)} m \varepsilon} + \frac{d}{\varepsilon}\right)\] people are necessary and sufficient to estimate the mean up to distance $\alpha$ in $\ell_2$-norm under $\varepsilon$-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate DP and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the standard clip-and-noise framework, but the analysis for our setting requires both new algorithmic techniques and new analyses. In particular, our new bounds on the tails of sums of independent, vector-valued, bounded-moments random variables may be of interest.
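To make the clip-and-noise idea concrete, here is a minimal univariate sketch in Python. It is an illustration under stated assumptions, not the paper's estimator: the function name `person_level_clip_and_noise` and the parameters `center` and `radius` are hypothetical, and the clipping interval is assumed to be chosen independently of the data (the coarse localization step that would find such an interval privately is omitted). Each person's $m$ samples are averaged, the per-person average is clipped to the interval, and Laplace noise calibrated to the effect of replacing one person's entire row is added.

```python
import numpy as np


def person_level_clip_and_noise(data, center, radius, epsilon, rng=None):
    """Illustrative epsilon-DP mean estimate at the person level.

    data:    array of shape (n, m), one row of m samples per person.
    center:  data-independent guess for the mean (assumption).
    radius:  data-independent clipping half-width (assumption).
    epsilon: privacy parameter, must be positive.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n = data.shape[0]

    # Reduce each person to a single number: the average of their samples.
    person_means = data.mean(axis=1)

    # Clip every per-person average to [center - radius, center + radius].
    clipped = np.clip(person_means, center - radius, center + radius)

    # Replacing one person's whole row moves one clipped value by at most
    # 2 * radius, so the average over n people moves by at most 2 * radius / n.
    sensitivity = 2.0 * radius / n

    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise


# Usage example with synthetic data (values chosen only for illustration).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.3, scale=1.0, size=(2000, 10))  # n=2000, m=10
estimate = person_level_clip_and_noise(
    samples, center=0.0, radius=2.0, epsilon=1.0, rng=rng
)
```

Averaging within a person before clipping is what lets a larger $m$ help: the per-person averages concentrate more tightly around the mean, so a smaller `radius` (and hence less noise) suffices. How small the radius can be under only a bounded $k$-th moment assumption is exactly the kind of question the tail bounds in the abstract address.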
