
PLAN: Variance-Aware Private Mean Estimation

Martin Aumüller, Christian Janos Lebeda, Boel Nelson, Rasmus Pagh

TL;DR

PLAN (Private Limit Adapted Noise) is a family of algorithms for differentially private mean estimation in high dimensions that exploits structure in the coordinate-wise variances. It shapes the noise by scaling each coordinate by $\hat{\boldsymbol{\sigma}}^{-2/(p+2)}$ (that is, $\hat{\boldsymbol{\sigma}}^{-1/2}$ for $\ell_2$ error), privately estimates a clipping threshold through PrivQuantile, and combines clipping with Gaussian noise to achieve $\ell_2$ error that scales with $\|\boldsymbol{\sigma}\|_1$ under $\boldsymbol{\sigma}$-well concentrated distributions. The analysis decomposes the error into the bias from private centering, the clipping error, and the noise, yielding bounds such as $\mathbb{E}[\|\tilde{\boldsymbol{\mu}}-\boldsymbol{\mu}\|_2] = \tilde{O}(1 + \|\boldsymbol{\sigma}\|_2/\sqrt{n} + \|\boldsymbol{\sigma}\|_1/(n\sqrt{\rho}))$ and a general $\ell_p$ analogue, while remaining competitive even without precise variance estimates. Empirically, PLAN improves over state-of-the-art methods in skewed-variance regimes on both synthetic and real datasets, and remains robust in less skewed scenarios. The work provides a practical, data-aware alternative to worst-case private mean estimation, with guidance for variance estimation, clipping budgets, and practical parameter choices.
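
The scale, clip, add noise, and rescale pipeline described above can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions, not the paper's algorithm: the function name `shaped_private_mean` and all parameter names are hypothetical, the variance estimate `sigma_hat`, the `center`, and the `clip_radius` are taken as given (PLAN spends privacy budget to obtain them, for example through PrivQuantile), and only the $\ell_2$ case is shown, where each coordinate is scaled by $\hat{\sigma}_i^{-1/2}$ so that the rescaled noise has $\ell_2$ norm proportional to $\|\hat{\boldsymbol{\sigma}}\|_1$.

```python
import math
import random


def shaped_private_mean(data, sigma_hat, center, clip_radius, rho, rng):
    """Simplified variance-aware noisy mean (illustrative sketch only).

    Assumes sigma_hat, center and clip_radius are public or were computed
    privately beforehand; the privacy guarantee covers only the noisy mean.
    """
    n, d = len(data), len(center)
    # l2 case of the shaping step: scale coordinate i by sigma_hat[i]^(-1/2).
    scale = [s ** -0.5 for s in sigma_hat]
    total = [0.0] * d
    for x in data:
        # Recenter and scale the point.
        y = [(x[i] - center[i]) * scale[i] for i in range(d)]
        # Clip to the l2 ball of radius clip_radius.
        norm = math.sqrt(sum(v * v for v in y))
        factor = min(1.0, clip_radius / norm) if norm > 0 else 1.0
        for i in range(d):
            total[i] += y[i] * factor
    # Replacing one record moves the clipped mean by at most 2 * clip_radius / n.
    sensitivity = 2.0 * clip_radius / n
    # Gaussian mechanism: this standard deviation gives rho-zCDP.
    noise_std = sensitivity / math.sqrt(2.0 * rho)
    estimate = []
    for i in range(d):
        noisy = total[i] / n + rng.gauss(0.0, noise_std)
        # Undo the scaling and recentering.
        estimate.append(noisy / scale[i] + center[i])
    return estimate


rng = random.Random(0)
sigma = [4.0, 0.25, 0.25, 0.25]  # skewed standard deviations (made-up values)
mu = [1.0, -2.0, 0.5, 3.0]       # true mean (made-up values)
data = [[rng.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(5000)]
estimate = shaped_private_mean(
    data, sigma, center=[1.0, -2.0, 0.0, 3.0], clip_radius=12.0, rho=0.5, rng=rng
)
```

After rescaling, coordinate $i$ receives noise with standard deviation proportional to $\hat{\sigma}_i^{1/2}$, so high-variance coordinates absorb more noise and low-variance coordinates less.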

Abstract

Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could potentially be exploited to yield better algorithms. In this paper we present Private Limit Adapted Noise (PLAN), a family of differentially private algorithms for mean estimation in the setting where inputs are independently sampled from a distribution $\mathcal{D}$ over $\mathbf{R}^d$, with coordinate-wise standard deviations $\boldsymbol{\sigma} \in \mathbf{R}^d$. Similar to mean estimation under Mahalanobis distance, PLAN tailors the shape of the noise to the shape of the data, but unlike previous algorithms the privacy budget is spent non-uniformly over the coordinates. Under a concentration assumption on $\mathcal{D}$, we show how to exploit skew in the vector $\boldsymbol{\sigma}$, obtaining a (zero-concentrated) differentially private mean estimate with $\ell_2$ error proportional to $\|\boldsymbol{\sigma}\|_1$. Previous work has either not taken $\boldsymbol{\sigma}$ into account, or measured error in Mahalanobis distance, in both cases resulting in $\ell_2$ error proportional to $\sqrt{d}\|\boldsymbol{\sigma}\|_2$, which can be up to a factor $\sqrt{d}$ larger. To verify the effectiveness of PLAN, we empirically evaluate accuracy on both synthetic and real-world data.
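
To make the gap between $\|\boldsymbol{\sigma}\|_1$ and $\sqrt{d}\|\boldsymbol{\sigma}\|_2$ concrete, the following sketch compares the two quantities for two made-up vectors of standard deviations. The helper names and the example vectors are illustrative assumptions, and the second vector is an idealized extreme in which all but one coordinate have zero spread.

```python
import math


def l1_norm(v):
    return sum(abs(x) for x in v)


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def gap(sigma):
    """Ratio sqrt(d) * ||sigma||_2 / ||sigma||_1, always between 1 and sqrt(d)."""
    d = len(sigma)
    return math.sqrt(d) * l2_norm(sigma) / l1_norm(sigma)


d = 64
flat = [1.0] * d                  # no skew: every coordinate has the same spread
skewed = [1.0] + [0.0] * (d - 1)  # extreme skew: one coordinate dominates

gap_flat = gap(flat)      # the two quantities coincide
gap_skewed = gap(skewed)  # the ratio reaches sqrt(d)
```

With no skew the two error scales agree, so nothing is gained; with one dominant coordinate the ratio is $\sqrt{d}$, which is the largest improvement the abstract's comparison allows.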

Paper Structure (54 sections, 23 theorems, 30 equations, 10 figures, 2 tables, 2 algorithms)


Key Result

Theorem 1.1

(simplified version) Suppose $\mathcal{D}$ is $\boldsymbol{\sigma}$-well concentrated and that we know $\hat{\boldsymbol{\sigma}}$ such that $\|\boldsymbol{\sigma} - \hat{\boldsymbol{\sigma}}\|_\infty < \|\boldsymbol{\sigma}\|_1/d$. Then for $n = \tilde{\Omega}\left(\max\left(\sqrt{d/\rho},\ \rho^{-1\cdots}\,\cdots\right)\right)$ […], where $\tilde{O}$ suppresses polylogarithmic factors in $d$, $n$, and a bound on the $\ell_\infty$ […]

Figures (10)

  • Figure 1: Step-by-step illustration of PLAN (\ref{alg:our-algorithm}): (a) raw data, with statistical mean (yellow star); (b) recentering (blue) and scaling (orange), corresponding to \ref{alg:recenter-and-scale}; (c) clipping, as determined by \ref{alg:estimate:clipping}; (d) private mean (green cross).
  • Figure 2: Histogram for Kosarak (left) and POS (right). The orange line is the smallest allowed variance according to \ref{lem:binary:concentrated}, which we clip to.
  • Figure 3: $\ell_2$ error for synthetic Gaussian data when varying (a) dimensions with data without a skew, (b) skewness of the variances, and (c) dimensions for skewed data. Note that we compute error relative to the empirical mean rather than the statistical mean in this experiment, as sampling error dominates in this setting. Also note the different scales on the y-axis.
  • Figure 4: $\ell_2$ error for Gaussian A, Gaussian B, and Gaussian C with $\rho=0.01$, which implies $(\varepsilon, \delta)$-approximate DP with $\varepsilon < 1$ for $\delta \approx 10^{-6}$. To tolerate the rank error, we reuse the parameters from the original experiments and set $n = 100\,000$, and $d=64$ for (b).
  • Figure 5: (a) Synthetic binary data, varying the ratio of 0s to 1s; (b) Kosarak dataset; (c) POS dataset.
  • ...and 5 more figures
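
The privacy level quoted in the Figure 4 caption can be checked numerically. The sketch below assumes the standard conversion from Bun and Steinke, under which $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP; the function name `zcdp_to_epsilon` is a hypothetical helper, and the paper's own conversion lemma may be stated in a different but equivalent form.

```python
import math


def zcdp_to_epsilon(rho, delta):
    """Epsilon implied by rho-zCDP at a given delta (standard conversion)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Parameters from the Figure 4 caption.
eps = zcdp_to_epsilon(rho=0.01, delta=1e-6)
```

For $\rho = 0.01$ and $\delta = 10^{-6}$ this gives $\varepsilon$ of roughly $0.75$, consistent with the caption's claim that $\varepsilon < 1$.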

Theorems & Definitions (30)

  • Theorem 1.1
  • Definition 2.1 (dwork_calibrating_2006): $(\varepsilon, \delta)$-Differential Privacy
  • Definition 2.2 (bun_concentrated_2016): zero-Concentrated Differential Privacy (zCDP)
  • Lemma 2.3 (bun_concentrated_2016): zCDP to $(\varepsilon, \delta)$-DP conversion
  • Lemma 2.4 (bun_concentrated_2016): Composition
  • Lemma 2.5 (bun_concentrated_2016): The Gaussian Mechanism
  • Definition 2.6: $\ell_p$ error
  • Definition 2.7: Mahalanobis distance
  • Lemma 2.8: Follows from huang_instance-optimal_2021
  • Lemma 2.9
  • ...and 20 more
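
The two error measures named in the list above ($\ell_p$ error and Mahalanobis distance) weight mistakes differently, which a small sketch can show. It assumes the usual definitions, $\|\tilde{\boldsymbol{\mu}} - \boldsymbol{\mu}\|_p$ and $\|\Sigma^{-1/2}(\tilde{\boldsymbol{\mu}} - \boldsymbol{\mu})\|_2$, and further assumes a diagonal covariance with entries $\sigma_i^2$; the function names and the example values are hypothetical, and the paper's exact statements are not reproduced here.

```python
import math


def lp_error(estimate, mean, p=2.0):
    """l_p norm of the difference between an estimate and the true mean."""
    return sum(abs(e - m) ** p for e, m in zip(estimate, mean)) ** (1.0 / p)


def mahalanobis_error_diag(estimate, mean, sigma):
    """Mahalanobis distance, assuming a diagonal covariance diag(sigma_i^2)."""
    return math.sqrt(
        sum(((e - m) / s) ** 2 for e, m, s in zip(estimate, mean, sigma))
    )


mean = [0.0, 0.0]
sigma = [10.0, 0.1]  # one high-variance and one low-variance coordinate
est_a = [1.0, 0.0]   # off by 1 in the high-variance coordinate
est_b = [0.0, 1.0]   # off by 1 in the low-variance coordinate

l2_a, l2_b = lp_error(est_a, mean), lp_error(est_b, mean)
maha_a = mahalanobis_error_diag(est_a, mean, sigma)
maha_b = mahalanobis_error_diag(est_b, mean, sigma)
```

Both estimates have the same $\ell_2$ error, while under Mahalanobis distance the error in the low-variance coordinate counts a hundred times more than the same error in the high-variance coordinate.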