Private Means and the Curious Incident of the Free Lunch

Jack Fitzsimons, James Honaker, Michael Shoemate, Vikrant Singhal

TL;DR

The paper tackles private mean estimation under differential privacy when dataset size is unknown. It introduces the simplex augmentation transformation, which maps each datum to a two-dimensional point on a simplex, allowing the simultaneous private release of two sums with a shared privacy budget. From these, a private count (dataset size) can be recovered for free through post-processing, and the approach extends to weighted means; additional budget can further refine the count via inverse-variance weighting. Empirical results show the simplex method consistently achieves lower variance than standard DP approaches (plugin, centered mean, resize) across multiple distributions and privacy settings, highlighting a practical advance for DP deployments. Overall, the method leverages already-budgeted sensitivity to extract extra information without increasing privacy loss, improving the accuracy of private mean estimates in the unknown-size regime.

Abstract

We show that the most well-known and fundamental building blocks of DP implementations -- sum, mean, count (and many other linear queries) -- can be released with substantially reduced noise for the same privacy guarantee. We achieve this by projecting individual data with worst-case sensitivity $R$ onto a simplex where all data now has a constant norm $R$. In this simplex, additional "free" queries can be run that are already covered by the privacy-loss of the original budgeted query, and which algebraically give additional estimates of counts or sums.
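The idea in the abstract can be illustrated with a minimal sketch. It assumes scalar data clamped to $[0, R]$, the mapping $x \mapsto (x, R - x)$ (whose coordinates always sum to $R$), add/remove-one neighbouring datasets, and Gaussian noise calibrated for $\rho$-zCDP with $\sigma = R/\sqrt{2\rho}$. The function names `simplex_augment` and `private_mean_simplex` are hypothetical and are not taken from the paper, and the exact form of the paper's transformation may differ.

```python
import math
import random


def simplex_augment(x, R):
    """Clamp x to [0, R] and map it to the 2-D point (x, R - x).

    Assumed form of the augmentation: the two coordinates always sum to R.
    """
    x = min(max(x, 0.0), R)
    return (x, R - x)


def private_mean_simplex(data, R, rho, rng=None):
    """Release two noisy sums and derive a mean and a count from them.

    Illustrative only: floating-point Gaussian sampling is not a hardened
    DP implementation.
    """
    rng = rng or random.Random()
    # For x in [0, R], sqrt(x^2 + (R - x)^2) <= R, so adding or removing one
    # record changes the pair of sums by at most R in L2 norm.
    sigma = R / math.sqrt(2.0 * rho)
    points = [simplex_augment(x, R) for x in data]
    s1 = sum(p[0] for p in points) + rng.gauss(0.0, sigma)
    s2 = sum(p[1] for p in points) + rng.gauss(0.0, sigma)
    # Post-processing: without noise, s1 + s2 = n * R exactly.
    count_est = (s1 + s2) / R
    mean_est = R * s1 / (s1 + s2)
    return mean_est, count_est


if __name__ == "__main__":
    data = [random.uniform(0.0, 100.0) for _ in range(100)]
    print(private_mean_simplex(data, R=100.0, rho=0.5))
```

Under these assumptions the count estimate costs no extra budget, because it is computed only from the two sums that were already released.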

Paper Structure

This paper contains 16 sections, 3 theorems, 15 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

If $M : \mathcal{X}^n \to \mathcal{Y}$ is $(\varepsilon,\delta)$-DP (or $\rho$-zCDP) and $P : \mathcal{Y} \to \mathcal{Z}$ is any randomized function, then the algorithm $P \circ M$ is $(\varepsilon,\delta)$-DP (or $\rho$-zCDP).
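Lemma 1 is what makes derived quantities free: any function of already-released outputs costs no further privacy budget. As one illustration, the inverse-variance weighting mentioned in the TL;DR can be sketched as follows. This assumes the estimates carry independent noise with known variances; the helper name `inverse_variance_combine` is hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent unbiased estimates by inverse-variance weighting.

    Returns the combined estimate and its variance. Operating only on
    released values, this is post-processing in the sense of Lemma 1.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total


if __name__ == "__main__":
    # Hypothetical numbers: a count recovered from released sums and a
    # separately budgeted noisy count, with their noise variances.
    print(inverse_variance_combine([101.0, 99.0], [2.0, 1.0]))
```

The combined variance is never larger than the smallest input variance, so spending extra budget on a direct count can only tighten the estimate.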

Figures (2)

  • Figure 2: Simulated noise distributions for different mean-release algorithms under the same privacy-loss guarantee. Our simplex method has the uniformly lowest variance across data distributions. Shown here is the reduced variance in released means of 100 uniformly random data points in $[0, 100]$ using $\rho = 0.5$. See the empirics appendix for more details and further analysis.
  • Figure 3: Estimator performance across three distributions (displayed in descending order): Log Normal (location 0, scale 0), Normal (mean 0, variance 1), and Uniform (in range 0, 100), each with 100 randomly generated data points. Empirical probability density functions and complementary cumulative distribution functions are displayed for the Gaussian mechanism ($\rho=0.5$) and the Laplace mechanism ($\varepsilon=0.5$). The simplex estimator is the most accurate in every case.

Theorems & Definitions (7)

  • Definition 1: Differential Privacy (DP) [DworkMNS06]
  • Definition 2: Concentrated Differential Privacy (zCDP) [BunS16]
  • Lemma 1: Post-Processing [DworkMNS06, BunS16]
  • Definition 3: $\ell_2$-Sensitivity
  • Lemma 2: Gaussian Mechanism
  • Theorem 3
  • proof