Noise-Calibrated Inference from Differentially Private Sufficient Statistics in Exponential Families

Amir Asiaee, Samhita Pal

TL;DR

A clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing.

Abstract

Many differentially private (DP) data release systems either output DP synthetic data and leave analysts to perform inference as usual, which can lead to severe miscalibration, or output a DP point estimate without a principled way to do uncertainty quantification. This paper develops a clean and tractable middle ground for exponential families: release only DP sufficient statistics, then perform noise-calibrated likelihood-based inference and optional parametric synthetic data generation as post-processing. Our contributions are: (1) a general recipe for approximate-DP release of clipped sufficient statistics under the Gaussian mechanism; (2) asymptotic normality, explicit variance inflation, and valid Wald-style confidence intervals for the plug-in DP MLE; (3) a noise-aware likelihood correction that is first-order equivalent to the plug-in but supports bootstrap-based intervals; and (4) a matching minimax lower bound showing the privacy distortion rate is unavoidable. The resulting theory yields concrete design rules and a practical pipeline for releasing DP synthetic data with principled uncertainty quantification, validated on three exponential families and real census data.
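The release-then-calibrate idea can be sketched for the simplest case, a scalar Gaussian mean with known unit variance. This is a minimal illustration under stated assumptions, not the paper's algorithm: the noise scale uses the classical Gaussian-mechanism calibration (valid for 0 < ε < 1), the sensitivity 2B/n assumes replace-one adjacency, and the helper names `gaussian_sigma`, `release_noisy_mean`, and `wald_interval` are hypothetical.

```python
import math
import random

def gaussian_sigma(sensitivity, eps, delta):
    # Classical Gaussian-mechanism calibration (assumed here; requires 0 < eps < 1).
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def release_noisy_mean(x, B, eps, delta, rng):
    # Clip each record to [-B, B], average, then add Gaussian noise.
    n = len(x)
    clipped = [min(max(v, -B), B) for v in x]
    sensitivity = 2.0 * B / n  # assumption: replace-one adjacency
    sigma = gaussian_sigma(sensitivity, eps, delta)
    noisy_mean = sum(clipped) / n + rng.gauss(0.0, sigma)
    return noisy_mean, sigma

def wald_interval(theta_tilde, n, sigma_dp, data_var=1.0, z=1.959963984540054):
    # The standard error adds the privacy-noise variance to the sampling variance.
    se = math.sqrt(data_var / n + sigma_dp ** 2)
    return theta_tilde - z * se, theta_tilde + z * se

rng = random.Random(0)
x = [rng.gauss(0.3, 1.0) for _ in range(2000)]
theta_tilde, sigma_dp = release_noisy_mean(x, B=4.0, eps=0.5, delta=1e-5, rng=rng)

# Noise-calibrated interval versus one that ignores the injected noise.
lo, hi = wald_interval(theta_tilde, n=len(x), sigma_dp=sigma_dp)
naive_lo, naive_hi = wald_interval(theta_tilde, n=len(x), sigma_dp=0.0)
print(f"noise-calibrated 95% interval: ({lo:.3f}, {hi:.3f})")
print(f"naive 95% interval:            ({naive_lo:.3f}, {naive_hi:.3f})")
```

The only difference between the two intervals is the `sigma_dp ** 2` term in the standard error; dropping it is what makes the naive interval too narrow at small ε. The sketch ignores clipping bias, which is negligible here because B is four standard deviations wide.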

Paper Structure

This paper contains 58 sections, 5 theorems, 18 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Under the boundedness assumption (assump:bounded), releasing $\widetilde{\bar{S}}=\bar{S}(D)+Z$ with $Z\sim \mathcal{N}(0,\sigma^2 I_d)$ and $\sigma$ as in Algorithm 1 is $(\varepsilon,\delta)$-DP. Moreover, any (randomized) post-processing of $\widetilde{\bar{S}}$ (including $\widetilde{\theta}$ and $D\ldots$
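The post-processing claim can be illustrated with a short sketch in which every downstream step reads only the released statistic. It assumes a scalar Gaussian model with sufficient statistic $T(x)=(x,x^2)$; the released values in `noisy_stat` are made-up numbers, and the helpers `estimate_from_noisy_stats` and `synthesize` are hypothetical names, not taken from the paper.

```python
import math
import random

def estimate_from_noisy_stats(m1, m2, var_floor=1e-6):
    # Map noisy first and second moments to (mean, variance).
    # The raw data never enter this function, so it is pure post-processing.
    mean = m1
    var = max(m2 - m1 * m1, var_floor)  # noise can push the raw estimate below zero
    return mean, var

def synthesize(mean, var, m, rng):
    # Parametric synthetic data drawn from the fitted Gaussian model.
    sd = math.sqrt(var)
    return [rng.gauss(mean, sd) for _ in range(m)]

rng = random.Random(1)
noisy_stat = (0.31, 1.12)  # hypothetical released noisy statistic (mean of x, mean of x^2)
mean, var = estimate_from_noisy_stats(*noisy_stat)
synthetic = synthesize(mean, var, m=500, rng=rng)
print(f"estimated mean={mean:.3f}, variance={var:.4f}, synthetic records={len(synthetic)}")
```

Because neither function touches the original records, the estimate and the synthetic sample carry the same privacy guarantee as the released statistic itself.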

Figures (9)

  • Figure 1: Pipeline overview. The noisy sufficient statistic $\widetilde{\bar{S}}$ is the only DP-protected release; all downstream tasks inherit the same $(\varepsilon,\delta)$-DP guarantee by post-processing.
  • Figure 2: Empirical versus theoretical variance for the DP plug-in estimator in Gaussian mean estimation. Each point corresponds to one $(n,\varepsilon)$ configuration, with color indicating $\varepsilon$ and marker shape indicating $n$. The close alignment with the identity line validates the finite-sample relevance of Theorem 2.
  • Figure 3: Coverage of 95% intervals across privacy levels for Gaussian, logistic, and Poisson models, each at two sample sizes. The shaded band marks an acceptable calibration range around nominal coverage. Noise-calibrated DP methods remain near nominal while naive synthetic analysis undercovers in the low-$\varepsilon$ regime.
  • Figure 4: Average confidence-interval length versus $\varepsilon$ for the same settings as Figure 3. Noise-aware methods are wider at strong privacy (small $\varepsilon$) and contract as $\varepsilon$ increases, reflecting the expected privacy-accuracy trade-off. Naive synthetic intervals remain narrow but are miscalibrated.
  • Figure 5: Logistic-regression clipping study comparing DP plug-in and DP noise-aware estimators. Left: average absolute bias versus clipping radius $B$. Right: empirical 95% coverage versus $B$. Both methods exhibit a U-shaped bias curve: too-small $B$ causes clipping bias while too-large $B$ increases noise through higher sensitivity. The noise-aware estimator provides no advantage over plug-in, consistent with Proposition 1.
  • ...and 4 more figures

Theorems & Definitions (8)

  • Theorem 1: Privacy of sufficient-statistic release
  • Theorem 2: Asymptotic distribution and variance inflation
  • Corollary 1: When does privacy preserve classical efficiency?
  • Proposition 1: First-order equivalence
  • Theorem 3: Unavoidable $\Omega(1/(n^2\varepsilon^2))$ MSE lower bound
  • proof
  • proof
  • proof
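
As a worked illustration of the variance inflation and the $1/(n^2\varepsilon^2)$ rate listed above, consider a scalar Gaussian mean with known variance $s^2$ and records clipped to $[-B,B]$. The calibration below assumes the classical Gaussian mechanism (valid for $0<\varepsilon<1$) and replace-one sensitivity $\Delta=2B/n$; these choices, and the symbol $s^2$, are assumptions made for this example rather than the paper's exact constants, and clipping bias is ignored.

```latex
\begin{align*}
\widetilde{\theta} &= \bar{S}(D) + Z, \qquad Z \sim \mathcal{N}(0,\sigma^2), \\
\sigma &= \frac{\Delta\sqrt{2\ln(1.25/\delta)}}{\varepsilon},
  \qquad \Delta = \frac{2B}{n}, \\
\operatorname{Var}(\widetilde{\theta})
  &\approx \underbrace{\frac{s^2}{n}}_{\text{sampling}}
   + \underbrace{\frac{8B^2\ln(1.25/\delta)}{n^2\varepsilon^2}}_{\text{privacy}} .
\end{align*}
```

In this example the privacy term is of order $1/(n^2\varepsilon^2)$, and its ratio to the sampling term, $8B^2\ln(1.25/\delta)/(s^2 n\varepsilon^2)$, vanishes as $n\varepsilon^2$ grows with $B$ and $\delta$ held fixed.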