Simulation-based Bayesian Inference from Privacy Protected Data

Yifei Xiong, Nianqiao Phyllis Ju, Sanguo Zhang

TL;DR

This work addresses the challenge of performing valid statistical inference when only differentially private outputs are available. It proposes three likelihood-free approaches: SMC-ABC, sequential private posterior estimation (SPPE), and sequential private likelihood estimation (SPLE). These are supplemented by neural density estimators (notably normalizing flows) and variance-reducing randomized quasi-Monte Carlo (RQMC) to learn from privatized data. The methods are demonstrated on a privatized SIR disease-spread model and on Bayesian linear regression, showing that SPPE and SPLE achieve accuracy comparable to SMC-ABC with substantially fewer simulations while correcting for the biases introduced by privacy mechanisms. Overall, the framework enables reliable inference and uncertainty quantification from privacy-protected data, offering a path toward privacy-preserving data sharing and analysis for complex models with intractable likelihoods.

Abstract

Many modern statistical analysis and machine learning applications require training models on sensitive user data. Under a formal definition of privacy protection, differentially private algorithms inject calibrated noise into the confidential data or during the data analysis process to produce privacy-protected datasets or queries. However, restricting access to only privatized data during statistical analysis makes it computationally challenging to make valid statistical inferences. In this work, we propose simulation-based inference methods from privacy-protected datasets. In addition to sequential Monte Carlo approximate Bayesian computation, we adopt neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and with ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms.
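The core idea of the abstract, simulating the privacy mechanism together with the data model so that inference accounts for the injected noise, can be illustrated with a deliberately simple sketch. The example below uses plain rejection ABC, which is a much simpler relative of SMC-ABC and is not the paper's algorithm. The Bernoulli model, the uniform prior, the released value `s_obs`, the tolerance, and all function names are assumptions made for illustration only.

```python
import random


def simulate_private_count(theta, n, epsilon, rng):
    """Simulate confidential data AND the privacy mechanism in one pass.

    Assumed toy model: n independent Bernoulli(theta) records; the released
    query is their count (sensitivity 1) plus Laplace noise of scale 1/epsilon.
    """
    count = sum(rng.random() < theta for _ in range(n))
    scale = 1.0 / epsilon
    # Difference of two Exponential(mean=scale) draws is Laplace(scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return count + noise


def abc_rejection(s_obs, n, epsilon, num_draws, tol, rng):
    """Keep prior draws whose simulated private output lands near s_obs."""
    accepted = []
    for _ in range(num_draws):
        theta = rng.random()  # assumed Uniform(0, 1) prior
        s_sim = simulate_private_count(theta, n, epsilon, rng)
        if abs(s_sim - s_obs) <= tol:
            accepted.append(theta)
    return accepted


rng = random.Random(1)
s_obs = 31.4  # hypothetical privatized count released by a data curator
posterior_draws = abc_rejection(
    s_obs, n=100, epsilon=1.0, num_draws=20000, tol=3.0, rng=rng
)
posterior_mean = sum(posterior_draws) / len(posterior_draws)
print(len(posterior_draws), round(posterior_mean, 3))
```

Because the noise is drawn inside the simulator, the accepted draws approximate the posterior given the private output rather than the posterior that would result from treating the noisy count as exact.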

Paper Structure (49 sections, 5 theorems, 43 equations, 15 figures, 4 tables, 3 algorithms)

Key Result

Proposition 3

For a real-valued query $s: \mathbb{X}^n \to \mathbb{S}$, adding zero-centered Laplace noise with scale parameter $\Delta_1(s) / \epsilon$, where $\Delta_1(s)$ is the global sensitivity of $s$, achieves $\epsilon$-DP.
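As a minimal sketch of the mechanism in Proposition 3, the snippet below adds Laplace noise with scale $\Delta_1(s) / \epsilon$ to a scalar query value. The function names, the counting query, and the values of `true_count`, sensitivity, and $\epsilon$ are illustrative assumptions, not taken from the paper.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-centered Laplace(scale) variate.

    The difference of two independent Exponential variates with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` plus Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)


# Illustrative use: a counting query, whose value changes by at most 1 when
# a single record changes, so its sensitivity is taken to be 1.
rng = random.Random(0)
true_count = 120
releases = [
    laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=rng)
    for _ in range(20000)
]
mean_release = sum(releases) / len(releases)
mean_abs_error = sum(abs(r - true_count) for r in releases) / len(releases)
print(round(mean_release, 2), round(mean_abs_error, 2))
```

The releases are unbiased for the true count, and their expected absolute error equals the noise scale, so a smaller $\epsilon$ (stronger privacy) yields proportionally noisier outputs.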

Figures (15)

  • Figure 1: Inference on SIR model. A. Convergence of sequential posterior estimations given DP-protected infection trajectory. Each round entails $N = 1000$ simulations. B. Approximation accuracy of SPPE (orange) and SPLE (red) against the number of rounds; the error bars show the mean with the upper and lower quartiles over 20 random trials.
  • Figure 2: Inference on real infectious disease outbreaks. A. Posterior distributions given the private infection curve for flu, Ebola [in a) Guinea, b) Liberia, and c) Sierra Leone], and COVID-19 in Clark County, Nevada. All experiments use a privacy level of $\epsilon=10$. B. Mean and 95% credible intervals for $R_0 = \beta/\gamma$ under each method for each dataset. Grey: SMC-ABC; orange: SPPE; red: SPLE. A non-DP baseline (black stars), obtained from the solution of the corresponding ordinary differential equations, is also shown.
  • Figure 3: Kolmogorov-Smirnov test statistics between approximations of posterior marginals, at $\epsilon=10$.
  • Figure 4: Detailed convergence of sequential posterior estimations given DP-protected infection trajectory under the SIR model. Each round entails $N = 1000$ simulations. Orange: SPPE; red: SPLE; grey: SMC-ABC.
  • Figure 5: Approximation accuracy by SMC-ABC on the SIR model against the number of simulations.
  • ...and 10 more figures

Theorems & Definitions (10)

  • Definition 1: $\epsilon$-DP
  • Definition 2: Global sensitivity (Nissim et al., 2007)
  • Proposition 3: Laplace mechanism (Dwork et al., 2006)
  • Proposition 4: DP infection trajectory
  • Lemma 5
  • proof
  • Lemma 6
  • proof
  • Proposition 4
  • proof
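
For orientation, the first three entries in this list can be connected by a short standard derivation. The statements below follow the usual textbook forms; the paper's exact notion of neighboring datasets and its notation may differ, and the symbols $M$, $B$, $f$, and $y$ are introduced here for illustration. A mechanism $M$ is $\epsilon$-DP, and the global sensitivity of a query $s$ is defined, as follows:

$$
\Pr\{M(x) \in B\} \le e^{\epsilon} \, \Pr\{M(x') \in B\} \quad \text{for all neighboring } x, x' \in \mathbb{X}^n \text{ and all measurable } B,
$$

$$
\Delta_1(s) = \sup_{x, x' \text{ neighboring}} \lVert s(x) - s(x') \rVert_1 .
$$

For the Laplace mechanism $M(x) = s(x) + \eta$ with $\eta \sim \mathrm{Laplace}(0, \Delta_1(s)/\epsilon)$ and a real-valued $s$, the output density at $y$ is $f(y \mid x) \propto \exp\{-\epsilon \, |y - s(x)| / \Delta_1(s)\}$, so by the triangle inequality

$$
\frac{f(y \mid x)}{f(y \mid x')} = \exp\left\{ \frac{\epsilon \, \big( |y - s(x')| - |y - s(x)| \big)}{\Delta_1(s)} \right\} \le \exp\left\{ \frac{\epsilon \, |s(x) - s(x')|}{\Delta_1(s)} \right\} \le e^{\epsilon} .
$$

Integrating this density bound over any measurable set $B$ gives the $\epsilon$-DP inequality.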