Simulation-based, Finite-sample Inference for Privatized Data

Jordan Awan, Zhanyu Wang

TL;DR

A simulation-based "repro sample" methodology is shown to apply to a wide variety of private inference problems, to account appropriately for biases introduced by privacy mechanisms, and to improve on other state-of-the-art inference methods, such as the parametric bootstrap, in terms of the coverage and type I error of the private inference.

Abstract

Privacy protection methods, such as differentially private mechanisms, introduce noise into the resulting statistics, which often produces complex and intractable sampling distributions. In this paper, we propose a simulation-based "repro sample" approach to produce statistically valid confidence intervals and hypothesis tests, which builds on the work of Xie and Wang (2022). We show that this methodology is applicable to a wide variety of private inference problems, appropriately accounts for biases introduced by privacy mechanisms (such as by clamping), and improves over other state-of-the-art inference methods, such as the parametric bootstrap, in terms of the coverage and type I error of the private inference. We also develop significant improvements and extensions for the repro sample methodology for general models (not necessarily related to privacy), including 1) modifying the procedure to guarantee coverage and type I error control, even accounting for Monte Carlo error, and 2) proposing efficient numerical algorithms to implement the confidence intervals and $p$-values.

Paper Structure

This paper contains 24 sections, 9 theorems, 26 equations, 10 figures, 9 tables, 7 algorithms.

Key Result

Lemma 3.0

Let $\alpha \in (0,1)$ be given. Let $\theta^*\in \Theta$ be the unknown parameter, $s$ be the observed sample where $s\sim F_{\theta^*}$, and $\omega \sim Q$ be a random variable which is independent of $s$. For any fixed $\theta$, let $B_\alpha(\theta;s,\omega)$ be an event which depends on $s$ and $\omega$ [...]. Then $\Gamma_\alpha(s,\omega)$ is a $(1-\alpha)$-confidence set for $\theta^*$. If $\theta = (\theta_1, \ldots, \theta_k)$ [...]
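
The lemma's recipe (decide, for each candidate $\theta$, whether an event $B_\alpha(\theta;s,\omega)$ holds, then collect the $\theta$ that pass) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the toy model in `simulate_stat`, the rank-based event in `event_B`, the grid, and all variable names are hypothetical choices made for this example.

```python
import numpy as np

def simulate_stat(theta, seeds):
    # Hypothetical toy model (an assumption, not from the paper):
    # s is the mean of n draws from N(theta, 1), written as a deterministic
    # function of theta and fixed standard-normal seeds of shape (R, n).
    return theta + seeds.mean(axis=1)

def event_B(theta, s_obs, seeds, alpha):
    # Indicator of a rank-based event B_alpha(theta; s, omega).
    s_rep = simulate_stat(theta, seeds)          # R simulated statistics
    pooled = np.append(s_rep, s_obs)             # R + 1 values, observed last
    dist = np.abs(pooled - np.median(pooled))    # permutation-invariant score
    rank = np.sum(dist >= dist[-1])              # points at least as extreme as s_obs
    return bool(rank > np.floor(alpha * len(pooled)))

rng = np.random.default_rng(1)
n, R, alpha, theta_star = 100, 200, 0.05, 1.0
seeds = rng.standard_normal((R, n))              # omega ~ Q, drawn once and reused
s_obs = theta_star + rng.standard_normal(n).mean()  # observed s ~ F_{theta*}

# Gamma_alpha(s, omega): all grid values of theta for which the event holds.
grid = np.linspace(0.0, 2.0, 401)
conf_set = np.array([t for t in grid if event_B(t, s_obs, seeds, alpha)])
print(conf_set.min(), conf_set.max())
```

In this sketch, when $\theta = \theta^*$ the observed statistic and the $R$ simulated ones are exchangeable, so the observed value lands among the $\lfloor \alpha (R+1) \rfloor$ most extreme points with probability at most $\alpha$. That is one simple way to obtain an event with probability at least $1-\alpha$ for a finite number of simulations.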

Figures (10)

  • Figure 1: An illustration of $B_\alpha(\theta;s,\omega)$ and $\Gamma_\alpha(s,\omega)$ in Lemma \ref{lem:confidence}. The left subfigure is the space of the parameter $\theta\in\mathbb{R}^2$, and the right subfigure is the space of $s\in\mathbb{R}^2$. The true parameter is $\theta^*$, and the observed sample is $s^*$; both are denoted by $\bigstar$. For each $\theta$, we can construct $B_\alpha(\theta;s,\omega)$ (e.g., by simulation as in Theorem \ref{thm:construction}: the points in the right subfigure correspond to $\{s_i\}_{i=1}^R$), and we obtain $\Gamma_\alpha(s,\omega)$ by collecting the $\theta$ whose corresponding set $\{s|I(B_\alpha(\theta;s,\omega)) = 1\}$ contains $s^*$. For example, in the left subfigure, the parameters denoted by the red $+$ and blue $\circ$ are in $\Gamma_\alpha(s,\omega)$ while the one denoted by the green $\times$ is not. Note that $\Gamma_\alpha(s,\omega)$ is a valid confidence set with level $1-\alpha$ since for every $\theta \in \Theta$, we have $s \sim F_\theta$ being in $\{s|I(B_\alpha(\theta;s,\omega)) = 1\}$ with probability $1-\alpha$, which means $P(\theta\in \Gamma_\alpha(s,\omega)) \geq 1-\alpha$. Furthermore, the confidence set $\Gamma_\alpha^{\theta_1}(s,\omega)$ is shown in the left subfigure, illustrating that it is the projection of $\Gamma_\alpha(s,\omega)$ onto the component $\theta_1$.
  • Figure 2: $95\%$ confidence set for location-scale normal (Example \ref{ex:normal1} and Section \ref{s:normal}), based on $s = (1, 0.75)$, generated using $n=100$, $\theta^*=(\mu^*,\sigma^*)=(1,1)$, $\epsilon=1$, $U=3$, $L=0$, and $R=200$ repro samples. From left to right: Mahalanobis depth (area 0.35), Halfspace/Tukey depth (area 0.61), Simplicial depth (area 0.63), Spatial depth (area 0.36).
  • Figure 3: (a) Illustration of Algorithm \ref{alg:CI}. (b) Illustration of Algorithm \ref{alg:ConfGrid}. See Example \ref{ex:illustration} for details.
  • Figure 4: For fixed seeds in data generation, and in simulation-based inference, confidence intervals for $x_1,\ldots, x_{n} \sim \mathrm{Pois}(10)$ based on $s = \frac{1}{n} \sum_{i=1}^n [x_i]_0^c+(c/(n\varepsilon))N$, where $n=100$, $R=1000$, $N\sim N(0,1)$ and $\varepsilon=1$. Left: $95\%$ confidence intervals as $c$ varies. Right: width of the $95\%$ confidence intervals as $c$ varies. For $c\leq 4$, the upper confidence limit is $\infty$.
  • Figure 5: The rejection probability for testing $H_0:\beta_1^*=0$ versus $H_1:\beta_1^*\neq 0$ in a linear regression model $Y=\beta_0^*+X\beta_1^* + \epsilon$ with the repro sample method and the parametric bootstrap (Alabi and Vadhan, 2022). The significance level is 0.05, and the values in the table are calculated from 1000 replicates.
  • ...and 5 more figures
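
The privatized statistic in the Figure 4 caption, $s = \frac{1}{n} \sum_{i=1}^n [x_i]_0^c+(c/(n\varepsilon))N$, can be written out directly. The following is a minimal sketch, assuming NumPy; the function name `privatized_clamped_mean`, the seed, and the particular values of $c$ are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def privatized_clamped_mean(x, c, eps, rng):
    # s = (1/n) * sum_i [x_i]_0^c + (c / (n * eps)) * N, with N ~ N(0, 1).
    n = len(x)
    clamped = np.clip(x, 0.0, c)                  # [x_i]_0^c: clamp each value to [0, c]
    noise = (c / (n * eps)) * rng.standard_normal()
    return clamped.mean() + noise

rng = np.random.default_rng(0)                    # illustrative seed (assumption)
x = rng.poisson(10, size=100)                     # x_1, ..., x_n ~ Pois(10), n = 100
for c in (5, 10, 20, 40):                         # illustrative clamping bounds
    s = privatized_clamped_mean(x, c, eps=1.0, rng=rng)
    print(c, round(float(s), 3))
```

Two effects pull in opposite directions as $c$ varies: a small $c$ truncates many observations and pulls the clamped mean below the sample mean, while the noise scale $c/(n\varepsilon)$ grows linearly in $c$.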

Theorems & Definitions (37)

  • Example 2.1: Bernoulli example
  • Definition 2.2: Differential privacy (Dwork et al., 2006)
  • Definition 2.3: Gaussian DP (Dong et al., 2022)
  • Example 2.4: Additive noise mechanism
  • Lemma 3.0
  • Remark 3.1
  • Example 3.2
  • Example 3.3: Bernoulli distribution (Awan and Slavković, 2018)
  • Remark 3.4
  • Theorem 3.5
  • ...and 27 more