Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets

Leila Nombo; Anne-Sophie Charest

TL;DR

This work tackles the challenge of statistical inference from multiple differentially private synthetic datasets (DIPS) by adapting combining rules originally developed for missing-data scenarios. It formalizes estimators such as $\hat{Q}=\bar{q}_m$ and variance $T_p=\frac{b_m}{m}+\bar{u}_m$ to quantify both sampling and synthesis variability across $m$ synthetic copies, and it compares alternative variance estimators $T_s$ and $T_{s(PPD)}$ under various DP syntheses. Through extensive simulations using six DP mechanisms (DataSynthesizer, COPULA-SHIRLEY, DPGAN, DP-CTGAN, PATE-GAN, PATE-CTGAN) and a range of privacy levels $\varepsilon$, the study finds that combining rules can yield accurate inference in some settings (notably with COPULA-SHIRLEY, DPGAN, and sometimes PATE-GAN) but not universally, with DataSynthesizer and DP-CTGAN often underperforming. The results provide practical guidance on when combining rules are reliable, emphasize the privacy-utility tradeoffs inherent in generating multiple DP synthetic datasets, and call for further work to understand method-specific conditions under which $T_p$ yields valid variance estimates and coverage.
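The combining rules named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that $q_i$ is the point estimate and $u_i$ its estimated variance computed on the $i$-th synthetic dataset, that $\bar{q}_m$ and $\bar{u}_m$ are their averages over the $m$ datasets, and that $b_m$ is the sample variance of the $q_i$ (divisor $m-1$), as in the standard combining rules for synthetic data. The function name `combine_estimates` is hypothetical.

```python
import numpy as np


def combine_estimates(q, u):
    """Combine per-dataset estimates from m synthetic datasets.

    q : point estimates q_1..q_m, one per synthetic dataset
    u : their estimated variances u_1..u_m
    Returns (q_bar, b_m, u_bar, t_p), where t_p = b_m / m + u_bar.
    Assumption: b_m is the sample variance of the q_i (ddof=1).
    """
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    if q.shape != u.shape or q.ndim != 1:
        raise ValueError("q and u must be 1-D arrays of the same length")
    m = q.size
    if m < 2:
        raise ValueError("at least two synthetic datasets are needed for b_m")

    q_bar = q.mean()        # point estimate: Q_hat = q_bar_m
    b_m = q.var(ddof=1)     # between-synthesis variance
    u_bar = u.mean()        # average within-dataset variance
    t_p = b_m / m + u_bar   # variance estimate T_p
    return q_bar, b_m, u_bar, t_p


if __name__ == "__main__":
    # Toy values for illustration only, not results from the paper.
    q_bar, b_m, u_bar, t_p = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(q_bar, b_m, u_bar, t_p)
```

Returning the two components $b_m/m$ and $\bar{u}_m$ separately makes it easy to see which source of variability dominates, which is the comparison the paper's figures report.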

Abstract

Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we study here, synthetic datasets from confidential data. Methods to generate such datasets are increasingly numerous and rely on varied tools, including Bayesian models, deep neural networks and copulas. However, little is known about how to properly perform statistical inference with these differentially private synthetic (DIPS) datasets. The challenge is for the analyses to take into account the variability from the synthetic data generation in addition to the usual sampling variability. A similar challenge occurs when missing data are imputed before analysis, and statisticians have developed appropriate inference procedures for this case, which were then extended to the case of synthetic datasets for privacy. In this work, we study the applicability of these procedures, based on combining rules, to the analysis of DIPS datasets. Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.

Paper Structure (15 sections, 3 theorems, 10 equations, 4 figures, 11 tables)

Introduction
Differential Privacy
Definition
Generation of differentially private synthetic datasets
Statistical inference from synthetic data with differential privacy
Simulations
Simulation 1: continuous data with normal variables
Simulation setting
Results for the means
Results for slopes for regression
Simulation 2: simulation with continuous data which contains one variable with a highly skewed distribution.
Simulation 3: binary data
Simulation setting
Results for the probability of success
Discussion

Key Result

Theorem 1

Post-processing immunity. Let $\mathcal{K}:\mathcal{D} \to \mathrm{R}$ be a randomized algorithm that satisfies $\varepsilon$-DP. Let $f:\mathrm{R} \to \mathrm{R'}$ be an arbitrary function, independent of $\mathcal{D}$. Then $f \circ \mathcal{K} : \mathcal{D} \to \mathrm{R'}$ also satisfies $\varepsilon$-DP.
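Post-processing immunity can be illustrated with a toy release. The sketch below is an assumption-laden example, not a mechanism taken from the paper: it releases the mean of data bounded in `[lower, upper]` with the Laplace mechanism (assuming a known sample size $n$ and neighbouring datasets that differ in one record, so the sensitivity is `(upper - lower) / n`), then applies a data-independent function $f$, here clamping to the valid range. The names `laplace_mean` and `clamp` are hypothetical.

```python
import numpy as np


def laplace_mean(data, lower, upper, epsilon, rng):
    """Release the mean of bounded data with Laplace noise.

    Assumes n is public and neighbouring datasets differ in one record,
    so the sensitivity of the mean is (upper - lower) / n.
    """
    data = np.clip(np.asarray(data, dtype=float), lower, upper)
    sensitivity = (upper - lower) / data.size
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(data.mean() + noise)


def clamp(value, lower, upper):
    """Data-independent post-processing f: project onto [lower, upper]."""
    return min(max(value, lower), upper)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = laplace_mean([0.2, 0.4, 0.9], 0.0, 1.0, epsilon=1.0, rng=rng)
    # clamp never looks at the confidential data, only at the noisy output,
    # so by post-processing immunity the clamped value has the same guarantee.
    released = clamp(noisy, 0.0, 1.0)
    print(noisy, released)
```

The same reasoning is why any analysis run on a DIPS dataset inherits the privacy guarantee of the synthesizer: the analysis is a function of the released synthetic data only.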

Figures (4)

Figure 1: Components of $T_p$ and densities of $\bar{q}_m$ of mean estimate for the variable $Y_1$ over 1000 replications for simulation 1.
Figure 2: Components of $T_p$, $T_p$ and $V_{mc}$ for estimate of slope for the variable $Y_2$ over 1000 replications for simulation 1.
Figure 3: Mean of components of $T_p$: $B_m/m$ and $\bar{u}_m$ for mean for normal variable $Y_2$ and skewed variable $Y_3$ over 1000 replications for simulation 2.
Figure 4: Densities for estimator $\bar{q}_m$ and mean of components of variance estimate $T_p$: $B_m/m$ and $\bar{u}_m$ for the combining rule estimate based on the sample proportion of variable $Y_1$ for simulation 3.

Theorems & Definitions (5)

Definition 1
Theorem 1
Theorem 2
Theorem 3
Definition 2
