Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala

TL;DR

A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy budgets of ϵ ≤ 1, calling for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy.

Abstract

Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off.

Objectives: To investigate the reliability of group differences identified by independent sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.

Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70 000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.

Conclusion: A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon \leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested, but it required a large original dataset size and a modest privacy budget ($\epsilon \geq 5$) in order to have reasonable Type II error.
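The kind of experiment the abstract describes can be illustrated with a minimal sketch. It is not the authors' code: it assumes a simple Laplace-perturbed histogram over a fixed value range, replace-one-record neighbouring datasets (L1 sensitivity 2), and a normal approximation to the Mann-Whitney U test without continuity or tie correction. The function names, bin count, value range, sample size, and repetition count are all hypothetical choices made for illustration.

```python
import math
import random


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_histogram_synthetic(values, labels, epsilon, n_syn, rng,
                           bins=10, lo=-4.0, hi=4.0):
    """Sample synthetic records from a Laplace-perturbed (group, bin) histogram."""
    width = (hi - lo) / bins
    cells = [(g, b) for g in range(2) for b in range(bins)]
    counts = dict.fromkeys(cells, 0)
    for v, g in zip(values, labels):
        b = min(bins - 1, max(0, int((v - lo) / width)))  # clip to fixed range
        counts[(g, b)] += 1
    # Assumption: replace-one-record neighbours, so the L1 sensitivity is 2.
    weights = [max(0.0, counts[c] + laplace(2.0 / epsilon, rng)) for c in cells]
    if sum(weights) == 0.0:
        weights = [1.0] * len(cells)  # degenerate case: fall back to uniform
    groups = ([], [])
    for g, b in rng.choices(cells, weights=weights, k=n_syn):
        groups[g].append(lo + (b + rng.random()) * width)  # uniform within bin
    return groups


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0  # test undefined for an empty group; treat as no rejection
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    r1 = sum(rank for rank, (_, g) in enumerate(pooled, start=1) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))


def type1_error(epsilon, n=100, reps=200, alpha=0.05, seed=0):
    """Share of no-signal repetitions in which the synthetic data rejects H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        values = [rng.gauss(0.0, 1.0) for _ in range(n)]
        labels = [rng.randrange(2) for _ in range(n)]  # groups carry no signal
        syn_x, syn_y = dp_histogram_synthetic(values, labels, epsilon, n, rng)
        if mann_whitney_p(syn_x, syn_y) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: estimated Type I error = {type1_error(eps):.3f}")
```

Because both groups are drawn from the same distribution, every rejection counted by `type1_error` is a false discovery; comparing the printed rates against `alpha` gives a rough sense of how the test's validity changes with the privacy budget in this toy setting.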

Paper Structure (22 sections, 5 equations, 10 figures, 3 tables, 1 algorithm)


Figures (10)

  • Figure 1: The overall configuration of the study.
  • Figure 2: Possible outcomes of a hypothesis test that tests whether two distributions are the same. TN: true negative, TP: true positive, FP: false positive (Type I error), FN: false negative (Type II error).
  • Figure 3: a) Prostate cancer (PCa) dataset: prostate-specific antigen (PSA) level distribution for high-risk and benign/low PCa. The difference between the groups is statistically significant (MW U statistic = 22713, p-value = 1.4e-07). b) Kaggle Cardiovascular disease dataset: body mass index (BMI) distribution for subjects with the absence and presence of cardiovascular disease. The difference between the groups is statistically significant (MW U statistic = 471500929.50, p-value $\cong$ 0.000).
  • Figure 4: The proportion of Type I and Type II errors for the Mann-Whitney U test using four differentially private (DP) methods: DP-MW U test, DP Perturbed Histogram, Private-PGM, and MWEM at different privacy budgets ($\epsilon$). The dataset size indicates the size of the original data used in the experiments by the DP methods. The proportions of Type I error and Type II error were measured over 1000 repetitions of the experiment using Gaussian a) non-signal data and b) signal data, respectively.
  • Figure 5: The proportion of Type I and Type II errors of the MW U test applied to synthetic data generated from DP Smoothed Histogram and DP GAN. The size of the original dataset is 20 000 with a group ratio of 50%. DP-synthetic data of sizes 50, 100, 500, and 1000 were generated from both methods. The proportions of Type I error and Type II error were measured over 1000 DP-synthetic datasets using Gaussian a) non-signal data and b) signal data, respectively.
  • ...and 5 more figures