Differentially Private Verification of Survey-Weighted Estimates

Tong Lin, Jerome P. Reiter

TL;DR

The paper addresses how to verify the quality of synthetic data for survey-weighted estimates under complex sampling. It introduces a differentially private (DP) verification scheme that uses the sub-sample and aggregate approach to compare confidential-data estimates with synthetic-data estimates. It extends prior verification ideas to probability proportional to size (PPS) designs, adding DP noise via $S^R$ and using a Bayesian post-processing step to infer the probability $r$ that partition estimates fall within a tolerance around the synthetic result. The main finding is that adjusted tolerance intervals (inflated by $\gamma=\sqrt{M}$) provide more reliable, privacy-preserving verification than fixed thresholds, for both totals and means and in both representative and biased synthesis scenarios. Practically, the method gives agencies a way to deliver feedback on synthetic-data quality while controlling disclosure risk, and it highlights the importance of accounting for the survey design and of choosing the partition parameter $M$ through simulation.
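
The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: that $S^R$ is a Laplace-noised count of the $M$ partitions whose weighted total lies within the tolerance, that partition totals are rescaled by $M$, that the count is modeled as Binomial($M$, $r$), and that $r$ has a uniform prior evaluated on a grid. The function names `noisy_count`, `posterior_r`, and `posterior_median` are hypothetical.

```python
import math
import random

def noisy_count(y, w, synth_estimate, tol, M, epsilon, rng):
    """Return an assumed form of S^R: a Laplace-noised count of partitions whose
    survey-weighted total is within +/- tol of the synthetic-data estimate."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    parts = [idx[k::M] for k in range(M)]  # M disjoint partitions of the sample
    S = 0
    for part in parts:
        # Assumption: each partition holds about 1/M of the sample, so its
        # weighted sum is scaled by M to estimate the population total.
        est = M * sum(w[i] * y[i] for i in part)
        if abs(est - synth_estimate) <= tol:
            S += 1
    # Each record sits in exactly one partition, so changing it moves S by at
    # most 1. The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return S + noise

def posterior_r(s_noisy, M, epsilon, grid_size=501):
    """Grid posterior for r given S^R, assuming S ~ Binomial(M, r),
    S^R = S + Laplace(1/epsilon), and a uniform prior on r."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    d_min = min(abs(s_noisy - s) for s in range(M + 1))
    # Laplace kernel for each candidate true count s, shifted by d_min so that
    # at least one term equals 1 and the sum cannot underflow to zero.
    lap = [math.exp(-epsilon * (abs(s_noisy - s) - d_min)) for s in range(M + 1)]
    post = []
    for r in grid:
        like = sum(
            lap[s] * math.comb(M, s) * r**s * (1 - r) ** (M - s)
            for s in range(M + 1)
        )
        post.append(like)
    total = sum(post)
    return grid, [p / total for p in post]

def posterior_median(grid, post):
    """Smallest grid value whose cumulative posterior mass reaches 0.5."""
    acc = 0.0
    for r, p in zip(grid, post):
        acc += p
        if acc >= 0.5:
            return r
    return grid[-1]
```

In this sketch, `w` stands for design weights such as inverse inclusion probabilities under PPS sampling, and `tol` is whatever tolerance the analyst requests. One reading of the adjusted interval, again an assumption, is to pass `tol * math.sqrt(M)` in place of `tol`, since each partition estimate is based on roughly $1/M$ of the sample.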

Abstract

Several official statistics agencies release synthetic data as public use microdata files. In practice, synthetic data do not admit accurate results for every analysis. Thus, it is beneficial for agencies to provide users with feedback on the quality of their analyses of the synthetic data. One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data. However, such measures leak information about the confidential records, so that agencies may wish to apply disclosure control methods to the released verification measures. We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design. We illustrate the verification measure using repeated sampling simulations where the confidential data are sampled with a probability proportional to size design, and the analyst estimates a population total or mean with the synthetic data. The simulations suggest that the verification measures can provide useful information about the quality of synthetic data inferences.

Paper Structure (13 sections, 6 equations, 5 figures)

Figures (5)

  • Figure 1: $r_{full}$ (red points) and posterior medians of $r$ (box plots) using fixed tolerance intervals for the population total. Synthetic data are a SRS from $P$.
  • Figure 2: $r_{full}$ (red points) and posterior medians of $r$ (box plots) using adjusted tolerance intervals for the population total. Synthetic data are a SRS from $P$.
  • Figure 3: $r_{full}$ (red points) and posterior medians of $r$ (box plots) using adjusted tolerance intervals for the population total. Synthetic data are a biased sample.
  • Figure 4: $r_{full}$ (red points) and posterior medians of $r$ (box plots) using adjusted tolerance intervals for the population average. Synthetic data are a SRS from $P$.
  • Figure 5: $r_{full}$ (red points) and posterior medians of $r$ (box plots) using adjusted tolerance intervals for the population average. Synthetic data are a biased sample.