Differentially Private Data Generation with Missing Data

Shubhankar Mohapatra, Jianqiao Zong, Florian Kerschbaum, Xi He

TL;DR

This work addresses generating synthetic data under differential privacy when the input data contains missing values. It formalizes two DP problem variants (privacy for the incomplete data $D$ and privacy for the ground-truth data $\bar{D}$) and models the missing mechanism $M_{\boldsymbol{\Phi}}$ as a sampling process to derive tighter privacy bounds, including amplification effects under MCAR. The authors propose three adaptive recourse strategies (DP-MisGAN for GAN-based generation, PrivBayesE for partial-marginal observations, and KaminoI for column-wise generation) that integrate missing-data handling into the learning process without extra privacy budget, improving utility by 15–72% on four real datasets. They also analyze how missingness can amplify privacy for the ground-truth data, yielding bounds such as $0.1\times$–$0.65\times$ the incomplete-data privacy under certain MCAR conditions, and provide extensive empirical evidence of utility gains across MAR/MNAR and MCAR settings. Overall, the work advances understanding of private synthetic data generation in the presence of missing data and offers practical adaptive methods and privacy-amplification results to guide real-world deployments.

Abstract

Although several works succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the problems of DP synthetic data with missing values and propose three effective adaptive strategies that significantly improve the utility of the synthetic data on four real-world datasets with different types and levels of missing data and privacy requirements. We also identify the relationship between the privacy impact on the complete ground-truth data and on the incomplete data for these DP synthetic data generation algorithms. We model the missing mechanisms as a sampling process to obtain tighter upper bounds on the privacy guarantees for the ground-truth data. Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms in the presence of missing data.
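
To make the sampling view concrete, the sketch below evaluates the standard amplification-by-subsampling bound, $\epsilon' = \ln(1 + q(e^{\epsilon} - 1))$. This is the textbook bound for a mechanism run on a random subsample and is not necessarily the bound derived in the paper. The function name `amplified_epsilon` and the reading of `q` as the probability that a record is observed are assumptions made for illustration.

```python
import math


def amplified_epsilon(epsilon: float, q: float) -> float:
    """Textbook amplification-by-subsampling bound.

    epsilon: privacy parameter of the mechanism on the sampled data.
    q: sampling probability (assumed here to stand in for the chance that
       a record is observed; this mapping is an illustrative assumption).
    Returns ln(1 + q * (e^epsilon - 1)).
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    # log1p/expm1 keep the computation accurate for small epsilon.
    return math.log1p(q * math.expm1(epsilon))


if __name__ == "__main__":
    eps = 1.0
    for q in (0.1, 0.5, 0.9, 1.0):
        eps_prime = amplified_epsilon(eps, q)
        print(f"q={q:.1f}  amplified epsilon={eps_prime:.3f}  ratio={eps_prime / eps:.3f}")
```

With `q = 1` nothing is hidden and the bound returns the original $\epsilon$; as `q` shrinks, the effective privacy parameter shrinks with it, which is the qualitative effect the abstract attributes to missingness.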

Paper Structure

This paper contains 8 sections, 2 theorems, 1 equation, 1 figure.

Key Result

theorem 1

[DBLP:conf/sigmod/McSherry09] A transformation $T(\cdot)$ is $c$-stable if, for any two datasets $D$ and $D'$, the distance between $T(D)$ and $T(D')$ is at most $c$ times the distance between $D$ and $D'$. For any $(\epsilon, \delta)$-DP mechanism $\mathcal{M}$, the composite mechanism $\mathcal{M} \circ T$ is then $(c \cdot \epsilon, \delta)$-DP.
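
As a minimal sketch of what $c$-stability means, the example below uses a hypothetical transformation that repeats every record $c$ times and checks that the distance between the transformed neighbours grows by at most a factor of $c$. The helper names (`multiset_distance`, `duplicate_rows`, `composed_budget`) and the choice of symmetric-difference size as the distance are assumptions for illustration. `composed_budget` simply restates the scaling given in the theorem.

```python
from collections import Counter


def multiset_distance(a, b):
    """Size of the symmetric difference between two multisets of records."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together give the symmetric difference.
    return sum(((ca - cb) + (cb - ca)).values())


def duplicate_rows(data, c):
    """Hypothetical c-stable transformation: repeat every record c times."""
    return [row for row in data for _ in range(c)]


def composed_budget(c, epsilon, delta):
    """Budget of M composed with a c-stable T, as stated in the theorem."""
    return c * epsilon, delta


D = ["a", "b", "c"]
D_prime = ["a", "b"]  # neighbouring dataset: one record removed
c = 2

before = multiset_distance(D, D_prime)
after = multiset_distance(duplicate_rows(D, c), duplicate_rows(D_prime, c))

# Stability check: the transformed distance is at most c times the original.
assert after <= c * before

print("distance before:", before, "after:", after)
print("composed budget:", composed_budget(c, 0.5, 1e-6))
```

The check covers only one pair of neighbours, so it illustrates the definition rather than proving stability for the transformation in general.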

Figures (1)

  • Figure 1: The complete-row-only approach yields poor results under the MAR and MNAR missing mechanisms.

Theorems & Definitions (3)

  • definition 1: Differential Privacy (DP) [DBLP:journals/fttcs/DworkR14, DBLP:conf/eurocrypt/DworkKMMN06]
  • theorem 1
  • lemma 1