Differentially Private Data Generation with Missing Data

Shubhankar Mohapatra; Jianqiao Zong; Florian Kerschbaum; Xi He

TL;DR

This work addresses generating synthetic data under differential privacy (DP) when the input data contains missing values. It formalizes two DP problem variants, privacy for the incomplete data $D$ and privacy for the ground-truth data $\bar{D}$, and models the missing mechanism $M_{\boldsymbol{\Phi}}$ as a sampling process to derive tighter privacy bounds, including amplification effects under MCAR. The authors propose three adaptive recourse strategies (DP-MisGAN for GAN-based generation, PrivBayesE for partial-marginal observations, and KaminoI for column-wise generation) that integrate missing-data handling into the learning process without extra privacy budget, improving utility by up to 15–72% on four real datasets. They also analyze how missingness can amplify privacy for the ground-truth data, yielding bounds such as $0.1$–$0.65\times$ the incomplete-data privacy under certain MCAR conditions, and provide extensive empirical evidence of utility gains across MAR/MNAR and MCAR settings. Overall, the work advances understanding of private synthetic data generation in the presence of missing data and offers practical adaptive methods and privacy-amplification results to guide real-world deployments.

Abstract

Although several works succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the problems of DP synthetic data generation with missing values and propose three effective adaptive strategies that significantly improve the utility of the synthetic data on four real-world datasets with different types and levels of missing data and privacy requirements. We also identify the relationship between the privacy impact on the complete ground-truth data and on the incomplete data for these DP synthetic data generation algorithms. We model the missing mechanisms as a sampling process to obtain tighter upper bounds for the privacy guarantees on the ground-truth data. Overall, this study contributes to a better understanding of the challenges and opportunities for using private synthetic data generation algorithms in the presence of missing data.
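To give a feel for why treating missingness as a sampling process can tighten the guarantee, the sketch below evaluates the textbook amplification-by-subsampling bound $\epsilon' = \ln(1 + q(e^{\epsilon} - 1))$. It is an illustration under assumptions, not the paper's own derivation: it assumes whole rows are retained independently with probability `keep_prob` (a row-level MCAR simplification), and the function name `amplified_epsilon` is hypothetical. The bounds proved in the paper may differ.

```python
import math


def amplified_epsilon(epsilon: float, keep_prob: float) -> float:
    """Textbook subsampling bound: ln(1 + q * (e^eps - 1)).

    Assumption (not from the paper): each ground-truth row survives
    independently with probability keep_prob, and an epsilon-DP
    mechanism is run on the surviving rows only.
    """
    if epsilon < 0.0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    # expm1 and log1p keep the computation accurate for small epsilon.
    return math.log1p(keep_prob * math.expm1(epsilon))


if __name__ == "__main__":
    eps = 1.0
    for q in (0.1, 0.5, 0.9):
        ratio = amplified_epsilon(eps, q) / eps
        print(f"keep_prob={q:.1f}  amplified/original = {ratio:.3f}")
```

The ratio printed for each `keep_prob` is always at most 1: the more rows go missing, the smaller the effective privacy parameter for the ground-truth data under this simplified model.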

Paper Structure (8 sections, 2 theorems, 1 equation, 1 figure)


Introduction
Preliminaries
Missing Data
Differential Privacy
DP Synthetic Data Generation
Problem Statement
Privacy for Incomplete Data
Vanilla Approaches

Key Result

theorem 1

(McSherry, SIGMOD 2009) A transformation $T(\cdot)$ is $c$-stable if, for any two datasets $D$ and $D'$, the distance between $T(D)$ and $T(D')$ is at most $c$ times the distance between $D$ and $D'$. The composite mechanism $\mathcal{M} \circ T$ is then $(c \cdot \epsilon, \delta)$-DP for any $(\epsilon, \delta)$-DP mechanism $\mathcal{M}$.
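The stability notion above can be checked on a toy example. The sketch below is illustrative only: it assumes the distance is the size of the multiset symmetric difference between row collections, and the helpers `multiset_distance`, `drop_incomplete_rows`, `duplicate_rows`, and `composed_epsilon` are hypothetical names, not taken from the paper. Dropping incomplete rows acts on each row independently, so it behaves as a 1-stable transformation here; duplicating every row is 2-stable.

```python
from collections import Counter


def multiset_distance(a, b):
    """Size of the symmetric difference between two multisets of rows."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts.
    return sum(((ca - cb) + (cb - ca)).values())


def drop_incomplete_rows(rows):
    """Per-row filter: keep rows with no missing (None) cell."""
    return [r for r in rows if None not in r]


def duplicate_rows(rows):
    """Emit every row twice; one changed input row changes two outputs."""
    return list(rows) + list(rows)


def composed_epsilon(epsilon, c):
    """Privacy parameter of M composed with a c-stable T, per the theorem."""
    return c * epsilon


# Two toy datasets that differ in exactly one row.
D = [(1, 2), (3, None), (5, 6)]
D_prime = D + [(7, 8)]

before = multiset_distance(D, D_prime)
after_filter = multiset_distance(drop_incomplete_rows(D),
                                 drop_incomplete_rows(D_prime))
after_dup = multiset_distance(duplicate_rows(D), duplicate_rows(D_prime))

print("input distance:", before)
print("after dropping incomplete rows:", after_filter)
print("after duplicating rows:", after_dup)
print("epsilon after a 2-stable step, base 0.5:", composed_epsilon(0.5, 2))
```

A transformation that looks at other rows (for example, imputing a missing cell from a column mean) would not generally pass this per-row check, which is one reason stability matters when preprocessing incomplete data before a DP mechanism.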

Figures (1)

Figure 1: The complete-row-only approach yields poor results under the MAR and MNAR missing mechanisms.
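A small simulation can illustrate why keeping only complete rows is risky when missingness depends on the data. This is a toy sketch, not the paper's experiment: it uses no differential privacy at all, draws a single Gaussian column, and the missingness probabilities (0.3 for MCAR; 0.6 above the mean and 0.1 below it for MNAR) are assumed values chosen for illustration. The function name `complete_case_bias` is hypothetical.

```python
import random
import statistics


def complete_case_bias(mechanism, n=20000, seed=0):
    """Mean of the observed values minus the mean of all values.

    mechanism: "MCAR" (missingness independent of the value) or
               "MNAR" (larger values are more likely to be missing).
    """
    rng = random.Random(seed)
    values = [rng.gauss(50.0, 10.0) for _ in range(n)]
    observed = []
    for v in values:
        if mechanism == "MCAR":
            p_missing = 0.3
        elif mechanism == "MNAR":
            p_missing = 0.6 if v > 50.0 else 0.1
        else:
            raise ValueError("mechanism must be 'MCAR' or 'MNAR'")
        # The row survives the complete-row filter only if not missing.
        if rng.random() >= p_missing:
            observed.append(v)
    return statistics.fmean(observed) - statistics.fmean(values)


if __name__ == "__main__":
    for mech in ("MCAR", "MNAR"):
        print(mech, "bias of complete-row mean:",
              round(complete_case_bias(mech), 3))
```

Under MCAR the surviving rows are a uniform sample, so the bias stays near zero; under MNAR the surviving rows over-represent small values, and any generator trained on them inherits that shift.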

Theorems & Definitions (3)

definition 1: Differential Privacy (DP) (Dwork and Roth, 2014; Dwork et al., EUROCRYPT 2006)
theorem 1
lemma 1
