Privacy-preserving data release leveraging optimal transport and particle gradient descent

Konstantin Donhauser; Javier Abad; Neha Hulkund; Fanny Yang

TL;DR

This paper introduces PrivPGD, a new generation method for marginal-based private data synthesis that leverages tools from optimal transport and particle gradient descent. It outperforms existing methods on a large range of datasets, is highly scalable, and offers the flexibility to incorporate additional domain-specific constraints.

Abstract

We present a novel approach for differentially private data synthesis of protected tabular datasets, a relevant task in highly sensitive domains such as healthcare and government. Current state-of-the-art methods predominantly use marginal-based approaches, where a dataset is generated from private estimates of the marginals. In this paper, we introduce PrivPGD, a new generation method for marginal-based private data synthesis, leveraging tools from optimal transport and particle gradient descent. Our algorithm outperforms existing methods on a large range of datasets while being highly scalable and offering the flexibility to incorporate additional domain-specific constraints.
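
To make "private estimates of the marginals" concrete, the following is a minimal sketch of privatizing one 2-way marginal by adding Gaussian noise to a contingency table. It is an illustration under stated assumptions, not the paper's implementation: the helper name `noisy_two_way_marginal`, the integer encoding of categories, and the noise scale `sigma` (assumed to be calibrated elsewhere to the desired privacy budget) are all hypothetical.

```python
import numpy as np


def noisy_two_way_marginal(data, cols, domain_sizes, sigma, rng):
    """Noisy, normalized 2-way marginal of an integer-coded table.

    Hypothetical helper for illustration only.
    data:         (n, d) integer array, column k takes values in [0, domain_sizes[k])
    cols:         pair (i, j) of column indices
    sigma:        Gaussian noise scale, ASSUMED to be calibrated to the
                  privacy budget by the caller (not computed here)
    """
    i, j = cols
    counts = np.zeros((domain_sizes[i], domain_sizes[j]))
    # Accumulate the contingency table of columns i and j.
    np.add.at(counts, (data[:, i], data[:, j]), 1.0)
    # Perturb every cell with independent Gaussian noise.
    noisy = counts + rng.normal(0.0, sigma, size=counts.shape)
    # Post-processing: clip negative cells and renormalize to a distribution.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total == 0.0:
        # Degenerate case: fall back to the uniform distribution.
        return np.full(counts.shape, 1.0 / counts.size)
    return noisy / total
```

Clipping and renormalizing only post-process the noisy counts, so they do not affect the privacy guarantee of the noise-adding step. A generation method then has to produce a dataset whose marginals are close to such noisy estimates.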

Paper Structure (40 sections, 1 theorem, 13 equations, 9 figures, 6 tables, 3 algorithms)

Introduction
Related work
Marginal-based algorithms
Query-based algorithms
Other Algorithms
Preliminaries for differentially private data synthesis
Marginal-based algorithms for private data synthesis
Marginal selection
Marginal privatization
Data generation
Sliced Wasserstein distance
PrivPGD: a particle gradient descent-based generation method
Preliminaries: Embedding
Particles
Projection step
...and 25 more sections

Key Result

Lemma 1

Assume that the noisy marginals $\{\hat{\nu}_{S}\}_{ S \in \mathcal{S}}$ and the regularization loss $\hat{\mathcal{R}}$ are generated from two independent $(\epsilon_1, \delta_1)$- and $(\epsilon_2, \delta_2)$-DP mechanisms. Then, the output of Algorithm \ref{alg:pgd}, $D_{\text{DP}}$, is $(\epsilon_1 +

Figures (9)

Figure 1: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=4$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 2.5$ and $\delta = 10^{-5}$.
Figure 2: Comparison of average $\text{SW}_1$ distance (left) and average TV distance (right) for PrivPGD against state-of-the-art methods across 9 tabular datasets. As in Figure \ref{fig:comp}, we report the mean and standard deviation over 5 runs of the $\log_2$ ratio of errors. We set $\epsilon = 2.5$ and $\delta = 10^{-5}$.
Figure 3: The absolute error of the domain-specific query (larger is better), the downstream classification error (smaller is better), and the absolute error over counting and thresholding queries (smaller is better), i.e., only the numerator in Equation \ref{eq:countquery}, as a function of the log regularization strength $\lambda$. We plot the curves for (a) the Income and (b) the Employment dataset.
Figure 4: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=3$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 1.0$ and $\delta = 10^{-5}$.
Figure 5: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=3$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 0.2$ and $\delta = 10^{-5}$.
...and 4 more figures
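
Figure 2 reports an average $\text{SW}_1$ (sliced Wasserstein) distance. As a rough illustration of that kind of metric, the sketch below estimates a sliced 1-Wasserstein distance between two equally sized point clouds by averaging one-dimensional Wasserstein distances over random projection directions. The function name `sliced_w1`, the uniform random directions, and the equal-sample-size restriction are assumptions made for this example; the paper's exact definition (embedding and choice of projections) is not shown in this excerpt.

```python
import numpy as np


def sliced_w1(x, y, n_proj, rng):
    """Monte Carlo estimate of a sliced 1-Wasserstein distance.

    Hypothetical helper for illustration only.
    x, y:   (n, d) arrays with the SAME number of points n (assumption)
    n_proj: number of random unit directions to average over
    """
    d = x.shape[1]
    # Draw random directions and normalize them to unit length.
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both clouds onto every direction and sort along the sample axis.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # For equal-size empirical measures in 1D, W_1 is the mean absolute
    # difference of sorted samples; averaging over columns averages over
    # the projection directions.
    return float(np.mean(np.abs(px - py)))
```

Because each projected problem is one-dimensional, the cost is dominated by sorting, which is what makes sliced distances attractive for large numbers of particles.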

Theorems & Definitions (4)

Definition 1
Definition 2
Definition 3
Lemma 1
