Privacy-preserving data release leveraging optimal transport and particle gradient descent

Konstantin Donhauser, Javier Abad, Neha Hulkund, Fanny Yang

TL;DR

The paper introduces PrivPGD, a generation method for marginal-based private data synthesis that leverages tools from optimal transport and particle gradient descent. It outperforms existing methods on a large range of datasets, is highly scalable, and offers the flexibility to incorporate additional domain-specific constraints.

Abstract

We present a novel approach for differentially private data synthesis of protected tabular datasets, a relevant task in highly sensitive domains such as healthcare and government. Current state-of-the-art methods predominantly use marginal-based approaches, where a dataset is generated from private estimates of the marginals. In this paper, we introduce PrivPGD, a new generation method for marginal-based private data synthesis, leveraging tools from optimal transport and particle gradient descent. Our algorithm outperforms existing methods on a large range of datasets while being highly scalable and offering the flexibility to incorporate additional domain-specific constraints.
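
To make the phrase "private estimates of the marginals" concrete, here is a minimal sketch of one common way to privatize a 2-way marginal: count co-occurrences, add Gaussian noise, then clip and renormalize. This is a generic illustration, not the mechanism used in the paper. The function name `noisy_two_way_marginal` and the noise scale `sigma` are hypothetical, and calibrating `sigma` to a target $(\epsilon, \delta)$ is deliberately left out.

```python
import random


def noisy_two_way_marginal(rows, col_a, col_b, size_a, size_b, sigma, rng):
    """Return a noisy, normalized 2-way marginal of two discrete columns.

    rows:   iterable of tuples with integer-coded categorical values
    sigma:  standard deviation of the Gaussian noise (hypothetical parameter;
            calibration to a privacy budget is not shown here)
    """
    # Exact contingency table of the two selected columns.
    counts = [[0.0] * size_b for _ in range(size_a)]
    for row in rows:
        counts[row[col_a]][row[col_b]] += 1.0

    # Add independent Gaussian noise to every cell, then clip negatives.
    noisy = [[max(c + rng.gauss(0.0, sigma), 0.0) for c in line] for line in counts]

    # Renormalize so the result is a probability distribution.
    total = sum(sum(line) for line in noisy)
    if total == 0.0:
        uniform = 1.0 / (size_a * size_b)
        return [[uniform] * size_b for _ in range(size_a)]
    return [[v / total for v in line] for line in noisy]


# Toy data: 1000 rows with three categorical columns of sizes 3, 4, and 2.
rng = random.Random(0)
rows = [(rng.randrange(3), rng.randrange(4), rng.randrange(2)) for _ in range(1000)]
marginal = noisy_two_way_marginal(rows, 0, 1, 3, 4, sigma=5.0, rng=rng)
```

A marginal-based synthesizer would then search for a dataset whose own marginals are close to noisy tables such as `marginal`.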

Paper Structure (40 sections, 1 theorem, 13 equations, 9 figures, 6 tables, 3 algorithms)

Key Result

Lemma 1

Assume that the noisy marginals $\{\hat{\nu}_{S}\}_{S \in \mathcal{S}}$ and the regularization loss $\hat{\mathcal{R}}$ are generated from two independent $(\epsilon_1, \delta_1)$- and $(\epsilon_2, \delta_2)$-DP mechanisms. Then, the output of Algorithm \ref{alg:pgd}, $D_{\text{DP}}$, is $(\epsilon_1 +
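
The statement above is cut off in this excerpt. For orientation, the standard basic composition and post-processing properties of differential privacy give the bound below for any output $f$ computed only from two independent mechanisms $M_1$ and $M_2$. That the lemma concludes exactly this is an assumption, not something the truncated text confirms. The symbols $M_1$, $M_2$, $f$, $D$, $D'$, and $A$ are introduced here for illustration.

```latex
% Basic composition: if M_1 is (\epsilon_1, \delta_1)-DP and M_2 is (\epsilon_2, \delta_2)-DP,
% the pair (M_1(D), M_2(D)) is (\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)-DP.
% Post-processing: any function f of that pair satisfies the same guarantee.
\[
  \Pr\big[ f(M_1(D), M_2(D)) \in A \big]
  \le e^{\epsilon_1 + \epsilon_2} \, \Pr\big[ f(M_1(D'), M_2(D')) \in A \big] + \delta_1 + \delta_2
\]
% for all neighboring datasets D, D' and all measurable sets A.
```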

Figures (9)

  • Figure 1: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=4$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 2.5$ and $\delta = 10^{-5}$.
  • Figure 2: Comparison of average $\text{SW}_1$ distance (left) and average TV distance (right) for PrivPGD against state-of-the-art methods across 9 tabular datasets. Similar to Figure \ref{fig:comp}, we report the mean and standard deviation (5 runs) of the $\log_2$ ratio of errors. We set $\epsilon = 2.5$ and $\delta = 10^{-5}$.
  • Figure 3: The absolute error of the domain-specific query (larger is better), the downstream classification error (smaller is better), and the absolute error over counting and thresholding queries (smaller is better), i.e., only the numerator in Equation \ref{eq:countquery}, as a function of the log regularization strength $\lambda$. We plot the curves for (a) the Income and (b) the Employment dataset.
  • Figure 4: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=3$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 1.0$ and $\delta = 10^{-5}$.
  • Figure 5: Comparison of PrivPGD with all $2$-way marginals against state-of-the-art methods based on metrics from Section \ref{subsec:expsetting}: 1) downstream error, 2) covariance error, 3) counting queries error, and 4) thresholding queries error, across 9 tabular datasets. For each method, we plot the $\log_2$ ratio of the errors, using PrivPGD's average error as the denominator, and report the mean and standard deviation over 5 runs. We cut at a log ratio of $y=3$ (dashed line) and list all methods exceeding this threshold above this line in order. We set $\epsilon = 0.2$ and $\delta = 10^{-5}$.
  • ...and 4 more figures
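
The captions above report a $\log_2$ ratio of errors with PrivPGD's average error as the denominator. The sketch below shows one plausible reading of that quantity: divide each run's error by the reference method's mean error, take $\log_2$, then summarize over runs. The function name `log2_error_ratios`, the use of the sample standard deviation, and all numeric values are assumptions made for illustration, not results or code from the paper.

```python
import math
import statistics


def log2_error_ratios(method_errors, reference_errors):
    """Mean and sample std of log2(method error / mean reference error).

    method_errors:    per-run errors of the method being compared
    reference_errors: per-run errors of the reference method
    """
    ref_mean = statistics.mean(reference_errors)
    ratios = [math.log2(err / ref_mean) for err in method_errors]
    return statistics.mean(ratios), statistics.stdev(ratios)


# Made-up error values over 5 runs, for illustration only.
baseline_runs = [0.20, 0.22, 0.18, 0.21, 0.19]
reference_runs = [0.10, 0.11, 0.09, 0.10, 0.10]
mean_ratio, std_ratio = log2_error_ratios(baseline_runs, reference_runs)
```

Under this reading, a positive mean ratio indicates that the compared method has a larger error than the reference, and a value of 1 corresponds to twice the reference's average error.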

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1