Beyond One-Size-Fits-All: Neural Networks for Differentially Private Tabular Data Synthesis

Kai Chen, Chen Gong, Tianhao Wang

TL;DR

The paper tackles DP tabular data synthesis by challenging the notion that statistical methods are universally superior. It introduces MargNet, a neural-network framework that uses adaptive marginal selection to learn and generate data consistent with the selected marginals under DP guarantees, with the aim of handling densely correlated attributes. The authors provide theoretical bounds on marginal fitting errors and show through extensive experiments that MargNet achieves strong utility on densely correlated data, competitive performance on sparsely correlated data with significant speedups, and strengths complementary to those of AIM. The work suggests that algorithm choice should be dataset-dependent and that neural-network-based approaches, when carefully designed around marginal information, can offer scalable, high-utility DP synthesis for complex tabular data.

Abstract

In differentially private (DP) tabular data synthesis, the consensus is that statistical models are better than neural network (NN)-based methods. However, we argue that this conclusion is incomplete and overlooks the challenge of densely correlated datasets, where intricate dependencies can overwhelm statistical models. In such complex scenarios, neural networks are more suitable due to their capacity to fit complex distributions by learning directly from samples. Despite this potential, existing NN-based algorithms still suffer from significant limitations. We therefore propose MargNet, incorporating successful algorithmic designs of statistical models into neural networks. MargNet applies an adaptive marginal selection strategy and trains the neural networks to generate data that conforms to the selected marginals. On sparsely correlated datasets, our approach achieves utility close to the best statistical method while offering an average $7\times$ speedup over it. More importantly, on densely correlated datasets, MargNet establishes a new state-of-the-art, reducing fidelity error by up to 26% compared to the previous best. We release our code on GitHub (https://github.com/KaiChen9909/margnet).
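To make "data that conforms to the selected marginals" concrete, here is a minimal sketch, not the authors' implementation. It computes a marginal as a normalized contingency table over chosen columns and compares real and synthetic data on that marginal. The helper names `marginal` and `tv_distance`, the toy rows, and the use of total variation distance as the error measure are assumptions made for illustration; the paper's fidelity metric may be defined differently, and no DP noise is added here.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, ...]


def marginal(rows: Sequence[Sequence[int]], cols: Tuple[int, ...]) -> Dict[Cell, float]:
    """Normalized contingency table of `rows` restricted to the columns in `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return {cell: k / n for cell, k in counts.items()}


def tv_distance(p: Dict[Cell, float], q: Dict[Cell, float]) -> float:
    """Total variation distance between two marginals (assumed error measure)."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Toy categorical data with two binary attributes (hypothetical values).
real = [(0, 1), (0, 1), (1, 0), (1, 1)]
synthetic = [(0, 1), (1, 0), (1, 0), (1, 1)]

real_m = marginal(real, (0, 1))        # two-way marginal of the real data
synth_m = marginal(synthetic, (0, 1))  # same marginal on the synthetic data
error = tv_distance(real_m, synth_m)   # 0.25 for these toy rows
```

A synthesizer that fits the selected marginals well drives this kind of per-marginal error toward zero for every marginal it was trained on.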

Paper Structure

This paper contains 42 sections, 12 theorems, 35 equations, 9 figures, 11 tables, 4 algorithms.

Key Result

Proposition 1

Let $f: \mathcal{D} \rightarrow \mathcal{R}_1$ be $\rho_1$-zCDP and $g: \mathcal{R}_1 \times \mathcal{D} \rightarrow \mathcal{R}_2$ be $\rho_2$-zCDP. Then the mechanism that outputs $(X,Y)$, where $X \sim f(D)$ and $Y \sim g(X, D)$, satisfies $(\rho_1+\rho_2)$-zCDP.
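As an illustration of how this composition rule is typically used for privacy accounting, the sketch below adds up the zCDP cost of two Gaussian-mechanism releases and converts the total to an $(\varepsilon, \delta)$ guarantee. It relies on two standard results from the zCDP literature that are not shown in this excerpt: the Gaussian mechanism with $L_2$ sensitivity $\Delta$ and noise scale $\sigma$ satisfies $\Delta^2/(2\sigma^2)$-zCDP, and $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP. The function names and the numeric values are hypothetical, chosen only for the example.

```python
import math


def gaussian_zcdp(sensitivity: float, sigma: float) -> float:
    """zCDP parameter rho of a Gaussian mechanism: sensitivity^2 / (2 * sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_zcdp(*rhos: float) -> float:
    """Sequential composition: the rho values of the individual mechanisms add up."""
    return sum(rhos)


def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Epsilon such that rho-zCDP implies (epsilon, delta)-DP (standard conversion)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Two releases on the same dataset, e.g. two noisy marginal measurements
# (sensitivity and noise scales are assumed values).
rho_1 = gaussian_zcdp(sensitivity=1.0, sigma=10.0)  # 0.005
rho_2 = gaussian_zcdp(sensitivity=1.0, sigma=5.0)   # 0.02
rho_total = compose_zcdp(rho_1, rho_2)              # 0.025, up to float rounding
epsilon = zcdp_to_approx_dp(rho_total, delta=1e-5)  # roughly 1.10
```

Because the costs simply add, an iterative procedure can split a fixed total $\rho$ across its rounds in advance and stop once the budget is spent.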

Figures (9)

  • Figure 1: Examples of PGM's attribute-grouping process. The blue nodes represent attributes, and the edges between nodes are marginals. A densely correlated dataset leads to cliques with larger domain sizes.
  • Figure 2: A brief illustration of MargNet. First, the model is initialized with all one-way marginals. Then, an adaptive marginal selection and model fitting step is conducted, which returns the model to synthesize data.
  • Figure 3: Heatmap of the absolute values of pairwise correlations in the real-world datasets. The darker the color, the stronger the correlation.
  • Figure 4: Evaluation results (machine learning efficacy, query error, and fidelity error) on sparsely correlated datasets. The three columns of figures depict the results on three datasets (Adult, Bank, and Loan) from left to right, respectively.
  • Figure 5: Different algorithms' query errors, fidelity errors, and running time on densely correlated datasets. The three columns of figures depict the results on three datasets, Gauss10, Gauss30, and Gauss50 from left to right, respectively.
  • ...and 4 more figures

Theorems & Definitions (18)

  • Definition 1: Differential Privacy
  • Definition 2: zero-Concentrated DP (Bun and Steinke, 2016)
  • Proposition 1: Composition
  • Proposition 2: Post-Processing
  • Proposition 3
  • Definition 3: Sensitivity
  • Proposition 4
  • Proposition 5
  • Definition 4: Tabular Dataset
  • Definition 5: Marginal
  • ...and 8 more