Benchmarking Differentially Private Tabular Data Synthesis

Kai Chen, Xiaochen Li, Chen Gong, Ryan McKenna, Tianhao Wang

TL;DR

This work tackles the difficulty of fairly evaluating DP tabular data synthesis by introducing a unified benchmark with preprocessing, feature selection, and synthesis modules. It formalizes DP via $(\alpha,\varepsilon)$-Rényi DP, and conducts in-depth, module-level analyses across both statistical and deep-learning methods, including recent approaches like AIM, PrivMRF, RAP++, Private-GSD, and TabDDPM. The experimental results across five public datasets reveal a pronounced utility-efficiency trade-off and demonstrate that preprocessing is crucial for fair comparisons and efficiency, with adaptive feature selection offering utility gains at some computational cost. The benchmark is open-source and designed to guide practitioners in choosing methods that balance privacy, utility, and compute constraints in real-world settings.

Abstract

Differentially private (DP) tabular data synthesis generates artificial data that preserves the statistical properties of private data while safeguarding individual privacy. The emergence of diverse algorithms in recent years has introduced challenges in practical applications, such as inconsistent data processing methods, the lack of in-depth algorithm analysis, and incomplete comparisons due to overlapping development timelines. These factors create significant obstacles to selecting appropriate algorithms. In this paper, we address these challenges by proposing a benchmark for evaluating tabular data synthesis methods. We present a unified evaluation framework that integrates data preprocessing, feature selection, and synthesis modules, facilitating fair and comprehensive comparisons. Our evaluation reveals that a significant utility-efficiency trade-off exists among current state-of-the-art methods. Some statistical methods are superior in synthesis utility, but their efficiency is not as good as most deep learning-based methods. Furthermore, we conduct an in-depth analysis of each module with experimental validation, offering theoretical insights into the strengths and limitations of different strategies. Our code is open-sourced at https://github.com/KaiChen9909/tab_bench.

Paper Structure

This paper contains 41 sections, 9 theorems, 24 equations, 8 figures, 26 tables, 2 algorithms.

Key Result

Theorem 1

If $f$ is an $(\alpha, \varepsilon)$-RDP mechanism, then it also satisfies $\left(\varepsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP for any $0 < \delta < 1$.
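
The conversion in Theorem 1 is a one-line formula, so a small sketch may help make it concrete. The function and parameter names below (`rdp_to_dp`, `best_dp_epsilon`, `rdp_curve`) are illustrative and do not come from the paper or its repository. Taking the minimum over several orders $\alpha$ is a common accounting practice assumed here, not something stated in the theorem itself.

```python
import math


def rdp_to_dp(alpha: float, rdp_epsilon: float, delta: float) -> float:
    """Return eps such that an (alpha, rdp_epsilon)-RDP mechanism is (eps, delta)-DP.

    Implements eps = rdp_epsilon + log(1/delta) / (alpha - 1) from Theorem 1.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return rdp_epsilon + math.log(1 / delta) / (alpha - 1)


def best_dp_epsilon(rdp_curve, delta: float) -> float:
    """Assumption: the mechanism satisfies RDP at every (alpha, rdp_epsilon) pair given.

    Each pair yields a valid (eps, delta)-DP guarantee, so the smallest eps is reported.
    """
    return min(rdp_to_dp(alpha, eps, delta) for alpha, eps in rdp_curve)


if __name__ == "__main__":
    # Hypothetical RDP guarantees at three orders, converted at delta = 1e-5.
    curve = [(2.0, 0.1), (8.0, 0.4), (32.0, 1.6)]
    print(best_dp_epsilon(curve, delta=1e-5))
```

The formula shows the trade-off in choosing $\alpha$: a larger order shrinks the $\log(1/\delta)/(\alpha - 1)$ term, but the RDP parameter $\varepsilon$ of a given mechanism typically grows with $\alpha$.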

Figures (8)

  • Figure 1: The proposed unified framework. The dataset is first preprocessed and then represented by some selected features. Finally, using the selected features, the synthesis algorithm generates data as the output of the workflow.
  • Figure 2: Heat map of pairwise absolute correlations and histogram of attribute distribution's Shannon entropy.
  • Figure 3: t-SNE scatter plots of synthesis results on the Bank dataset under $\varepsilon = 1.0$.
  • Figure 4: Figures for analyzing different methods in the overall experiments.
  • Figure 5: Scaled utility and efficiency of different algorithms. Average utility is obtained by taking the average of ML efficacy, query error, and fidelity error after normalizing them to $[0,1]$, guaranteeing their equal contribution to the aggregated metric. Other metrics are directly obtained from the average value of normalized original metrics.
  • ...and 3 more figures
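
The aggregation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the caption says only that each metric is normalized to $[0,1]$ before averaging, so min-max normalization across algorithms is assumed here, and the function names are hypothetical. The caption also does not say how higher-is-better and lower-is-better metrics are aligned, so the sketch assumes its inputs are already oriented the same way.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def average_utility(ml_efficacy, query_error, fidelity_error):
    """Equal-weight average of three normalized metrics, one value per algorithm.

    Each argument is a list with one entry per algorithm, in the same order.
    Assumption: all three metrics already point in the same direction.
    """
    normalized = [
        min_max_normalize(metric)
        for metric in (ml_efficacy, query_error, fidelity_error)
    ]
    # zip(*normalized) groups the three normalized scores of each algorithm.
    return [sum(scores) / len(scores) for scores in zip(*normalized)]


if __name__ == "__main__":
    # Made-up metric values for three hypothetical algorithms.
    print(average_utility([0.0, 0.5, 1.0], [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Normalizing before averaging keeps a metric with a large numeric range from dominating the aggregate, which matches the caption's stated goal of equal contribution.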

Theorems & Definitions (12)

  • Definition 1: Differential Privacy
  • Definition 2: Rényi DP (Mironov, 2017)
  • Theorem 1
  • Theorem 2
  • Theorem 3: Composition
  • Theorem 4: Post-Processing
  • Definition 3: Sensitivity
  • Theorem 5
  • Theorem 6
  • Lemma 1
  • ...and 2 more