Generating High-quality Privacy-preserving Synthetic Data

David Yavo, Richard Khoury, Christophe Pere, Sadoune Ait Kaci Azzou

TL;DR

This work presents a simple, model‑agnostic post‑processing pipeline for synthetic tabular data that targets two practical problems: categorical mode collapse and proximity‑based privacy leakage. It introduces a layer‑frozen mode‑patching step to restore missing categorical modes without retraining from scratch, and a HEOM–$k$NN privacy filter that rejects synthetic rows lying within real‑data neighborhoods, controlled by a threshold $\tau_{\text{ANY}}$. The framework is instantiated with CTGAN and TVAE across three public datasets (Credit, Cardio, Adult), and evaluated on fidelity (univariate/multivariate resemblance), utility (TSTR), and privacy (AIA, CAP, DCR, RPR) metrics. Across datasets, moderate $\tau_{\text{ANY}}$ values (roughly 0.2–0.3) tend to improve categorical and multivariate structure while preserving downstream predictive performance to within ~1% of unfiltered baselines and improving distance‑based privacy indicators. The results offer practical guidance for applying post‑hoc repairs to synthetic tabular data and complement approaches that provide formal differential privacy guarantees, while highlighting the need for multi‑metric reporting and caution about potential trade‑offs in multivariate structure and rare categories.

Abstract

Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode-patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a $k$-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
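
To make the proximity filter concrete, the following is a minimal sketch of a nearest-neighbor check under the Heterogeneous Euclidean-Overlap Metric (HEOM), assuming the standard HEOM definition: range-normalized absolute differences for numeric attributes and a 0/1 overlap term for categorical ones. The function names and the `min_dist` parameter are hypothetical, and the rejection rule used here (drop a synthetic row whose nearest real row is closer than `min_dist`) is an illustrative simplification that may differ from the paper's $\tau_{\text{ANY}}$ criterion.

```python
import numpy as np

def heom_distances(x_num, x_cat, R_num, R_cat, num_range):
    """HEOM distance from one synthetic row to every real row."""
    # Numeric attributes: absolute difference normalized by the real-data range.
    d_num = np.abs(R_num - x_num) / num_range
    # Categorical attributes: overlap metric (0 if equal, 1 otherwise).
    d_cat = (R_cat != x_cat).astype(float)
    return np.sqrt((d_num ** 2).sum(axis=1) + (d_cat ** 2).sum(axis=1))

def proximity_filter(S_num, S_cat, R_num, R_cat, min_dist):
    """Boolean mask of synthetic rows to keep (assumed rule, see lead-in)."""
    num_range = R_num.max(axis=0) - R_num.min(axis=0)
    num_range[num_range == 0] = 1.0  # guard against constant columns
    keep = np.empty(len(S_num), dtype=bool)
    for i in range(len(S_num)):
        d = heom_distances(S_num[i], S_cat[i], R_num, R_cat, num_range)
        # Keep the row only if its nearest real neighbor is far enough away.
        keep[i] = d.min() >= min_dist
    return keep

# Toy data: two numeric columns and one categorical column.
R_num = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
R_cat = np.array([["a"], ["b"], ["a"]])
S_num = np.array([[1.0, 20.0], [0.5, 15.0], [2.0, 30.0]])
S_cat = np.array([["b"], ["b"], ["b"]])

keep = proximity_filter(S_num, S_cat, R_num, R_cat, min_dist=0.3)
print(keep)  # the first synthetic row is an exact copy of a real row
```

The sketch only flags rows. The abstract describes replacing rejected records, which would require drawing fresh samples from the generator and re-checking them; that step is omitted here.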

Paper Structure (107 sections, 45 equations, 20 figures, 50 tables, 4 algorithms)

This paper contains 107 sections, 45 equations, 20 figures, 50 tables, 4 algorithms.

Figures (20)

  • Figure 1: Iterative layer-frozen mode-patching procedure. Starting from an initial synthetic set $S$ drawn from the fixed generator $G_\theta^{(0)}$, the algorithm repeatedly detects categorical modes present in the real data $R$ but missing in $S$, fine-tunes a copy of $G_\theta^{(0)}$ on the corresponding slice $D_c$ with its lower layers frozen, and replaces over-represented synthetic rows with samples $S_c$ from the adapted generator. The loop stops once no real category is missing in $S$.
  • Figure 2: Schematic of the CTGAN architecture highlighting frozen layers (in red) during fine-tuning. The generator’s lower layers (up to the freezing point) remain fixed while only the later layers are retrained to produce a specific rare class. A similar freezing approach can be applied to the discriminator’s feature layers to maintain stability.
  • Figure 3: Schematic of the TVAE (tabular VAE) architecture with a freezing point between the encoder and decoder. In fine-tuning, the encoder and the first part of the decoder are frozen (red), and only the last layers of the decoder are adjusted. This focuses the adaptation on generating a missing category without disturbing the overall latent structure.
  • Figure 4: Baseline downstream utility of the unfiltered CTGAN and TVAE samples, evaluated by ROC-AUC under the TRTR and TSTR protocols. Columns correspond to generators (left: CTGAN, right: TVAE) and rows to datasets (top: Credit, middle: Cardio, bottom: Adult). In each panel, every marker denotes one classifier from the suite (see legend). The horizontal axis shows ROC-AUC when the model is trained and tested on real data (TRTR), while the vertical axis shows ROC-AUC when the same model is trained on synthetic data and tested on real data (TSTR). The dashed diagonal marks parity between TSTR and TRTR; points below (above) this line indicate loss (gain) in ROC-AUC when training on synthetic instead of real data.
  • Figure 5: Privacy–utility trade‑off under HEOM–kNN $\widehat{\varepsilon}_{\mathrm{ANY}}$ filtering. For each dataset–generator pair we post‑process $G$’s samples with Alg. \ref{alg:heom-any}, varying the target bound $\tau\equiv\tau_{\mathrm{ANY}}$ (smaller $\Rightarrow$ tighter privacy). The $y$‑axis reports the Jensen–Shannon (JS) divergence between real and synthetic categorical marginals. Thin colored curves are per‑attribute JS; the thick black curve is the mean across attributes; the gray band is $\pm 1$ s.d.; the dotted vertical line and black diamond mark the unfiltered baseline (no rejection). The annotated $\tau^\star$ in each panel is the value that minimizes the mean JS for that setting. Panels (A–F), following the order in the figure: Credit–CTGAN ($\tau^\star\!\approx\!0.2$), Credit–TVAE ($\tau^\star\!\approx\!0.4$), Adult–CTGAN ($\tau^\star\!\approx\!0.35$), Adult–TVAE ($\tau^\star\!\approx\!0.25$), Cardio–CTGAN ($\tau^\star\!\approx\!0.3$), and Cardio–TVAE ($\tau^\star\!\approx\!5\times10^{-3}$). Overall, enforcing a tighter $\tau$ can either improve (e.g., Cardio–TVAE) or degrade (e.g., Credit–TVAE) categorical utility, depending on the generator and dataset.
  • ...and 15 more figures
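
Two quantities recur in the captions above: the set of categorical modes present in the real data but missing from the synthetic data (Figure 1), and the Jensen–Shannon divergence between real and synthetic categorical marginals (Figure 5). The following is a minimal sketch of both for a single categorical column. The helper names `marginal` and `js_divergence` and the toy data are hypothetical, and the base-2 logarithm (which bounds the divergence in [0, 1]) is an assumption, since the captions do not state the base used.

```python
import numpy as np
from collections import Counter

def marginal(values, categories):
    """Empirical probability of each category, in the given order."""
    counts = Counter(values)
    p = np.array([counts.get(c, 0) for c in categories], dtype=float)
    return p / p.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed base)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy column: category "c" exists in the real data but not in the synthetic data.
real = ["a", "a", "b", "c"]
synth = ["a", "a", "a", "b"]

cats = sorted(set(real) | set(synth))
missing = sorted(set(real) - set(synth))  # modes a patching step would target
js = js_divergence(marginal(real, cats), marginal(synth, cats))

print(missing)
print(round(js, 4))
```

Averaging `js` over all categorical columns would give a per-dataset summary comparable in spirit to the mean curve described for Figure 5, though the paper's exact aggregation may differ.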