Generating High-quality Privacy-preserving Synthetic Data
David Yavo, Richard Khoury, Christophe Pere, Sadoune Ait Kaci Azzou
TL;DR
This work presents a simple, model‑agnostic post‑processing pipeline for synthetic tabular data that targets two practical problems: categorical mode collapse and proximity‑based privacy leakage. It introduces a layer‑frozen mode‑patching step to restore missing categorical modes without retraining from scratch, and a HEOM–$k$NN privacy filter that rejects synthetic rows lying within real‑data neighborhoods, controlled by a threshold $\tau_{\text{ANY}}$. The framework is instantiated with CTGAN and TVAE across three public datasets (Credit, Cardio, Adult), and evaluated on fidelity (univariate/multivariate resemblance), utility (TSTR), and privacy (AIA, CAP, DCR, RPR) metrics. Across datasets, moderate $\tau_{\text{ANY}}$ values (roughly 0.2–0.3) tend to improve categorical and multivariate structure while keeping downstream predictive performance within ~1% of unfiltered baselines and improving distance‑based privacy indicators. The results offer practical guidance for applying post‑hoc repairs to synthetic tabular data and complement approaches that provide formal differential privacy guarantees, while highlighting the need for multi‑metric reporting and caution about potential trade‑offs in multivariate structure and rare categories.
Abstract
Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode-patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a $k$-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post-hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
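To make the distance-based filter more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard Heterogeneous Euclidean-Overlap Metric (range-normalized absolute difference for numeric columns, 0/1 mismatch for categorical columns), no missing values, and a rule that flags a synthetic row whenever any real row lies closer than the threshold, which is equivalent to a nearest-neighbor check with $k = 1$. The function names `heom_distances` and `privacy_filter_mask` and the parameter `tau` are hypothetical. The sketch uses unnormalized HEOM, so its `tau` is not necessarily on the same scale as the thresholds reported above, and it only flags rows instead of replacing them.

```python
import numpy as np

def heom_distances(synth_num, synth_cat, real_num, real_cat):
    """Pairwise HEOM distances with shape (n_synth, n_real).

    Assumption: inputs are 2-D arrays without missing values.
    """
    synth_num = np.asarray(synth_num, dtype=float)
    real_num = np.asarray(real_num, dtype=float)
    synth_cat = np.asarray(synth_cat)
    real_cat = np.asarray(real_cat)

    # Range-normalize numeric columns using the real data's min and max.
    lo = real_num.min(axis=0)
    span = real_num.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant columns
    s = (synth_num - lo) / span
    r = (real_num - lo) / span
    num_sq = ((s[:, None, :] - r[None, :, :]) ** 2).sum(axis=2)

    # Overlap distance: 0 if the categories match, 1 otherwise.
    cat_sq = (synth_cat[:, None, :] != real_cat[None, :, :]).sum(axis=2)

    return np.sqrt(num_sq + cat_sq)

def privacy_filter_mask(synth_num, synth_cat, real_num, real_cat, tau):
    """Boolean mask: True for synthetic rows whose nearest real row
    is at HEOM distance >= tau (rows to keep)."""
    d = heom_distances(synth_num, synth_cat, real_num, real_cat)
    return d.min(axis=1) >= tau

# Toy data (made up for illustration): one numeric and one categorical column.
real_num = np.array([[0.0], [10.0]])
real_cat = np.array([["a"], ["b"]])
synth_num = np.array([[0.5], [5.0]])
synth_cat = np.array([["a"], ["a"]])

keep = privacy_filter_mask(synth_num, synth_cat, real_num, real_cat, tau=0.2)
```

In this toy example the first synthetic row sits very close to the first real row and is flagged, while the second is kept. A full pipeline along the lines described above would then draw fresh samples from the generator to replace flagged rows and re-check them; that loop is omitted here. The dense pairwise computation is quadratic in memory, so larger datasets would need batching or an index structure.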
