Privacy Re-identification Attacks on Tabular GANs
Abdallah Alshantti, Adil Rasheed, Frank Westad
TL;DR
This work addresses privacy risks in tabular data synthesis with GANs by introducing re-identification attacks that select synthetic samples likely memorised from training data and reconstructive attacks using evolutionary multi-objective optimisation. It defines three attacker access levels and evaluates selection and reconstruction approaches across four mixed-type tabular datasets using CTGAN, CTAB-GAN, and CasTGAN, with NSGA-II driving the reconstruction process and ASF-based MCDM selecting final outputs. Results show that higher attacker access increases leakage potential, and reconstruction via evolutionary optimisation can further tighten proximity to training data, though at the cost of diversity; differential privacy defences reduce leakage but degrade data utility. These findings highlight a tension between privacy and utility in tabular synthetic data and motivate the development of robust, utility-preserving protections beyond standard differential privacy.
Abstract
Generative models are subject to overfitting and may therefore leak sensitive information from the training data. In this work, we investigate the privacy risks that can arise from the use of generative adversarial networks (GANs) for creating tabular synthetic datasets. For this purpose, we analyse the effects of re-identification attacks on synthetic data, i.e., attacks which aim at selecting samples that are predicted to correspond to memorised training samples based on their proximity to the nearest synthetic records. We consider multiple settings in which attackers have different access levels or knowledge of the generative and predictive models, and assess which information is most useful for launching more successful re-identification attacks. We also consider the case in which re-identification attacks are formulated as reconstruction attacks, i.e., the attacker uses evolutionary multi-objective optimisation to perturb synthetic samples closer to the training space. The results indicate that attackers can indeed pose major privacy risks by selecting synthetic samples that are likely representative of memorised training samples. In addition, we observe that privacy threats increase considerably when the attacker has either knowledge of, or black-box access to, the generative models. We also find that reconstruction attacks through multi-objective optimisation further increase the risk of identifying confidential samples.
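To make the proximity-based selection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes purely numeric, already-scaled features and Euclidean distance, whereas the paper works with mixed-type data; the function names `select_candidates` and `distance_to_closest_record` are hypothetical. The first function ranks synthetic records by the distance to their nearest synthetic neighbour and keeps the closest ones as attack candidates; the second is the kind of check only an evaluator holding the training data could run to see how close those candidates really are.

```python
import numpy as np


def select_candidates(synthetic: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k synthetic rows closest to another synthetic row.

    Assumption (not from the paper): numeric features, Euclidean distance.
    """
    # Pairwise distances between all synthetic rows via broadcasting.
    diff = synthetic[:, None, :] - synthetic[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A row must not count as its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.min(axis=1)
    # Smallest nearest-neighbour distance first; stable sort keeps ties in row order.
    return np.argsort(nearest, kind="stable")[:k]


def distance_to_closest_record(candidates: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Evaluator-side check: distance from each candidate to its nearest training row."""
    diff = candidates[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
```

The pairwise distance matrix grows quadratically with the number of synthetic rows, so a sketch like this only suits small samples; a nearest-neighbour index would be the natural substitute at scale.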
