Generating Artificial Data for Private Deep Learning
Aleksei Triastcyn, Boi Faltings
TL;DR
This work tackles privacy risks in ML by proposing private data release: a GAN generates artificial data that preserve key statistical properties of the real data. A differentially private (DP) critic is added to the GAN to improve sample diversity and privacy, while an empirical privacy-estimation framework uses KL divergence and Chebyshev's inequality to bound the expected privacy loss post hoc. Experiments on MNIST, SVHN, and CelebA show that models trained on artificial data achieve accuracy competitive with both non-private baselines and DP-based model-release methods, and model-inversion attacks indicate a measurable reduction in information leakage. The approach enables flexible, scalable private data publishing and data pooling, though it provides empirical rather than worst-case DP guarantees and inherits the typical limitations of GANs. The results suggest a practical path toward privacy-preserving data sharing, data markets, and reproducible research with high-utility synthetic data and interpretable privacy bounds.
Abstract
In this paper, we propose generating artificial data that retain the statistical properties of real data as a means of providing privacy with respect to the original dataset. We use generative adversarial networks to draw privacy-preserving artificial data samples and derive an empirical method to assess the risk of information disclosure in a differential-privacy-like way. Our experiments show that we are able to generate artificial data of high quality and to successfully train and validate machine learning models on these data while limiting potential privacy loss.
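To make the two ingredients of the empirical privacy assessment more concrete, the following minimal sketch computes a KL divergence between two discrete distributions and a Chebyshev-style tail bound from a sample of divergence values. It is an illustration only, not the estimator derived in the paper: the function names `kl_divergence` and `chebyshev_tail_bound`, the smoothing constant `eps`, and the use of the sample mean and variance as plug-in estimates are all assumptions made for this example.

```python
import math
from statistics import mean, pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption of this sketch) that
    avoids division by zero when q has empty bins.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def chebyshev_tail_bound(samples, threshold):
    """Upper bound on P(X >= threshold) via Chebyshev's inequality.

    Uses P(|X - mu| >= k) <= sigma^2 / k^2 with k = threshold - mu, where mu
    and sigma^2 are replaced by the sample mean and variance (a plug-in
    assumption). Returns 1.0 when the bound is uninformative.
    """
    mu = mean(samples)
    var = pvariance(samples)
    k = threshold - mu
    if k <= 0:
        return 1.0
    return min(1.0, var / (k * k))


if __name__ == "__main__":
    # Hypothetical divergence values between output distributions on
    # neighbouring datasets; the numbers are made up for illustration.
    losses = [0.02, 0.05, 0.03, 0.04, 0.06]
    print(chebyshev_tail_bound(losses, threshold=0.2))
```

In this reading, the divergence plays the role of a per-comparison privacy loss and the Chebyshev bound limits the probability that the loss exceeds a chosen threshold, which is weaker than a worst-case DP guarantee because it depends on the sampled values.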
