Synthesizing Realistic Test Data without Breaking Privacy

Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller

TL;DR

The paper tackles the challenge of producing synthetic data that preserves the statistical properties of real data without exposing private information. It replaces the GAN generator with a grammar-based fuzzer (Fandango) guided by a discriminator trained on private data, enabling iterative refinement toward the original distribution while limiting data leakage. Through four tabular datasets, the approach demonstrates feasible generation of good samples, competitive utility for downstream tasks, and a measured balance between resemblance and privacy as quantified by distributional metrics like the Wasserstein distance. The method offers a low-resource alternative to GAN-based synthetic data, with practical implications for privacy-preserving testing and evaluation in sensitive domains, and outlines concrete paths for broader validation and enhancement.

Abstract

There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, while leveraging the original data only indirectly in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
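
The generate-then-discriminate loop described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation and it does not use Fandango's API: the range-based sampler stands in for the grammar-based fuzzer, the Gaussian closeness score stands in for a trained discriminator, and every name, range, threshold, and data row below is a hypothetical chosen for illustration. A score built from column means and spreads only rewards closeness to the average row, so the sketch shows the control flow rather than real distribution matching.

```python
import math
import random
import statistics

# Hypothetical constraints standing in for the input specification.
AGE_RANGE = (18, 90)
BMI_RANGE = (15.0, 50.0)

# Made-up "private" rows (age, bmi); only the discriminator ever sees them.
PRIVATE_ROWS = [(23, 21.4), (35, 27.9), (41, 30.2), (52, 26.1),
                (29, 24.8), (47, 33.5), (60, 28.7), (38, 25.3)]


def generate_sample(rng):
    """Produce one random row that satisfies the range constraints."""
    return (rng.randint(*AGE_RANGE), round(rng.uniform(*BMI_RANGE), 1))


def clamp(value, bounds):
    """Force a value back into its allowed range."""
    return min(max(value, bounds[0]), bounds[1])


def mutate(sample, rng):
    """Perturb a row slightly while keeping it inside the constraints."""
    age, bmi = sample
    age = clamp(age + rng.randint(-3, 3), AGE_RANGE)
    bmi = round(clamp(bmi + rng.uniform(-1.5, 1.5), BMI_RANGE), 1)
    return (age, bmi)


def make_discriminator(private_rows):
    """Toy stand-in for a trained discriminator.

    Returns a function scoring a row in (0, 1]; 1.0 means the row sits
    exactly at the private column means.
    """
    columns = list(zip(*private_rows))
    means = [statistics.mean(c) for c in columns]
    spreads = [statistics.pstdev(c) for c in columns]

    def score(row):
        z2 = sum(((v - m) / s) ** 2 for v, m, s in zip(row, means, spreads))
        return math.exp(-0.5 * z2)

    return score


def synthesize(discriminator, n_good, rng, threshold=0.5,
               pop_size=50, max_rounds=200):
    """Collect rows the discriminator rates as good, evolving the rest."""
    population = [generate_sample(rng) for _ in range(pop_size)]
    good = []
    for _ in range(max_rounds):
        ranked = sorted(population, key=discriminator, reverse=True)
        good.extend(r for r in ranked if discriminator(r) >= threshold)
        if len(good) >= n_good:
            break
        # Evolve: the best-scoring half become parents of the next round.
        parents = ranked[: pop_size // 2]
        population = [mutate(rng.choice(parents), rng) for _ in range(pop_size)]
    return good[:n_good]


discriminator = make_discriminator(PRIVATE_ROWS)
synthetic = synthesize(discriminator, n_good=20, rng=random.Random(0))
```

Note that the generation loop never reads `PRIVATE_ROWS` directly; it only observes discriminator scores, which mirrors the "only indirectly leveraging the original data" idea in the abstract.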

Paper Structure

This paper contains 29 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Example CSV grammar and constraints.
  • Figure 2: Overview of the approach.
  • Figure 3: Comparison of the good samples collection rate across different datasets.
  • Figure 4: Good samples for the insurance dataset, with discriminator retraining. Each orange vertical dotted line represents a new iteration, with a newly trained discriminator.