Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework
Milton Nicolás Plasencia Palacios, Alexander Boudewijn, Sebastiano Saccani, Andrea Filippo Ferraris, Diana Sofronieva, Giuseppe D'Acquisto, Filiberto Brozzetti, Daniele Panfilo, Luca Bortolussi
TL;DR
The work tackles the lack of standardized, GDPR-aligned privacy evaluation for synthetic tabular data by proposing an empirical benchmarking framework that injects controlled risk to evaluate privacy metrics. It surveys existing privacy quantification methods and evaluates them under a no-box threat model on public datasets, revealing correlations among metrics and data-dependent behavior. The findings suggest that statistical indicators offer robust, efficient guidance, while attack-based measures align with specific threat models but depend on data properties such as outliers and attribute types. The paper argues for a multi-metric, dataset-specific approach to determining anonymization status, bridging legal concepts with practical privacy assessment for synthetic data as a privacy-enhancing technology (PET).
Abstract
Synthetic data generation is gaining traction as a privacy-enhancing technology (PET). When properly generated, synthetic data preserve the analytic utility of real data while avoiding the retention of information that would allow the identification of specific individuals. However, the concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data. In this paper, we propose a framework to empirically assess the efficacy of tabular synthetic data privacy quantification methods through controlled, deliberate risk insertion. To demonstrate this framework, we survey existing approaches to synthetic data privacy quantification and the related legal theory. We then apply the framework to the main privacy quantification methods with no-box threat models on publicly available datasets.
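As a rough illustration of what controlled risk insertion could look like in practice, the sketch below leaks a known fraction of real records into a synthetic table and checks whether a simple privacy indicator tracks that fraction. Everything in it is an assumption made for illustration rather than the procedure used in the paper: the function names `inject_risk` and `exact_match_share`, the choice of exact record copies as the inserted risk, the exact-match indicator, and the random Gaussian data standing in for real and synthetic tables.

```python
import numpy as np


def inject_risk(real, synthetic, fraction, rng):
    """Hypothetical risk insertion: overwrite a fraction of synthetic rows
    with exact copies of distinct real rows."""
    leaked = synthetic.copy()
    n_leak = int(round(fraction * len(synthetic)))
    syn_idx = rng.choice(len(synthetic), size=n_leak, replace=False)
    real_idx = rng.choice(len(real), size=n_leak, replace=False)
    leaked[syn_idx] = real[real_idx]
    return leaked


def exact_match_share(real, synthetic):
    """Illustrative indicator: share of synthetic rows whose closest real
    row lies at Euclidean distance zero."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    closest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return float((closest == 0).mean())


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))        # stand-in for a real numeric table
synthetic = rng.normal(size=(200, 4))   # stand-in for a generator's output

# Sweep the inserted risk level and record the indicator's response.
scores = {
    fraction: exact_match_share(real, inject_risk(real, synthetic, fraction, rng))
    for fraction in (0.0, 0.1, 0.5)
}
```

An indicator that behaves well under this kind of test should increase as the inserted fraction grows. Under the same assumptions, other metrics could be substituted for `exact_match_share` and compared on the same sweep.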
