Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework

Milton Nicolás Plasencia Palacios, Alexander Boudewijn, Sebastiano Saccani, Andrea Filippo Ferraris, Diana Sofronieva, Giuseppe D'Acquisto, Filiberto Brozzetti, Daniele Panfilo, Luca Bortolussi

TL;DR

The work tackles the lack of standardized, GDPR-aligned privacy evaluation for synthetic tabular data by proposing an empirical benchmarking framework that injects controlled risk to evaluate privacy metrics. It surveys existing privacy quantification methods and applies a no-box threat-model evaluation across public datasets, revealing correlations among metrics and data-dependent behavior. The findings suggest that statistical indicators offer robust, efficient guidance while attack-based measures align with specific threat models but depend on data properties such as outliers and attribute types. The paper argues for a multi-metric, dataset-specific approach to determine anonymization status, bridging legal concepts with practical privacy assessment for synthetic data as a PET.

Abstract

Synthetic data generation is gaining traction as a privacy enhancing technology (PET). When properly generated, synthetic data preserve the analytic utility of real data while avoiding the retention of information that would allow the identification of specific individuals. However, the concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data. In this paper, we propose a framework to empirically assess the efficacy of tabular synthetic data privacy quantification methods through controlled, deliberate risk insertion. To demonstrate this framework, we survey existing approaches to synthetic data privacy quantification and the related legal theory. We then apply the framework to the main privacy quantification methods with no-box threat models on publicly available datasets.
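The idea of controlled risk insertion can be illustrated with a small sketch. It assumes a simplified "leaky" risk model in which a fraction `f_l` of the synthetic rows is replaced by verbatim copies of real rows, with `f_l = 0` meaning no added risk and `f_l = 1` meaning maximum risk. The helper names `inject_leak` and `exact_copy_share`, the random toy data, and the exact-copy indicator are hypothetical choices made for illustration; they are not the paper's implementation or one of its evaluated metrics.

```python
import numpy as np


def inject_leak(real, synth, f_l, rng):
    """Replace a fraction f_l of synthetic rows with verbatim copies of real rows.

    Assumed, simplified version of a leaky risk model (illustrative only).
    """
    n_leak = int(round(f_l * len(synth)))
    out = synth.copy()
    if n_leak > 0:
        # Positions in the synthetic table that will be overwritten.
        synth_idx = rng.choice(len(synth), size=n_leak, replace=False)
        # Real rows to leak; sample with replacement only if there are too few.
        real_idx = rng.choice(len(real), size=n_leak, replace=len(real) < n_leak)
        out[synth_idx] = real[real_idx]
    return out


def exact_copy_share(real, synth):
    """Share of synthetic rows whose nearest real row lies at distance zero."""
    # Pairwise Euclidean distances, shape (n_synth, n_real).
    dists = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1) == 0.0))


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))   # stand-in for a real numeric table
synth = rng.normal(size=(200, 4))  # stand-in for a generator's output

for f_l in (0.0, 0.5, 1.0):
    leaked = inject_leak(real, synth, f_l, rng)
    print(f_l, exact_copy_share(real, leaked))
```

A privacy metric that behaves sensibly under this kind of benchmark should increase as `f_l` increases. The toy indicator here does so by construction, which is what makes it usable as a sanity check for the benchmarking loop itself.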

Paper Structure

This paper contains 60 sections, 10 equations, 27 figures, 3 tables.

Figures (27)

  • Figure 1: Overview of synthetic data privacy quantification methods
  • Figure 2: Methods for quantifying the degree of privacy of synthetic data used in this study. Risk: risk measured or controlled; Aux: auxiliary information; MIA: membership inference attack; AIA: attribute inference attack; disc.: disclosure; SO: singling out; Link: linkability; ML: machine learning; IoS: inference-on-synthetic; LN: local neighborhood. Only the attribute disclosure attacks require access to auxiliary information. NB: examples are for reference only; they may implement the same attack methods and mechanisms differently from our implementations.
  • Figure 3: Results of the leaky and overfitting risk models. RTF: RealTabFormer; O: outlier; D: distance; ML: machine learning. By "no risk", we indicate that no risk was deliberately added, i.e. $f_l=0$ for the leaky risk model; $f_o=1$ for the overfitting risk model; for the DP risk model, we equate "no risk" to a privacy budget of $\varepsilon = 0$. Similarly, "max risk" refers to $f_l=1$; $f_o=2$; and $\varepsilon=100$. We use the asterisk (*) to denote the maximum risk, which exceeds any risk achieved with the previous values of $f_l$, $f_o$, or $\varepsilon$.
  • Figure 4: Risk assessment methods evaluated using the leaky risk model
  • Figure 5: Risk assessment methods evaluated using the overfit risk model - RTF (Solatorio)
  • ...and 22 more figures

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 4.1