Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets

Milton Nicolás Plasencia Palacios, Sebastiano Saccani, Gabriele Sgroi, Alexander Boudewijn, Luca Bortolussi

TL;DR

This work addresses privacy evaluation for tabular synthetic data by introducing a contrastive-learning embedding that yields a representative distance space in which standard distance-based privacy indicators (SRD/RRD) and empirical attack-based metrics can be applied consistently across heterogeneous attributes. The authors show that distance-to-closest-record (DCR) and derived privacy scores, when computed in the contrastive embedding space, reproduce privacy signals comparable to attack-based measures while offering greater efficiency and lower variance. Across the Adult, Texas, and Census datasets, the embedding-enhanced DCR often matches or exceeds the performance of explicit attacks and is substantially faster to compute, though dataset-specific behavior (e.g., embedded-space failures on certain datasets) shows that the benefits of embedding are data-dependent. The study suggests practical pathways for scalable privacy assessment of synthetic tabular data and outlines future work to refine when and how embeddings improve privacy metrics, including integration with other anomaly/outlier detection methods and evaluation on larger-scale data.

Abstract

Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance. When using synthetic data in practical applications, it is important to provide protection guarantees. In the literature, two families of approaches are proposed for tabular data. On the one hand, similarity-based methods aim to measure the level of similarity between training and synthetic data. Indeed, a privacy breach can occur if the generated data is consistently too similar or even identical to the training data. On the other hand, attack-based methods conduct deliberate attacks on synthetic datasets. The success rates of these attacks reveal how secure the synthetic datasets are. In this paper, we introduce a contrastive method that improves privacy assessment of synthetic datasets by embedding the data in a more representative space. This overcomes obstacles surrounding the multitude of data types and attributes. It also makes it possible to use intuitive distance metrics both for similarity measurements and as an attack vector. In a series of experiments with publicly available datasets, we compare the performance of similarity-based and attack-based methods, both with and without the contrastive learning-based embeddings. Our results show that relatively efficient, easy-to-implement privacy metrics can perform as well as more advanced metrics that explicitly model the conditions for privacy referred to by the GDPR.
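To make the similarity-based idea concrete, the following is a minimal sketch of a distance-to-closest-record computation. It assumes the records have already been mapped to numeric vectors (for example by an embedding model) and uses Euclidean distance; the function name `distance_to_closest_record` and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def distance_to_closest_record(synthetic, training):
    """For each synthetic row, return the Euclidean distance to its nearest training row."""
    synthetic = np.asarray(synthetic, dtype=float)
    training = np.asarray(training, dtype=float)
    # Pairwise differences, shape (n_synthetic, n_training, n_features).
    diffs = synthetic[:, None, :] - training[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Keep only the distance to the closest training record.
    return dists.min(axis=1)

# Toy vectors standing in for embedded records (hypothetical values).
training = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
synthetic = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 3.0]])

dcr = distance_to_closest_record(synthetic, training)
# A distance of zero means the synthetic record is identical to a training record.
exact_copies = int(np.sum(dcr == 0.0))
print(dcr, exact_copies)
```

Small distances, and zeros in particular, flag synthetic records that are too similar or identical to training records. The brute-force pairwise computation is chosen for readability; for large datasets a nearest-neighbour index would likely be preferable.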

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Contrastive learning approach
  • Figure 2: Leaky evaluation, where $\varepsilon$ denotes noise addition
  • Figure 3: Leaky evaluation results with 95% confidence interval error bars, by parameterization and dataset. The tuples in the labels of the $y$-axes indicate parameter values of $(\sigma,\lambda,p)$.
  • Figure 4: DCR and Anonymeter results with REaLTabFormer synthesis. Confidence intervals for DCR are computed via bootstrapping with $n=1000$.
  • Figure 5: Validation loss and accuracy for different embedding dimensions
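
The bootstrapped confidence intervals mentioned in the Figure 4 caption could be obtained along the following lines. This is only a sketch under stated assumptions: it reads $n=1000$ as the number of bootstrap resamples, takes the mean DCR as the statistic, and uses the percentile method. None of these choices is confirmed by the text above, and the function name `bootstrap_ci` and the input values are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Draw n_resamples resamples with replacement, each the size of the input.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides of the resampled means.
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return float(lower), float(upper)

# Hypothetical DCR values for a handful of synthetic records.
dcr_values = np.array([0.0, 1.0, 3.0, 0.5, 2.0, 1.5])
lower, upper = bootstrap_ci(dcr_values)
print(lower, upper)
```

The fixed seed only makes the sketch reproducible; in practice the interval width depends on the number of records and the spread of the distances.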