Synthetic Data, Similarity-based Privacy Metrics, and Regulatory (Non-)Compliance

Georgi Ganev

TL;DR

The paper critically examines similarity-based privacy metrics (SBPMs) for synthetic data, arguing that they do not guarantee regulatory compliance and fail to address worst-case threats such as singling out, linkability, and the motivated intruder. It defines SBPMs in terms of distances between training, test, and synthetic data and demonstrates, through counter-examples on a simple 2d Gaussian dataset, that synthetic data can pass SBPM privacy tests yet still leak sensitive information, with pass rates that are inconsistent across runs. It analyzes theoretical gaps (no threat model, binary privacy interpretation, non-contrastive evaluation) and practical issues (discretization, train/test splits), and critiques three proposed countermeasures (DP-trained generators, DP-ifying metrics, and hiding metrics). The paper advocates approaches built on formal privacy guarantees and adversarial risk assessment, while acknowledging that these have challenges of their own, and highlights the importance of empirical privacy attacks for auditing and regulatory relevance.
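To make the idea of a distance-based test concrete, the following is a minimal sketch of one possible SBPM-style check. It is an illustration only, not the paper's definition: the function names, the choice of Euclidean nearest-neighbor distance, and the 5th-percentile pass rule are all assumptions introduced here.

```python
import numpy as np


def nearest_distances(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def sbpm_passes(train, test, synth, percentile=5.0):
    """Hypothetical pass/fail rule (an assumption, not the paper's metric).

    The synthetic data "passes" if its low-percentile distance to the
    training data is at least as large as that of the held-out test data.
    """
    d_synth = nearest_distances(synth, train)
    d_test = nearest_distances(test, train)
    return bool(np.percentile(d_synth, percentile) >= np.percentile(d_test, percentile))


# Toy 2d Gaussian data, loosely mirroring the setting named in the TL;DR.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
test = rng.normal(size=(200, 2))

# A "generator" that simply copies the training data has zero distance
# to every training record, so it fails this check.
copied = train.copy()
print(sbpm_passes(train, test, copied))  # False
```

The output is a single pass/fail bit with no threat model behind it, which is the kind of binary interpretation the paper criticizes. Passing such a check says nothing about what a motivated adversary could infer.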

Abstract

In this paper, we argue that similarity-based privacy metrics cannot ensure regulatory compliance of synthetic data. Our analysis and counter-examples show that they do not protect against singling out and linkability and, among other fundamental issues, completely ignore the motivated intruder test.

Paper Structure

This paper contains 6 sections and 2 figures.

Figures (2)

  • Figure 1: Data flow overview.
  • Figure 2: 2d Gauss data counter-examples.