Synthetic Data, Similarity-based Privacy Metrics, and Regulatory (Non-)Compliance
Georgi Ganev
TL;DR
The paper critically examines similarity-based privacy metrics (SBPMs) for synthetic data, arguing that they do not guarantee regulatory compliance and fail to address worst-case threats such as singling out, linkability, or a motivated intruder. It defines SBPMs in terms of distances between training, test, and synthetic data and demonstrates, through counter-examples on a simple 2D Gaussian dataset, that synthetic data can pass SBPM privacy tests yet still leak sensitive information, with pass rates that are inconsistent across runs. The author analyzes theoretical gaps (no threat model, binary privacy interpretation, non-contrastive evaluation) and practical issues (discretization, train/test splits), and critiques three proposed countermeasures: training generators with differential privacy (DP), DP-ifying the metrics, and hiding the metrics. The paper advocates approaches built on formal privacy guarantees and adversarial risk assessment, while acknowledging that these come with challenges of their own, and highlights the importance of empirical privacy attacks for auditing and regulatory relevance.
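To make the idea of a distance-based metric concrete, the following is a minimal sketch of one possible SBPM on a 2D Gaussian dataset. It is an illustration under stated assumptions, not the paper's implementation: the metric shown is a "distance to closest record" style check, and the function names (`closest_distances`, `dcr_test`), the 5% quantile, the sample sizes, and the pass rule are all hypothetical choices made here.

```python
import numpy as np

def closest_distances(queries, reference):
    """Euclidean distance from each query row to its nearest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def dcr_test(train, test, synth, quantile=0.05):
    """Hypothetical pass rule (an assumption, not taken from the paper):
    pass if, at a low quantile, synthetic records are no closer to the
    training data than held-out test records are."""
    synth_q = np.quantile(closest_distances(synth, train), quantile)
    test_q = np.quantile(closest_distances(test, train), quantile)
    return bool(synth_q >= test_q), synth_q, test_q

rng = np.random.default_rng(0)
data = rng.normal(size=(400, 2))      # simple 2D Gaussian dataset
train, test = data[:200], data[200:]  # train/test split

# Stand-in "generator": fresh draws from the same Gaussian (illustrative only).
synth = rng.normal(size=(200, 2))
passed, synth_q, test_q = dcr_test(train, test, synth)

# Degenerate "generator" that copies the training data verbatim.
passed_copy, copy_q, _ = dcr_test(train, test, train.copy())

print("fresh draws pass:", passed, "| verbatim copy passes:", passed_copy)
```

A check of this kind returns a single pass/fail outcome that depends on the particular train/test split and the random draw. A verbatim copy fails it, but the paper's argument is that passing it does not rule out leakage.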
Abstract
In this paper, we argue that similarity-based privacy metrics cannot ensure regulatory compliance of synthetic data. Our analysis and counter-examples show that they do not protect against singling out and linkability and, among other fundamental issues, completely ignore the motivated intruder test.
