Synthetic Data Privacy Metrics
Amy Steier, Lipika Ramaswamy, Andre Manoel, Alexa Haushalter
TL;DR
This paper addresses the lack of standardized empirical privacy metrics for synthetic data by surveying a broad range of approaches, from classic privacy notions such as k-anonymity to adversarially motivated attacks such as membership inference attacks (MIAs) and attribute inference attacks (AIAs). It documents how distance-based metrics, namely distance to closest record (DCR) and nearest neighbor distance ratio (NNDR), and attack-based metrics (MIAs, AIAs) quantify leakage, and it discusses privacy-enhancing techniques including differential privacy, pseudonymization, privacy filters, and architecture-level protections. The work synthesizes the strengths and limitations of these metrics and methods and offers practical guidance on selecting metrics and applying privacy-preserving practices in synthetic data workflows. Overall, the paper aims to provide a comprehensive framework for evaluating and improving the privacy of synthetic data without sacrificing utility, helping researchers, practitioners, and policymakers share data responsibly.
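To make the distance-based metrics concrete, here is a minimal sketch of how DCR (distance to closest record) and NNDR (nearest neighbor distance ratio) might be computed for each synthetic row. It assumes purely numeric, already-scaled features, Euclidean distance, and at least two real records. The function name `dcr_and_nndr` is hypothetical, and this is an illustration rather than the paper's implementation.

```python
import numpy as np

def dcr_and_nndr(synthetic, real):
    """Return (dcr, nndr) arrays with one entry per synthetic row.

    Assumptions (not from the paper): numeric, pre-scaled features,
    Euclidean distance, and at least two rows in `real`.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    real = np.asarray(real, dtype=float)

    # Pairwise Euclidean distances, shape (n_synthetic, n_real).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Keep the two smallest distances for each synthetic row.
    two_nearest = np.sort(dists, axis=1)[:, :2]
    dcr = two_nearest[:, 0]      # distance to the closest real record
    second = two_nearest[:, 1]   # distance to the second-closest real record

    # NNDR = closest / second closest. When the second distance is zero
    # (duplicate real rows matched exactly), fall back to 1.0.
    nndr = np.divide(dcr, second, out=np.ones_like(dcr), where=second > 0)
    return dcr, nndr

# Toy data: the first synthetic row copies a real record exactly.
real_rows = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
synthetic_rows = [[0.0, 0.0], [3.0, 0.0]]
dcr, nndr = dcr_and_nndr(synthetic_rows, real_rows)
```

In this toy example the first synthetic row has a DCR of 0 because it duplicates a real record, which is the kind of leakage these metrics are meant to flag. Small DCR or NNDR values indicate synthetic rows that sit unusually close to a specific real record.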
Abstract
Recent advancements in generative AI have made it possible to create synthetic datasets that can be as accurate as real-world data for training AI models, powering statistical insights, and fostering collaboration with sensitive datasets while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in the process. However, although new privacy metrics are published frequently, there is currently no standardization. In this paper, we review the pros and cons of popular metrics, including those that simulate adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g., differential privacy).
