PRIVET: Privacy Metric Based on Extreme Value Theory
Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane, Nicolas Bereux, Léo Planche, Guillaume Charpiat, Burak Yelmen, Flora Jay, Cyril Furtlehner
TL;DR
This work introduces PRIVET, an EVT-based privacy metric that uses the tails of nearest-neighbor distance distributions to detect memorization and privacy leakage in synthetic data. By fitting either Weibull or Gumbel tail models to NN distances and computing sample-level scores $\Delta\pi_r$ via $P_{N,M}(u,r)$, PRIVET provides interpretable per-sample leak flags and global leakage indices such as $NPL$. The authors validate PRIVET on genetic SNP data and image data, showing robust detection of leakage across high-dimensional and low-sample regimes and under various transformations, with favorable comparisons to existing dataset- and sample-level metrics. The method is modular, scalable, and domain-agnostic, enabling practical privacy auditing of synthetic data in biomedical and computer-vision contexts. Overall, PRIVET advances privacy evaluation by delivering both fine-grained (sample-level) and coarse-grained (dataset-level) assessments grounded in extreme value theory.
Abstract
Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed, or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods rely almost exclusively on global criteria to estimate the risk of privacy failure associated with a model, offering only quantitative, non-interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic, sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality and limited sample sizes, such as genetic data, and even in underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset-level and sample-level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations of existing computer vision embeddings in yielding perceptually meaningful distances when identifying near-duplicate samples.
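To make the idea of scoring synthetic samples through the tail of nearest-neighbor distances concrete, here is a minimal Python sketch. It is not the PRIVET algorithm and does not compute $P_{N,M}(u,r)$ or $\Delta\pi_r$. It assumes a simplified power-law lower tail, $P(D < r) \approx p_u (r/u)^{\kappa}$ for $r$ below a threshold $u$, which is the small-$r$ limit of a Weibull model, fitted with a Hill-type estimator. The function names (`nn_distance`, `fit_lower_tail`, `leak_flags`), the held-out reference set, the Euclidean distance, and the parameters `tail_frac` and `alpha` are all illustrative assumptions.

```python
import math
import random


def nn_distance(x, data):
    """Euclidean distance from point x to its nearest neighbor in data."""
    return min(math.dist(x, y) for y in data)


def fit_lower_tail(ref_dists, tail_frac=0.1):
    """Hill-type fit of P(D < r) ~= p_u * (r / u)**kappa for r below u.

    Assumes distinct, strictly positive distances (continuous data).
    """
    d = sorted(ref_dists)
    k = max(2, int(tail_frac * len(d)))  # number of tail points used
    u = d[k]                             # threshold: (k+1)-th smallest distance
    kappa = k / sum(math.log(u / t) for t in d[:k])
    return u, kappa, k / len(d)


def leak_flags(synth, train, holdout, alpha=0.01, tail_frac=0.1):
    """Flag synthetic samples that sit implausibly close to a training sample."""
    # Reference: how close unseen real samples typically get to the training set.
    ref = [nn_distance(h, train) for h in holdout]
    u, kappa, p_u = fit_lower_tail(ref, tail_frac)
    flags = []
    for s in synth:
        r = nn_distance(s, train)
        # Extrapolated tail probability; outside the fitted tail, never flag.
        p = p_u * (r / u) ** kappa if r < u else 1.0
        flags.append(p < alpha)
    return flags


def gauss_point(dim):
    """Hypothetical toy data: one standard Gaussian sample."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# Toy demo: 5 near-copies of training points mixed with 45 fresh samples.
random.seed(0)
dim, n, n_copies = 5, 200, 5
train = [gauss_point(dim) for _ in range(n)]
holdout = [gauss_point(dim) for _ in range(n)]
copies = [[v + random.gauss(0.0, 1e-3) for v in train[i]] for i in range(n_copies)]
fresh = [gauss_point(dim) for _ in range(45)]
flags = leak_flags(copies + fresh, train, holdout)
print("flagged copies:", sum(flags[:n_copies]), "of", n_copies)
print("flagged fresh samples:", sum(flags[n_copies:]), "of", len(fresh))
```

In this toy setup the near-copies fall far below the fitted threshold and are flagged, while fresh samples are flagged only occasionally, at a rate controlled by `alpha`. The sketch calibrates against a held-out set as one plausible design choice. The paper's actual calibration, tail models, and leakage indices may differ.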
