Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy
Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro
TL;DR
This work analyzes how the definition of the data domain in DP synthetic tabular data pipelines influences privacy against membership inference attacks. It compares three domain extraction strategies (a provided domain, direct extraction from the input data, and DP-based extraction) across two DP generative models (PrivBayes and MST) and four DP discretization methods, evaluated with GroundHog attacks on the Wine dataset under various privacy budgets ($\epsilon$ and $\delta$). The key finding is that direct domain extraction breaks end-to-end DP and enables strong leakage, while both provided-domain and DP-based extraction substantially mitigate the attacks; DP-based extraction remains protective even at large budgets ($\epsilon$ up to 100 or 1000), though it may affect utility. The results suggest that many DP vulnerabilities in open-source implementations may stem from domain extraction choices rather than model weaknesses, underscoring the need for DP-compliant domain handling in libraries and for further study of the utility-privacy implications of DP domain extraction.
Abstract
Privacy attacks, particularly membership inference attacks (MIAs), are widely used to assess the privacy of generative models for tabular synthetic data, including those with Differential Privacy (DP) guarantees. These attacks often exploit outliers, which are especially vulnerable due to their position at the boundaries of the data domain (e.g., at the minimum and maximum values). However, the role of data domain extraction in generative models and its impact on privacy attacks have been overlooked. In this paper, we examine three strategies for defining the data domain: assuming it is externally provided (ideally from public data), extracting it directly from the input data, and extracting it with DP mechanisms. Although the second approach is common in popular implementations and libraries, we show that it breaks end-to-end DP guarantees and leaves models vulnerable. Using a provided domain (if representative) is preferable, but extracting the domain with DP can also defend against popular MIAs, even at high privacy budgets.
