Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy

Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro

TL;DR

This work analyzes how the definition of the data domain in DP synthetic tabular data pipelines influences privacy against membership inference attacks. It compares three domain extraction strategies (provided domain, direct input extraction, and DP-based extraction) across two DP generative models (PrivBayes and MST) and four DP discretization methods, evaluated with GroundHog attacks on the Wine dataset under various budgets ($\epsilon$ and $\delta$). The key finding is that direct domain extraction breaks end-to-end DP and enables strong leakage, while both provided-domain and DP-domain extraction substantially mitigate attacks. DP extraction remains protective even at large budgets ($\epsilon$ up to 100 or 1,000), though it may affect utility. The results suggest that many DP vulnerabilities in open-source implementations may stem from domain extraction choices rather than model weaknesses, underscoring the need for DP-compliant domain handling in libraries and further study of the utility-privacy implications of DP domain extraction.

Abstract

Privacy attacks, particularly membership inference attacks (MIAs), are widely used to assess the privacy of generative models for tabular synthetic data, including those with Differential Privacy (DP) guarantees. These attacks often exploit outliers, which are especially vulnerable due to their position at the boundaries of the data domain (e.g., at the minimum and maximum values). However, the role of data domain extraction in generative models and its impact on privacy attacks have been overlooked. In this paper, we examine three strategies for defining the data domain: assuming it is externally provided (ideally from public data), extracting it directly from the input data, and extracting it with DP mechanisms. Although the second approach is common in popular implementations and libraries, we show that it breaks end-to-end DP guarantees and leaves models vulnerable. Using a provided domain (if representative) is preferable, but extracting the domain with DP can also defend against popular MIAs, even at high privacy budgets.
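The three domain strategies named in the abstract can be contrasted on a single numeric column. The sketch below is illustrative only and is not the mechanism used in the paper: the function names, the coarse public range, the bin count, and the thresholding rule are all assumptions. The DP variant adds Laplace noise to a histogram over a coarse, data-independent grid (assuming add/remove-one neighbouring datasets, so each count has sensitivity 1) and keeps the outermost bins whose noisy count clears a fixed threshold.

```python
import math
import random


def provided_domain(lower, upper):
    """Strategy 1: bounds come from outside the data (e.g. public metadata)."""
    return float(lower), float(upper)


def direct_domain(values):
    """Strategy 2: bounds read straight from the data. Not DP: one outlier moves them."""
    return float(min(values)), float(max(values))


def dp_domain(values, epsilon, coarse_lower, coarse_upper, n_bins=64, rng=None):
    """Strategy 3 (hypothetical sketch): bounds from a Laplace-noised histogram.

    The coarse range and n_bins must be chosen without looking at the data.
    Floating-point Laplace sampling is not hardened; this is for illustration.
    """
    rng = rng or random.Random()
    width = (coarse_upper - coarse_lower) / n_bins

    # Clip every value into the coarse range and count per bin.
    counts = [0] * n_bins
    for v in values:
        clipped = min(max(v, coarse_lower), coarse_upper)
        idx = min(int((clipped - coarse_lower) / width), n_bins - 1)
        counts[idx] += 1

    # Difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
    noisy = [c + rng.expovariate(epsilon) - rng.expovariate(epsilon) for c in counts]

    # Data-independent threshold (assumed heuristic) so empty bins rarely pass.
    threshold = math.log(n_bins / 0.05) / epsilon
    occupied = [i for i, c in enumerate(noisy) if c > threshold]
    if not occupied:
        return float(coarse_lower), float(coarse_upper)
    return (coarse_lower + occupied[0] * width,
            coarse_lower + (occupied[-1] + 1) * width)


if __name__ == "__main__":
    column = [5.0] * 200
    print(direct_domain(column))            # bounds without the outlier
    print(direct_domain(column + [95.0]))   # a single record shifts the maximum
    print(dp_domain(column + [95.0], epsilon=1.0,
                    coarse_lower=0.0, coarse_upper=100.0))
```

With direct extraction, adding or removing the single outlier changes the reported maximum deterministically, which is the kind of signal a membership inference attack can pick up. In the DP variant, a lone record contributes one count that usually stays below the threshold, so the extracted bounds depend far less on any individual.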

Paper Structure

This paper contains 6 sections and 4 figures.

Figures (4)

  • Figure 1: Privacy leakage with provided domain and extracted domain (w/ and w/o DP) for the four DP discretizers ($\epsilon=1$) and two DP generative models ($\epsilon=1$) on a target record outside the domain of the remaining data.
  • Figure 2: Privacy leakage with provided domain and extracted domain (w/ and w/o DP) for the four DP discretizers ($\epsilon=1 \text{ or } 100$) and two DP generative models ($\epsilon=1 \text{ or } 100$) on a target record outside the domain of the remaining data.
  • Figure 3: Privacy leakage with provided domain and extracted domain (w/ and w/o DP) for the four DP discretizers ($\epsilon=1 \text{ or } 1{,}000$) and two DP generative models ($\epsilon=1 \text{ or } 1{,}000$) on a target record outside the domain of the remaining data.
  • Figure 4: Privacy leakage with provided domain and extracted domain (w/ and w/o DP) for the four DP discretizers ($\epsilon=1, 100 \text{ or } 1{,}000$) and two DP generative models ($\epsilon=1, 100 \text{ or } 1{,}000$) on a target record inside the domain of the remaining data.