Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data

Florent Guépin, Matthieu Meeus, Ana-Maria Cretu, Yves-Alexandre de Montjoye

TL;DR

This work examines the privacy risks of releasing synthetic tabular data by removing the auxiliary-data assumption commonly made in membership inference attacks (MIAs). It develops three attack scenarios, (S1) Black-box, (S2) Published, and (S3) Upper-bound, and shows that MIAs remain effective using only synthetic data across two real datasets and two generators, with average accuracies of about $65.5\%$ (S1) and $62.8\%$ (S2) and an upper bound of $85.8\%$. It also identifies a double counting issue that can inflate attack performance and formalizes an upper bound under synthetic-only information, suggesting that future attacks could outperform traditional auxiliary-data-based attacks. The findings challenge standard privacy audit practices for synthetic data and motivate the development of defenses, including differentially private synthetic data generation, to mitigate realistic leakage. Overall, the paper shows that assuming the attacker needs auxiliary data is not a safe basis for assessing synthetic data privacy in practice, which underlines the need for robust privacy-preserving generation methods and rigorous threat modeling.

Abstract

Synthetic data is emerging as one of the most promising solutions to share individual-level data while safeguarding privacy. While membership inference attacks (MIAs) based on shadow modeling have become the standard to evaluate the privacy of synthetic data, they currently assume that the attacker has access to an auxiliary dataset sampled from a distribution similar to that of the training dataset. This is often seen as a very strong assumption in practice, especially as the main proposed use cases for synthetic tabular data (e.g. medical data, financial transactions) are very specific and have no reference datasets directly available. Here we show how this assumption can be removed, allowing MIAs to be performed using only the synthetic data. Specifically, we develop three different scenarios: (S1) black-box access to the generator, (S2) access only to the released synthetic dataset, and (S3) a theoretical setup serving as an upper bound for the attack performance using only synthetic data. Our results show that MIAs are still successful across two real-world datasets and two synthetic data generators. These results show how the strong hypothesis made when auditing synthetic data releases, namely access to an auxiliary dataset, can be relaxed, making the attacks more realistic in practice.
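
As a rough illustration of scenario (S2), the sketch below runs a shadow-modeling MIA in which the released synthetic dataset is the attacker's only data source. It is a minimal sketch under stated assumptions, not the paper's implementation: `toy_generator` is a hypothetical stand-in for a real generator such as Synthpop or BayNet, the single counting query in `query_feature` is a simplified stand-in for a query-based feature set, and the midpoint threshold replaces a trained meta-classifier. All function names and parameters are illustrative. The sketch makes no attempt to correct for the double counting issue mentioned in the TL;DR.

```python
import random

def toy_generator(train, n_out, rng):
    """Toy stand-in generator (an assumption, NOT Synthpop or BayNet):
    bootstrap a row, then resample one attribute from its empirical marginal."""
    out = []
    for _ in range(n_out):
        row = list(rng.choice(train))
        j = rng.randrange(len(row))
        row[j] = rng.choice(train)[j]  # draw attribute j from another training row
        out.append(tuple(row))
    return out

def query_feature(synthetic, target):
    """One counting query: share of synthetic rows identical to the target."""
    return sum(r == target for r in synthetic) / len(synthetic)

def fit_threshold(released, target, n_shadows, n_train, rng):
    """Shadow modeling where shadow training sets are drawn from the
    released synthetic data instead of an auxiliary dataset."""
    feats = {True: [], False: []}
    for i in range(n_shadows):
        member = (i % 2 == 0)  # alternate "target in" / "target out" shadows
        shadow_train = rng.sample(released, n_train - 1)
        shadow_train.append(target if member else rng.choice(released))
        shadow_syn = toy_generator(shadow_train, len(released), rng)
        feats[member].append(query_feature(shadow_syn, target))
    mean_in = sum(feats[True]) / len(feats[True])
    mean_out = sum(feats[False]) / len(feats[False])
    # Midpoint rule used here in place of a trained meta-classifier.
    return (mean_in + mean_out) / 2

def predict_member(released, target, threshold):
    """Predict 'member' if the released data matches the target more often
    than the shadow-derived threshold."""
    return query_feature(released, target) > threshold

def demo(seed=0):
    """Hypothetical end-to-end run on random categorical data."""
    rng = random.Random(seed)
    private = [tuple(rng.randrange(4) for _ in range(3)) for _ in range(200)]
    target = private[0]  # a true member of the private training set
    released = toy_generator(private, 200, rng)  # all the attacker sees
    threshold = fit_threshold(released, target, n_shadows=40, n_train=100, rng=rng)
    return predict_member(released, target, threshold)
```

Calling `demo()` returns the attacker's membership guess for one target record. A single guess says little on its own; estimating attack accuracy would require repeating the experiment over many targets, both members and non-members.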

Paper Structure (24 sections, 4 figures, 3 tables)

This paper contains 24 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of the shadow modeling technique
  • Figure 2: Comparison of MIA accuracy for the query-based attack method across the four scenarios (S0, S1, S2 and S3), for both generators, Synthpop and BayNet. Panel (a) shows results for UK Census, while panel (b) shows results for Adult.
  • Figure 3: Comparison of MIA accuracy for the target attention attack method across the four scenarios (S0, S1, S2 and S3), for both generators, Synthpop and BayNet. Panel (a) shows results for UK Census, while panel (b) shows results for Adult.
  • Figure 4: Mean and standard deviation of MIA accuracy for scenario (S1) Black-box for a varying number $m$ of synthetic records available to the attacker. Results are for BayNet and the query-based attack on (a) UK Census and (b) Adult.

Theorems & Definitions (2)

  • Definition
  • Definition