Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data

Florent Guépin; Matthieu Meeus; Ana-Maria Cretu; Yves-Alexandre de Montjoye

TL;DR

This work studies the privacy risk of releasing synthetic tabular data when the attacker has no auxiliary dataset, an assumption that membership inference attacks (MIAs) commonly rely on. It defines three synthetic-only scenarios: (S1) Black-box, (S2) Published, and (S3) Upper bound. Across two real-world datasets and two generators, MIAs remain effective using only synthetic data, with average accuracies of about $65.5\%$ (S1) and $62.8\%$ (S2) and an upper bound of $85.8\%$. The work also identifies a double-counting issue that can inflate attacks and formalizes an upper bound under synthetic-only information, suggesting that future attacks could outperform traditional attacks based on auxiliary data. The findings challenge standard privacy-audit practice for synthetic data and motivate defenses, including differentially private synthetic data generation, against realistic leakage. Overall, the paper shows that assuming access to auxiliary data is neither safe nor sufficient when assessing synthetic data privacy in practice, which calls for robust privacy-preserving generation methods and rigorous threat modeling.

Abstract

Synthetic data is emerging as one of the most promising solutions for sharing individual-level data while safeguarding privacy. While membership inference attacks (MIAs) based on shadow modeling have become the standard for evaluating the privacy of synthetic data, they currently assume that the attacker has access to an auxiliary dataset sampled from a distribution similar to that of the training dataset. This is often seen as a very strong assumption in practice, especially as the main proposed use cases for synthetic tabular data (e.g., medical data, financial transactions) are very specific and have no reference datasets directly available. Here we show how this assumption can be removed, allowing MIAs to be performed using only the synthetic data. Specifically, we develop three scenarios: (S1) black-box access to the generator, (S2) access only to the released synthetic dataset, and (S3) a theoretical setup serving as an upper bound on attack performance using only synthetic data. Our results show that MIAs are still successful across two real-world datasets and two synthetic data generators. These results show how the strong hypothesis made when auditing synthetic data releases, namely access to an auxiliary dataset, can be relaxed, making the attacks more realistic in practice.
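To make the synthetic-only idea concrete, the following is a minimal sketch of a shadow-modeling attack in the spirit of scenario (S2), where the released synthetic dataset stands in for the auxiliary dataset. Everything in it is an illustrative assumption rather than the paper's implementation: the toy independent-marginals generator (`fit_marginals`, `sample_synthetic`) replaces Synthpop and BayNet, the single exact-match `query_feature` replaces the paper's feature sets, and a nearest-centroid rule replaces the meta-classifier.

```python
import random
from collections import Counter

def fit_marginals(records):
    """Toy 'generator' training (assumption): per-column value counts."""
    n_cols = len(records[0])
    return [Counter(r[j] for r in records) for j in range(n_cols)]

def sample_synthetic(marginals, m, rng):
    """Draw m synthetic records, sampling every column independently."""
    cols = []
    for counts in marginals:
        values = list(counts.keys())
        weights = list(counts.values())
        cols.append(rng.choices(values, weights=weights, k=m))
    return [tuple(col[i] for col in cols) for i in range(m)]

def query_feature(synthetic, target):
    """Single query-style feature: fraction of records equal to the target."""
    return sum(1 for r in synthetic if r == target) / len(synthetic)

def synthetic_only_mia(released, target, n_shadows=200, shadow_size=50, seed=0):
    """Guess whether `target` was in the training data, using only `released`.

    Shadow training sets are sampled from the released synthetic data itself;
    half of them additionally contain the target record.
    """
    rng = random.Random(seed)
    feats = {True: [], False: []}
    for i in range(n_shadows):
        is_member = (i % 2 == 0)
        shadow = rng.sample(released, shadow_size - 1)
        # IN shadows get the target; OUT shadows get one more synthetic record.
        shadow.append(target if is_member else rng.choice(released))
        synth = sample_synthetic(fit_marginals(shadow), len(released), rng)
        feats[is_member].append(query_feature(synth, target))
    mean_in = sum(feats[True]) / len(feats[True])
    mean_out = sum(feats[False]) / len(feats[False])
    # Nearest-centroid stand-in for the meta-classifier (assumption).
    score = query_feature(released, target)
    guess = abs(score - mean_in) <= abs(score - mean_out)
    return guess, score, (mean_in, mean_out)

# Hypothetical data: 200 released synthetic records with two categorical columns.
demo_rng = random.Random(1)
released = [(demo_rng.randint(0, 2), demo_rng.randint(0, 2)) for _ in range(200)]
print(synthetic_only_mia(released, target=(0, 1)))
```

Because the shadow training sets are drawn from data that may already carry a trace of the target, the IN and OUT feature distributions can overlap more than they would with a clean auxiliary dataset. This toy code makes no attempt to correct for that.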

Paper Structure (24 sections, 4 figures, 3 tables)

Introduction
Background and Related Work
Synthetic data generation
Membership inference attacks against synthetic tabular data
Attack scenarios
(S0) Auxiliary
(S1) Black box
(S2) Published
(S3) Upper bound
Experimental Setup
Synthetic data generators
Real world datasets
Meta-classifier methods
Parameters of the attack
Results
...and 9 more sections

Figures (4)

Figure 1: Illustration of the shadow modeling technique
Figure 2: Comparison of MIA accuracy for the query-based attack method across the four scenarios (S0, S1, S2 and S3), for both generators, Synthpop and BayNet. Panel (a) shows results for UK Census; panel (b) shows results for Adult.
Figure 3: Comparison of MIA accuracy for the target attention attack method across the four scenarios (S0, S1, S2 and S3), for both generators, Synthpop and BayNet. Panel (a) shows results for UK Census; panel (b) shows results for Adult.
Figure 4: Mean and standard deviation of MIA accuracy in scenario (S1) Black box for a varying number $m$ of synthetic records available to the attacker. Results are for BayNet and the query-based attack on (a) UK Census and (b) Adult.

Theorems & Definitions (2)

Definition 1
Definition 2
