Privacy Vulnerabilities in Marginals-based Synthetic Data

Steven Golob; Sikha Pentyala; Anuar Maratkhan; Martine De Cock

TL;DR

This paper addresses privacy vulnerabilities in marginals-based synthetic data under differential privacy by introducing MAMA-MIA, a lightweight membership inference attack that leverages auxiliary data and black-box knowledge of the SDG to build a density estimator $\zeta$ of the training data. By identifying focal-points through shadow modelling and aggregating them into $\zeta$, the method achieves competitive or superior inference accuracy compared to state-of-the-art MIAs while dramatically reducing computational requirements. The authors demonstrate effectiveness across representative marginals-based SDGs (MST, PrivBayes, Private-GSD) on the SNAKE and California Housing datasets, and discuss broader implications for privacy policy and SDG design. The work reveals that even DP-protected, marginals-preserving generators can leak individual information, motivating new defenses and auditing approaches for synthetic data systems.
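To make the density-ratio idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that the focal-points are already known and are given as tuples of column indices, that records are tuples of discrete values, and that a simple additive constant (`smoothing`, a hypothetical parameter) is enough to avoid division by zero. The function names are illustrative only.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Record = Tuple[int, ...]


def marginal_counts(data: Sequence[Record], cols: Tuple[int, ...]) -> Counter:
    """Count how often each value combination of `cols` occurs in `data`."""
    return Counter(tuple(row[c] for c in cols) for row in data)


def density_ratio_scores(
    targets: Sequence[Record],
    synth: Sequence[Record],
    aux: Sequence[Record],
    focal_points: Sequence[Tuple[int, ...]],
    smoothing: float = 1.0,  # hypothetical constant, keeps ratios finite
) -> List[float]:
    """Score each target by how over-represented its marginal cells are
    in the synthetic data relative to the auxiliary data.

    A higher score suggests the record is more likely a training member.
    """
    scores = [0.0] * len(targets)
    for cols in focal_points:
        synth_counts = marginal_counts(synth, cols)
        aux_counts = marginal_counts(aux, cols)
        for i, row in enumerate(targets):
            key = tuple(row[c] for c in cols)
            p_synth = (synth_counts[key] + smoothing) / (len(synth) + smoothing)
            p_aux = (aux_counts[key] + smoothing) / (len(aux) + smoothing)
            # Aggregate the per-marginal ratios by summation (an assumption).
            scores[i] += p_synth / p_aux
    return scores
```

Summing the ratios is only one possible aggregation; the paper's own estimator $\zeta$ may weight or combine focal-points differently.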

Abstract

When acting as a privacy-enhancing technology, synthetic data generation (SDG) aims to maintain a resemblance to the real data while excluding personally identifiable information. Many SDG algorithms provide robust differential privacy (DP) guarantees to this end. However, we show that the strongest class of SDG algorithms (those that preserve marginal probabilities, or similar statistics, from the underlying data) leaks information about individuals that can be recovered more efficiently than previously understood. We demonstrate this by presenting a novel membership inference attack, MAMA-MIA, and evaluating it against three seminal DP SDG algorithms: MST, PrivBayes, and Private-GSD. MAMA-MIA leverages knowledge of which SDG algorithm was used, allowing it to learn information about the hidden data more accurately, and orders of magnitude faster, than other leading attacks. We use MAMA-MIA to lend insight into existing SDG vulnerabilities. Our approach went on to win the first SNAKE (SaNitization Algorithm under attacK ... $\varepsilon$) competition.

Paper Structure (24 sections, 8 equations, 16 figures, 8 tables, 1 algorithm)


Introduction
Preliminaries and related work
Synthetic data generation
Membership inference
Related MIAs
Problem description
MAMA-MIA: the MArginals Measurement Aggregation-based Membership Inference Attack
Shadow modelling in MAMA-MIA
Estimating the densities $\zeta$ on marginals-based synthetic data
Adaptation of MAMA-MIA to each SDG algorithm
Experiments and results
Experiment A results
Experiment B results
Experiment C results
Additional results
...and 9 more sections

Figures (16)

Figure 1: Membership Inference Overview. The SDG algorithm takes $D_{train}$ as input, and produces $D_{synth}$. While an attacker has access to $D_{aux}$ and $D_{synth}$, $D_{train}$ remains hidden. The goal of the attack is to detect which records in $D_{target}$ are included in $D_{train}$.
Figure 2: A simple visualization of how DOMIAS and MAMA-MIA detect overfitting. $S$ gives some density estimation of $D_{synth}$ and $D_{aux}$ (left). Normalizing $S(D_{synth})$ by $S(D_{aux})$ exposes overfitting on $D_{train}$ (right).
Figure 3: Experiment A. The accuracy of membership inference performed with MAMA-MIA is on par with, or higher than, that of the state-of-the-art membership inference attacks TAPAS [houssiau2022tapas] and DOMIAS with BNAF/KDE [van2023membership]. The results are consistent regardless of the SDG algorithm used to generate the synthetic data, i.e., MST [mckenna2021winning], PrivBayes [zhang2017privbayes], and Private-GSD [liu2023generating].
Figure 4: Experiment B. Trialing size configurations (i.)-(vi.) in Table \ref{tab:dataset_sizes} on the SNAKE data, with $\varepsilon = 10$. (We omit the $|D_{train}| = 31,623$ results for Private-GSD and for the DOMIAS+BNAF attack due to computational limitations.)
Figure 5: Conducting membership inference attacks on the housing data.
...and 11 more figures
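The setup in Figure 1 can be sketched as a small membership game. This is an illustrative toy under stated assumptions, not the paper's evaluation protocol: the helper names, the half-members/half-non-members target split, and the median threshold are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple


def make_membership_game(
    population: Sequence, n_train: int, n_target: int, seed: int = 0
) -> Tuple[list, list, List[int]]:
    """Sample a hidden training set and a labelled target set.

    Half of the targets are drawn from the training set (label 1),
    the rest from records outside it (label 0). Assumes the population
    holds distinct records and is large enough for both draws.
    """
    rng = random.Random(seed)
    pool = list(population)
    rng.shuffle(pool)
    train = pool[:n_train]
    half = n_target // 2
    members = train[:half]
    non_members = pool[n_train:n_train + (n_target - half)]
    targets = members + non_members
    labels = [1] * len(members) + [0] * len(non_members)
    return train, targets, labels


def attack_accuracy(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold scores at their median and report the fraction of
    targets whose predicted membership matches the true label."""
    threshold = sorted(scores)[len(scores) // 2]
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

In such a game, an attacker would compute one score per target from $D_{synth}$ and $D_{aux}$ only, and the labels would be used solely to grade the attack afterwards.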

Theorems & Definitions (1)

Definition 3.1
