High Epsilon Synthetic Data Vulnerabilities in MST and PrivBayes

Steven Golob; Sikha Pentyala; Anuar Maratkhan; Martine De Cock

TL;DR

The paper demonstrates that a high differential privacy budget ($\varepsilon$) can enable unambiguous membership inference attacks on two state-of-the-art differentially private synthetic data generators (DP-SDGs), MST and PrivBayes. It extends the DOMIAS framework with a black-box attack that assumes access to auxiliary data, using shadow modelling to identify focal points and to construct a problem-specific density estimator $S$ that supports accurate inference of whether a record was part of the training data. Experimental results show that attack efficacy increases with $\varepsilon$, reaching high membership-advantage scores, especially for PrivBayes at $\varepsilon=1000$, which highlights practical privacy risks. The findings motivate stronger defenses for DP-SDGs and call for careful consideration of privacy-utility trade-offs in real-world deployments.

Abstract

Synthetic data generation (SDG) has become increasingly popular as a privacy-enhancing technology. It aims to maintain important statistical properties of its underlying training data, while excluding any personally identifiable information. Many SDG algorithms have been developed in recent years to improve and balance these two aims, and many of them provide robust differential privacy guarantees. However, we show here that if the differential privacy parameter $\varepsilon$ is set too high, then unambiguous privacy leakage can result. We show this by conducting a novel membership inference attack (MIA) on two state-of-the-art differentially private SDG algorithms: MST and PrivBayes. Our work suggests that these generators have previously unseen vulnerabilities, and that future work to strengthen their privacy is advisable. We present the heuristic for our MIA here. It assumes knowledge of auxiliary "population" data, and also assumes knowledge of which SDG algorithm was used. We use this information to adapt the recent DOMIAS MIA uniquely to MST and PrivBayes. Our approach went on to win the SNAKE challenge in November 2023.
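The density-ratio idea behind a DOMIAS-style attack can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes categorical records stored as tuples, and it stands in for the density estimator $S$ with a plain empirical frequency over a chosen set of columns. The names `marginal_density`, `density_ratio_score`, `cols`, and `smoothing` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def marginal_density(data, cols):
    """Empirical frequency of each value combination on the columns `cols`.

    Assumption: a simple frequency table stands in for the density
    estimator S described in the paper.
    """
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    n = len(data)
    return {key: cnt / n for key, cnt in counts.items()}


def density_ratio_score(target, synth, aux, cols, smoothing=1e-6):
    """Ratio of the target's estimated density in the synthetic data to its
    estimated density in the auxiliary data.

    A ratio well above 1 suggests the generator over-represents the target's
    values relative to the population, which is treated as evidence of
    membership. `smoothing` (hypothetical) avoids division by zero.
    """
    p_synth = marginal_density(synth, cols)
    p_aux = marginal_density(aux, cols)
    key = tuple(target[c] for c in cols)
    return (p_synth.get(key, 0.0) + smoothing) / (p_aux.get(key, 0.0) + smoothing)


# Toy data: two categorical columns per record.
synth = [(0, 1), (0, 1), (1, 0), (0, 1)]
aux = [(0, 1), (1, 0), (1, 0), (1, 1)]

score_a = density_ratio_score((0, 1), synth, aux, cols=(0, 1))
score_b = density_ratio_score((1, 0), synth, aux, cols=(0, 1))
print(score_a, score_b)
```

In this toy example the record `(0, 1)` is more frequent in the synthetic data than in the auxiliary data, so its score exceeds 1, while `(1, 0)` scores below 1. In the paper's setting, the columns scored would presumably be the focal points identified by shadow modelling rather than a fixed choice.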

Paper Structure (13 sections, 4 equations, 4 figures, 2 tables, 2 algorithms)

Introduction
DOMIAS Overview
Our Approach
Brief interpolation on inherent limitations
MIA on MST
MIA on PrivBayes
Activation Function
Experimental Results
Setup
Membership Advantage
Results
Discussion
Acknowledgments

Figures (4)

Figure 1: A simple visualization of how DOMIAS detects overfitting. (left) $S$ gives an estimation of the probability distributions of $D_{synth}$ and $D_{aux}$. (right) Normalizing $S(D_{synth})$ by $S(D_{aux})$ exposes overfitting on $D_{train}$.
Figure 2: Membership advantage scores using our novel heuristic on different values of $\varepsilon$, averaged over 50 runs
Figure 3: Frequencies at which marginals are selected, when shadow modelling MST across 50 runs, for various $\varepsilon$
Figure 4: Frequencies at which conditionals' parent sizes are selected in PrivBayes when shadow modelling across 50 runs, for various $\varepsilon$
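The membership advantage scores reported in Figure 2 can be related to attack predictions with a short sketch. It assumes the common definition of membership advantage as true-positive rate minus false-positive rate; the paper's exact definition is given in its Membership Advantage section and is not reproduced here. The function name `membership_advantage` and the label encoding (1 for member, 0 for non-member) are assumptions made for illustration.

```python
def membership_advantage(y_true, y_pred):
    """Membership advantage, assumed here to be TPR - FPR.

    y_true: ground-truth labels, 1 = member of the training data, 0 = non-member.
    y_pred: attack predictions with the same encoding.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    positives = sum(1 for t in y_true if t == 1)
    negatives = sum(1 for t in y_true if t == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one member and one non-member")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return true_pos / positives - false_pos / negatives


# Toy example: the attack finds one of two members and raises no false alarms.
adv = membership_advantage([1, 1, 0, 0], [1, 0, 0, 0])
print(adv)
```

Under this definition, a score of 0 corresponds to an attack that flags members and non-members at the same rate, and a score of 1 corresponds to a perfect attack.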
