A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data

Meenatchi Sundaram Muthu Selva Annamalai, Andrea Gadotti, Luc Rocher

TL;DR

This work investigates privacy risks in synthetic data by introducing a new no-box attribute inference attack based on linear reconstruction that leverages preserved marginal statistics. It formalizes an Attribute Inference Privacy Game to measure individual leakage, and demonstrates that many state-of-the-art SDG methods fail to protect arbitrary records, with differential privacy offering protection only in specific settings. The attack uses a linear program over 3-way marginals (with conditional variants) to reconstruct secret attributes, and it outperforms prior attacks in several scenarios, especially on select–measure–generate SDG methods. The study also reveals a nuanced privacy-utility tradeoff: releasing more synthetic data can improve utility but dramatically increase attack efficacy, while DP mechanisms can reduce risk at the cost of utility, depending on the algorithm and dataset. Overall, the results suggest that synthetic data alone do not guarantee robust privacy and that rigorous, context-specific privacy evaluations are essential for practical deployment.

Abstract

Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective.
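
The linear reconstruction idea can be written out as a small optimization problem. The formulation below is a generic sketch in the style of classic reconstruction attacks on aggregate statistics, not the paper's exact program. The symbols $n$, $A$, $b$, $t$, $e$ and the rounding threshold $1/2$ are assumptions introduced here for illustration. It assumes the attacker knows the non-secret attributes of $n$ target records, wants to recover one binary secret attribute per record, and estimates each marginal count $b_q$ from the released synthetic data.

```latex
% Hypothetical notation (not taken from the paper):
%   A_{q,i} = 1 if record i matches the known attribute values of query q, else 0
%   b_q     = count for query q with secret = 1, estimated from the synthetic data
%   t_i     = relaxed guess in [0, 1] for the secret bit of record i
%   e_q     = absolute error tolerated on query q
\begin{aligned}
\min_{t,\,e}\quad & \sum_{q} e_q \\
\text{s.t.}\quad & -e_q \;\le\; \sum_{i=1}^{n} A_{q,i}\, t_i - b_q \;\le\; e_q \quad \text{for all } q, \\
& 0 \le t_i \le 1 \quad \text{for } i = 1, \dots, n.
\end{aligned}

% Rounding step: turn the relaxed solution into a guess for each secret bit.
\hat{s}_i = \mathbb{1}\!\left[\, t_i \ge \tfrac{1}{2} \,\right]
```

Minimizing $\sum_q e_q$ is the same as minimizing the $\ell_1$ distance between the counts implied by the guess $t$ and the counts observed in the synthetic data. Intuitively, the more accurately the synthetic data preserve the real marginals, the more tightly these constraints pin down each individual $t_i$.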

Paper Structure

This paper contains 54 sections, 9 equations, 21 figures, 14 tables.

Figures (21)

  • Figure 1: Receiver operating characteristic (ROC) curves of $\text{Adv}_\textit{recon}$ for a synthetic data size $m = 10^6$.
  • Figure 2: Comparison of attack accuracy (mean $\pm$ s.d.) between $\text{Adv}_\textit{recon}$ and prior attacks $\text{Adv}_\textit{dcr}$ and $\text{Adv}_\textit{infer}$ for a synthetic data size $m = 10^6$.
  • Figure 3: Attack accuracy of $\text{Adv}_\textit{recon}$ for a synthetic data size from $m=10$ to $10^6$.
  • Figure 4: Measurement errors for synthetic data sizes $m=10$ to $10^6$.
  • Figure 5: Tradeoff plot between privacy (attack accuracy) and utility (measurement error).
  • ...and 16 more figures

Theorems & Definitions (3)

  • Definition 3.1: $k$-way marginal query
  • Definition 5.1: Total Variation Distance
  • Definition 5.2: Average $k$-Total Variation Distance
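
The definitions listed above are not reproduced in this summary, so the following is a minimal sketch that uses the standard textbook forms and may differ in detail from the paper's. It treats a $k$-way marginal as the empirical distribution over $k$ chosen attributes, the total variation distance as half the $\ell_1$ distance between two distributions, and the average $k$-TVD as the mean TVD over all attribute subsets of size $k$. The function names `marginal`, `tvd`, `average_k_tvd` and the toy datasets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def marginal(records, attrs):
    """Empirical marginal over the attribute indices in `attrs`.

    Returns a dict mapping each observed value tuple to its relative frequency.
    """
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    n = len(records)
    return {values: c / n for values, c in counts.items()}


def tvd(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def average_k_tvd(real, synth, k):
    """Mean TVD between real and synthetic marginals over all k-attribute subsets."""
    num_attrs = len(real[0])
    subsets = list(combinations(range(num_attrs), k))
    total = sum(tvd(marginal(real, s), marginal(synth, s)) for s in subsets)
    return total / len(subsets)


# Toy data (assumed for illustration): three binary attributes per record.
real = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
synth = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 0)]  # last record differs

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, average_k_tvd(real, synth, k))
```

A lower average $k$-TVD means the synthetic data reproduce the real $k$-way marginals more faithfully. This is the notion of utility (measurement error) that the privacy-utility tradeoff refers to, and it is also the information a marginal-based attack relies on.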