Synthetic Data Outliers: Navigating Identity Disclosure

Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz

TL;DR

The paper investigates re-identification risk in synthetic data, focusing on outliers, and shows that linkage attacks can re-identify extreme records even when the data are synthetic. The authors generate 102 synthetic variants of the Credit Risk dataset with deep-learning-based (SDV) and differential-privacy-based (DPART) models, then evaluate each variant's data utility and its re-identification risk under a linkage attack. The results show a model-dependent privacy-utility tradeoff: DP-based methods reduce linkage risk but degrade utility, while DL-based methods preserve utility at the cost of higher privacy risk for outliers. The study offers guidance on hyperparameter tuning and model choice, highlights practical implications for deploying synthetic data in privacy-sensitive domains, and underscores the need for attack-aware synthesis strategies to prevent outlier leakage.

Abstract

Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic data to the original data raises important questions about the protection of individuals' privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data with respect to outliers. Our main findings suggest that outlier re-identification via linkage attack is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of data utility.
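
To make the idea of re-identifying outliers via a linkage attack concrete, here is a minimal sketch using only the Python standard library. It is not the paper's implementation. The attribute names come from the Figure 4 caption, but the records, the use of Tukey's fences to flag outliers, the 5% numeric tolerance, and all function names are assumptions chosen for illustration.

```python
from statistics import quantiles

# Quasi-identifiers named in the Figure 4 caption; all values below are made up.
NUMERIC_KEYS = ("person_age", "person_income")
CATEGORICAL_KEYS = ("loan_intent",)


def tukey_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def is_possible_match(orig, synth, rel_tol=0.05):
    """True if categorical keys agree and numeric keys differ by at most rel_tol."""
    if any(orig[key] != synth[key] for key in CATEGORICAL_KEYS):
        return False
    return all(
        abs(orig[key] - synth[key]) <= rel_tol * abs(orig[key])
        for key in NUMERIC_KEYS
    )


def link_outliers(original, synthetic, outlier_key="person_age"):
    """Map each original outlier index to the synthetic indices it links to."""
    flagged = tukey_outliers([row[outlier_key] for row in original])
    return {
        i: [j for j, s in enumerate(synthetic) if is_possible_match(original[i], s)]
        for i in flagged
    }


# Toy "original" data with one extreme age (hypothetical records).
ages = [22, 24, 25, 27, 29, 31, 33, 94]
incomes = [30000, 35000, 40000, 42000, 50000, 52000, 60000, 45000]
intents = ["EDUCATION", "PERSONAL", "EDUCATION", "MEDICAL",
           "VENTURE", "PERSONAL", "MEDICAL", "MEDICAL"]
original = [
    {"person_age": a, "person_income": inc, "loan_intent": it}
    for a, inc, it in zip(ages, incomes, intents)
]

# Toy "synthetic" data in which the extreme record has been nearly reproduced.
synthetic = [
    {"person_age": 26, "person_income": 41000, "loan_intent": "EDUCATION"},
    {"person_age": 93, "person_income": 46000, "loan_intent": "MEDICAL"},
    {"person_age": 30, "person_income": 51000, "loan_intent": "PERSONAL"},
]

matches = link_outliers(original, synthetic)
print(matches)
```

In this toy setup, an outlier that links to exactly one synthetic record is the worrying case: the fewer possible matches an attacker has to sift through, the easier re-identification becomes.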

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Example of the distribution of the attribute person_age in the original dataset compared to a DPsynthpop variant.
  • Figure 2: AttributeCoverage of the synthetic dataset variants generated with deep learning-based models (left) and differential privacy-based models (right).
  • Figure 3: StatisticSimilarity of the synthetic dataset variants generated with deep learning-based models (left) and differential privacy-based models (right).
  • Figure 4: Distribution of records that are possible matches regarding the person_age and person_income attributes (left) and also including person_home_ownership and loan_intent attributes (right) with one possible match highlighted.
  • Figure 5: Possible matches for the synthetic dataset variants generated with deep learning-based models (left) and differential privacy-based models (right).
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition: Outlier detection
  • Definition: Differential Privacy [dwork2006differential]
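
For reference, differential privacy is conventionally stated as follows. This is the standard textbook formulation of ε-differential privacy, not a quotation of the paper's own definition; the symbols (mechanism M, neighboring datasets D and D', output set S) are the usual ones and are assumptions here.

```latex
% A randomized mechanism \mathcal{M} is \varepsilon-differentially private if,
% for all datasets D and D' differing in at most one record,
% and for every set of outputs S \subseteq \mathrm{Range}(\mathcal{M}):
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\]
```

A smaller ε forces the two output distributions closer together, so the presence or absence of any single record, including an outlier, has less influence on what is released. That stronger protection is consistent with the utility loss reported for the DP-based models.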