Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence

Amir Asiaee, Chao Yan, Zachary B. Abrams, Bradley A. Malin

TL;DR

This work addresses the unequal privacy risk faced by outliers in differentially private synthetic data by proposing risk-equalized DP synthesis (REPS). It combines a private outlier scorer with a risk-weighted learning stage, ensuring that high-risk records contribute less to the learned generator, thereby tightening per-record privacy bounds under Gaussian mechanisms while maintaining overall DP guarantees. Theoretical results connect record-level influence to per-instance privacy, and experiments on simulated and real tabular datasets show reductions in top-decile membership inference risk and varied utility effects depending on scorer quality and dataset. The approach offers a practical mechanism to improve privacy equity in synthetic data releases, with applicability to DP-SGD and broader data-sharing contexts, though performance is dataset-dependent and sensitive to scorer accuracy.

Abstract

When synthetic data is released, some individuals are harder to protect than others. A patient with a rare disease combination or a transaction with unusual characteristics stands out from the crowd. Differential privacy provides worst-case guarantees, but empirical attacks -- particularly membership inference -- succeed far more often against such outliers, especially under moderate privacy budgets and with auxiliary information. This paper introduces risk-equalized DP synthesis, a framework that prioritizes protection for high-risk records by reducing their influence on the learned generator. The mechanism operates in two stages: first, a small privacy budget estimates each record's "outlierness"; second, a DP learning procedure weights each record inversely to its risk score. Under Gaussian mechanisms, a record's privacy loss is proportional to its influence on the output -- so deliberately shrinking outliers' contributions yields tighter per-instance privacy bounds for precisely those records that need them most. We prove end-to-end DP guarantees via composition and derive closed-form per-record bounds for the synthesis stage (the scoring stage adds a uniform per-record term). Experiments on simulated data with controlled outlier injection show that risk-weighting substantially reduces membership inference success against high-outlierness records; ablations confirm that targeting -- not random downweighting -- drives the improvement. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, highlighting the interplay between scorer quality and synthesis pipeline.
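The link between influence and per-record privacy loss described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes outlier scores already lie in [0, 1], uses a hypothetical weighting rule `max(floor, 1 - score)`, releases a single weighted sum under Gaussian noise, and converts influence to a per-record epsilon with the classical Gaussian-mechanism calibration (which is only valid when the resulting epsilon is below 1). All function names and parameters are illustrative.

```python
import math
import numpy as np

def risk_weights(scores, floor=0.1):
    """Hypothetical rule: map outlier scores in [0, 1] to weights in [floor, 1].
    A higher risk score gives a smaller weight."""
    scores = np.clip(np.asarray(scores, dtype=float), 0.0, 1.0)
    return np.maximum(floor, 1.0 - scores)

def weighted_gaussian_sum(X, weights, clip_norm, sigma, rng):
    """Release a noisy weighted sum of records whose L2 norm is clipped to clip_norm.
    The weights are treated as fixed inputs here."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    clipped = X * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    total = (weights[:, None] * clipped).sum(axis=0)
    return total + rng.normal(0.0, sigma, size=total.shape)

def per_record_epsilon(weights, clip_norm, sigma, delta):
    """Assumed calibration: record i moves the sum by at most w_i * clip_norm,
    so eps_i = w_i * clip_norm * sqrt(2 ln(1.25 / delta)) / sigma (classical bound, eps_i < 1)."""
    influence = np.asarray(weights, dtype=float) * clip_norm
    return influence * math.sqrt(2.0 * math.log(1.25 / delta)) / sigma

weights = risk_weights([0.0, 0.5, 0.95])
eps = per_record_epsilon(weights, clip_norm=1.0, sigma=5.0, delta=1e-5)
noisy = weighted_gaussian_sum(np.ones((3, 2)), weights, 1.0, 5.0, np.random.default_rng(0))
```

In this sketch the record with the highest score receives the smallest weight and therefore the smallest per-record epsilon for the release step; the scoring stage would add its own uniform term on top.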

Paper Structure

This paper contains 83 sections, 7 theorems, 15 equations, 6 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 1

If the Stage 1 scorer $\mathcal{W}$ is $(\varepsilon_s,\delta_s)$-DP and the Stage 2 learner $\mathcal{A}(\cdot;w)$ is $(\varepsilon_t,\delta_t)$-DP for any fixed $w$, then Algorithm 1 is $(\varepsilon_s+\varepsilon_t,\delta_s+\delta_t)$-DP.
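The budget arithmetic in Theorem 1 can be sketched in a few lines. The `Budget` type and the `split_budget` helper are hypothetical names introduced for illustration; the proportional split is an assumption, not a rule stated in the paper.

```python
from typing import NamedTuple, Tuple

class Budget(NamedTuple):
    """An (epsilon, delta) privacy budget."""
    epsilon: float
    delta: float

def compose(scorer: Budget, learner: Budget) -> Budget:
    """Basic composition: epsilons add and deltas add, as in Theorem 1."""
    return Budget(scorer.epsilon + learner.epsilon, scorer.delta + learner.delta)

def split_budget(total: Budget, scorer_fraction: float) -> Tuple[Budget, Budget]:
    """Hypothetical helper: give the scorer a fixed fraction of the total budget
    and the learner the remainder."""
    if not 0.0 < scorer_fraction < 1.0:
        raise ValueError("scorer_fraction must lie strictly between 0 and 1")
    scorer = Budget(total.epsilon * scorer_fraction, total.delta * scorer_fraction)
    learner = Budget(total.epsilon - scorer.epsilon, total.delta - scorer.delta)
    return scorer, learner

# Example: spend 10% of a (1.0, 1e-5) budget on scoring, the rest on synthesis.
scorer_budget, learner_budget = split_budget(Budget(1.0, 1e-5), 0.1)
end_to_end = compose(scorer_budget, learner_budget)
```

Composing the two stage budgets recovers the total, which is the end-to-end guarantee the theorem states.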

Figures (6)

  • Figure 1: Utility (TSTR AUROC) versus privacy budget on the simulated dataset. REPS shows slightly lower utility, the expected cost of downweighting outliers to protect their privacy.
  • Figure 2: Top-10% MIA advantage versus privacy budget on the simulated dataset. Lower is better: REPS reduces attacker success against the most vulnerable records.
  • Figure 3: MIA advantage by outlier decile (DOMIAS-style), for Adult and Breast Cancer at $\varepsilon\in\{1.0,4.0\}$. Lower curves indicate better privacy; the gap between decile 10 and decile 1 reflects privacy inequity.
  • Figure 4: Scorer quality versus $\varepsilon$: Spearman correlation between DP histogram scores and non-private $k$NN outlier scores. Higher correlation indicates the DP scorer successfully identifies true outliers.
  • Figure 5: Scorer quality versus $\varepsilon$: Recall@Top-10% between the DP-score top decile and the $k$NN top decile. Higher recall means the DP scorer correctly flags the same high-risk records as the oracle.
  • ...and 1 more figure
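The two scorer-quality metrics named in Figures 4 and 5 can be computed as in the sketch below. It assumes two aligned score vectors (one from the DP scorer, one from a non-private reference such as $k$NN distance) and uses NumPy only; the function names are illustrative, and ties in the scores are broken by position rather than averaged, which differs from the standard tie-corrected Spearman coefficient.

```python
import numpy as np

def _ranks(values):
    """Ordinal ranks (0 = smallest). Ties are broken by position, not averaged."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    return float(np.corrcoef(_ranks(a), _ranks(b))[0, 1])

def recall_at_top(dp_scores, oracle_scores, fraction=0.1):
    """Share of the oracle's top-`fraction` records that also appear
    in the DP scorer's top-`fraction` records."""
    n = len(dp_scores)
    k = max(1, int(round(fraction * n)))
    top_dp = set(np.argsort(dp_scores)[-k:].tolist())
    top_oracle = set(np.argsort(oracle_scores)[-k:].tolist())
    return len(top_dp & top_oracle) / k

oracle = np.arange(20.0)      # stand-in for non-private outlier scores
monotone = oracle ** 2        # a scorer that preserves the ranking
reversed_scores = -oracle     # a scorer that inverts the ranking
```

A scorer that preserves the oracle's ranking scores 1.0 on both metrics, while one that inverts it gives a Spearman correlation of -1 and zero recall in the top decile.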

Theorems & Definitions (20)

  • Definition 1: Differential Privacy (Dwork et al., 2006)
  • Definition 2: Per-instance DP (Wang, 2019)
  • Definition 3: Outlier score
  • Theorem 1: Composition
  • Corollary 1: End-to-end per-instance bound
  • Definition 4: Record-level influence
  • Theorem 2: Influence-to-privacy bound
  • Remark 1: What is new
  • Lemma 1: Influence under weighting
  • Corollary 2: Per-instance bound for weighted release
  • ...and 10 more