SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AI

Spencer Giddens, Xiaon Lang, Fang Liu

TL;DR

SAFES addresses privacy and fairness in synthetic data jointly by applying DP data synthesis first and fairness-aware preprocessing second, yielding a general, modular framework compatible with multiple DP synthesizers (e.g., AIM, DP-CTGAN) and fairness preprocessors (e.g., TOT, RW). The experiments indicate that, under reasonable privacy budgets, SAFES can improve fairness with limited utility loss, offering a practical pathway to responsible AI data releases. The work also discusses the inherent trade-offs among privacy, fairness, and utility, reports robustness under varied settings, and highlights scalability constraints that motivate future work on efficiency and broader fairness definitions. Overall, SAFES contributes a flexible toolkit for producing privacy-preserving, fairness-aware synthetic data for downstream ML tasks in domains such as lending and criminal justice.

Abstract

As data-driven and AI-based decision making gains widespread adoption across disciplines, it is crucial that both data privacy and decision fairness are appropriately addressed. Although differential privacy (DP) provides a robust framework for guaranteeing privacy and methods are available to improve fairness, most prior work treats the two concerns separately. Existing approaches that do consider privacy and fairness simultaneously typically focus on a single learning task, which limits their generalizability. In response, we introduce SAFES, a Sequential PrivAcy and Fairness Enhancing data Synthesis procedure that combines DP data synthesis with a subsequent fairness-aware data preprocessing step. SAFES gives users flexibility in navigating the privacy-fairness-utility trade-offs. We illustrate SAFES with different DP synthesizers and fairness-aware data preprocessing methods and run extensive experiments on multiple real datasets to examine the privacy-fairness-utility trade-offs of synthetic data generated by SAFES. Empirical evaluations demonstrate that, for reasonable privacy loss, SAFES-generated synthetic data can achieve significantly improved fairness metrics with relatively low utility loss.
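The two-stage idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a Laplace-noised histogram over one binary protected attribute and one binary label stands in for a real DP synthesizer such as AIM or DP-CTGAN, a reweighing scheme in the style of Kamiran and Calders stands in for the fairness-aware preprocessor, and the function names `dp_histogram_synthesize`, `reweighing_weights`, and `weighted_spd` are hypothetical. It is not the authors' implementation.

```python
import numpy as np

def dp_histogram_synthesize(data, epsilon, n_synth, rng):
    """Toy epsilon-DP synthesizer (assumed stand-in): noisy 2x2 histogram, then sampling."""
    counts = np.zeros((2, 2))
    np.add.at(counts, (data[:, 0], data[:, 1]), 1.0)
    # Replacing one record changes two cells by 1 each, so the L1 sensitivity is 2.
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0.0, None).ravel()
    if probs.sum() == 0.0:
        probs = np.ones(4)  # fall back to uniform if all noisy counts were clipped
    probs = probs / probs.sum()
    cells = rng.choice(4, size=n_synth, p=probs)
    # Cell index encodes (a, y) as a * 2 + y.
    return np.column_stack((cells // 2, cells % 2))

def reweighing_weights(synth):
    """Fairness stage (assumed stand-in): weight w(a, y) = P(a) P(y) / P(a, y)."""
    a, y = synth[:, 0], synth[:, 1]
    w = np.ones(len(synth))
    for av in (0, 1):
        for yv in (0, 1):
            mask = (a == av) & (y == yv)
            if mask.any():
                w[mask] = (a == av).mean() * (y == yv).mean() / mask.mean()
    return w

def weighted_spd(synth, w):
    """Weighted P(y=1 | a=0) minus weighted P(y=1 | a=1)."""
    a, y = synth[:, 0], synth[:, 1]
    def rate(av):
        return np.sum(w[a == av] * y[a == av]) / np.sum(w[a == av])
    return rate(0) - rate(1)

# Illustrative "original" data with a deliberate gap in positive rates between groups.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
y = (rng.random(5000) < np.where(a == 1, 0.6, 0.3)).astype(int)
original = np.column_stack((a, y))

# Stage 1 touches the original data; stage 2 only reads the synthetic output.
synth = dp_histogram_synthesize(original, epsilon=1.0, n_synth=5000, rng=rng)
weights = reweighing_weights(synth)
print("SPD before weighting:", weighted_spd(synth, np.ones(len(synth))))
print("SPD after weighting: ", weighted_spd(synth, weights))
```

Because the second stage reads only the synthetic data, the standard post-processing property of DP implies that it consumes no additional privacy budget in this sketch.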

Paper Structure

This paper contains 37 sections, 2 theorems, 14 equations, 29 figures, 14 tables, and 3 algorithms.

Key Result

Theorem 3

Let $\mathcal{M}$ be a mechanism satisfying $\rho$-zCDP. For any given $\varepsilon\ge0$, $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-DP with
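To make this kind of conversion concrete, the sketch below uses the older closed-form bound of Bun and Steinke (2016), under which a $\rho$-zCDP mechanism satisfies $(\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)$-DP for any $\delta > 0$. This is a swapped-in, looser bound and not the expression from Canonne2020 that the theorem cites; the function names `zcdp_to_epsilon` and `zcdp_to_delta` are hypothetical.

```python
import math

def zcdp_to_epsilon(rho, delta):
    """Epsilon implied by rho-zCDP at a given delta (Bun and Steinke bound)."""
    if rho <= 0.0 or not (0.0 < delta < 1.0):
        raise ValueError("need rho > 0 and 0 < delta < 1")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

def zcdp_to_delta(rho, epsilon):
    """Delta implied by rho-zCDP at a given epsilon, obtained by inverting the bound above."""
    if rho <= 0.0:
        raise ValueError("need rho > 0")
    if epsilon <= rho:
        # The bound gives no delta below 1 here; (epsilon, 1)-DP holds trivially.
        return 1.0
    return math.exp(-((epsilon - rho) ** 2) / (4.0 * rho))

# Example: privacy loss implied by rho = 0.1 at delta = 1e-6.
eps_example = zcdp_to_epsilon(0.1, 1e-6)
print(eps_example, zcdp_to_delta(0.1, eps_example))
```

A tighter conversion yields a smaller $\delta$ for the same $\rho$ and $\varepsilon$, so this sketch is conservative relative to the cited result.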

Figures (29)

  • Figure 1: The SAFES procedure and its applications
  • Figure 1: Mean $\pm$ 1 SD (error bars and shaded regions) summed TVD in each marginal set for 1-way, 2-way, and 3-way marginals between the synthetic data vs the original data for the Adult experiment.
  • Figure 2: Examples of the privacy (points on each line) vs fairness (y-axis) vs utility (x-axis) trade-off in the Adult experiment. In each plot, each point on a line represents the mean and the error bar indicates $\pm1$ SD over 35 repeats at a different privacy loss $\varepsilon$ value $\in\{10^{-2}\; (\hbox{rightmost}), 10^{-1.5}, 10^{-1}, \ldots, 10\; (\hbox{leftmost})\}$; lines represent different fairness parameters $\eta$; $x$-axis values further left correspond to better utility.
  • Figure 2: Mean $\pm$ 1 SD (error bars and shaded regions) test statistic and corresponding p-value for the KS test comparing original and synthetic datasets for the Adult experiment. Statistical significance threshold of $\alpha=0.05$ is marked in red in the plot on the right.
  • Figure 3: Examples of the privacy (points on each line) vs fairness (y-axis) vs utility (x-axis) trade-off in the COMPAS experiment. In each plot, each point on a line represents the mean and the error bar indicates $\pm1$ SD over 35 repeats at a different privacy loss parameter $\varepsilon$ value $\in\{10^{-2}\; (\hbox{rightmost}), 10^{-1.5}, 10^{-1}, \ldots, 10\; (\hbox{leftmost})\}$; lines represent different fairness parameters $\eta$; $x$-axis values further left correspond to better utility.
  • ...and 24 more figures
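The summed TVD over $k$-way marginals mentioned in the captions above can be sketched as follows, assuming TVD means total variation distance, that each dataset is a list of tuples of categorical values, and that the sum runs over all column subsets of size $k$. The helper names `marginal_tvd` and `summed_tvd` are hypothetical, and the paper's exact choice of marginal sets may differ.

```python
from collections import Counter
from itertools import combinations

def marginal_tvd(rows_a, rows_b, cols):
    """Total variation distance between the empirical marginals on the given columns."""
    count_a = Counter(tuple(r[c] for c in cols) for r in rows_a)
    count_b = Counter(tuple(r[c] for c in cols) for r in rows_b)
    keys = set(count_a) | set(count_b)
    # A Counter returns 0 for missing keys, so unseen cells contribute their full mass.
    return 0.5 * sum(
        abs(count_a[k] / len(rows_a) - count_b[k] / len(rows_b)) for k in keys
    )

def summed_tvd(rows_a, rows_b, n_cols, k):
    """Sum of TVDs over every k-way marginal (all column subsets of size k)."""
    return sum(
        marginal_tvd(rows_a, rows_b, cols) for cols in combinations(range(n_cols), k)
    )

# Two tiny datasets with identical 1-way marginals but different joint structure.
original = [(0, 0), (0, 1), (1, 0), (1, 1)]
synthetic = [(0, 0), (0, 0), (1, 1), (1, 1)]
tvd_1way = summed_tvd(original, synthetic, n_cols=2, k=1)
tvd_2way = summed_tvd(original, synthetic, n_cols=2, k=2)
print(tvd_1way, tvd_2way)
```

The toy example shows why higher-order marginals are checked separately: the 1-way marginals match exactly while the 2-way marginal does not.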

Theorems & Definitions (9)

  • Definition 1: $(\varepsilon, \delta)$-differential privacy Dwork2006OurData, Dwork2006Calibrating
  • Definition 2: Zero-concentrated DP (zCDP) Bun2016
  • Theorem 3: Conversion of $\rho$-zCDP to $(\varepsilon, \delta)$-DP Canonne2020
  • Definition 4: Gaussian mechanism Dwork2006OurData, Bun2016
  • Definition 5: Exponential mechanism McSherry2007
  • Definition 6: Conditional outcome difference (COD)
  • Definition 7: Statistical parity difference (SPD) and average odds difference (AOD) Dwork2012, Hardt2016
  • Definition 8: Conditional utility difference (CUD)
  • Proposition 9
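For the fairness metrics named in Definition 7, a minimal sketch of how SPD and AOD are commonly computed for a binary classifier is given below. The sign convention (unprivileged group minus privileged group) and the encoding of the privileged group as 1 are assumptions, as are the function names; the paper's own definitions may differ in sign or notation.

```python
import numpy as np

def _rates(y_true, y_pred, mask):
    """Selection rate, true positive rate, and false positive rate within one group."""
    yt, yp = y_true[mask], y_pred[mask]
    return yp.mean(), yp[yt == 1].mean(), yp[yt == 0].mean()

def statistical_parity_difference(y_pred, group):
    """P(pred = 1 | unprivileged) - P(pred = 1 | privileged)."""
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def average_odds_difference(y_true, y_pred, group):
    """Mean of the FPR gap and the TPR gap between the two groups."""
    _, tpr_u, fpr_u = _rates(y_true, y_pred, group == 0)
    _, tpr_p, fpr_p = _rates(y_true, y_pred, group == 1)
    return 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))

# Small illustrative example; group 1 is treated as privileged (assumed encoding).
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

spd = statistical_parity_difference(y_pred, group)
aod = average_odds_difference(y_true, y_pred, group)
print(spd, aod)
```

Under this convention, values near zero indicate similar treatment of the two groups, and negative values indicate that the unprivileged group receives favorable predictions less often.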