Privacy-Preserving Fair Synthetic Tabular Data

Fatima J. Sarmin, Atiquer R. Rahman, Christopher J. Henry, Noman Mohammed

TL;DR

This work tackles the problem of sharing tabular data while preserving individual privacy and mitigating biases. It introduces PF-WGAN, a single-generator WGAN-GP augmented with identifiability-based privacy loss and demographic parity-based fairness loss to produce privacy-preserving, fair synthetic data. Across four real datasets, PF-WGAN demonstrates competitive utility with improved fairness and reduced identifiability risks compared to baselines, achieving a practical balance between privacy, fairness, and usefulness. The approach offers a scalable, architecture-preserving solution for ethical data sharing, with future directions including differential privacy integration and broader fairness metrics.
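To make the demographic parity idea concrete, here is a minimal sketch of the gap that such a fairness term would aim to shrink: the difference in positive-outcome rates between two groups. The function name `demographic_parity_gap`, the binary encoding of outcome and protected attribute, and the toy data are assumptions for illustration, not the authors' loss implementation.

```python
import numpy as np

def demographic_parity_gap(y, s):
    """Absolute difference in positive-outcome rates between two groups.

    y: binary outcomes (0/1); s: binary protected attribute (0/1).
    Both encodings are assumptions made for this sketch.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(s)
    rate_s0 = y[s == 0].mean()  # P(y = 1 | s = 0)
    rate_s1 = y[s == 1].mean()  # P(y = 1 | s = 1)
    return abs(rate_s1 - rate_s0)

# Toy table (hypothetical): group 1 receives the positive label more often.
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
s = np.array([1, 1, 1, 1, 0, 0, 0, 0])
gap = demographic_parity_gap(y, s)
print(gap)
```

A gap of 0 means both groups receive the positive outcome at the same rate; larger values indicate a stronger dependence of the outcome on the protected attribute.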

Abstract

Sharing tabular data that contains valuable but private information is limited by legal and ethical constraints. Synthetic data could be an alternative solution to this sharing problem, as it is artificially generated by machine learning algorithms and tries to capture the underlying data distribution. However, machine learning models are not free from memorization and may introduce biases, as they rely on training data. Producing synthetic data that preserves privacy and fairness while maintaining utility close to that of the real data is a challenging task. This research addresses the privacy and fairness aspects of synthetic data simultaneously, an area not explored by other studies. In this work, we present PF-WGAN, a privacy-preserving, fair synthetic tabular data generator based on the WGAN-GP model. We have modified the original WGAN-GP by adding privacy and fairness constraints that force it to produce privacy-preserving, fair data. This approach will enable the publication of datasets that protect individuals' privacy and remain unbiased toward any particular group. We compared the results with three state-of-the-art synthetic data generator models in terms of utility, privacy, and fairness across four different datasets. We found that the proposed model exhibits a more balanced trade-off among utility, privacy, and fairness.
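As a rough illustration of the memorization concern, the sketch below flags a real record as at risk when some synthetic record lies closer to it than its nearest real neighbor, and reports the flagged fraction. This is a simplified, unweighted Euclidean variant written for illustration; the function name `identifiability_rate` and the toy data are hypothetical, and the paper's exact identifiability definition may differ.

```python
import numpy as np

def identifiability_rate(real, synth):
    """Fraction of real rows whose nearest synthetic row is closer than
    their nearest *other* real row (simplified, unweighted distances)."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)

    # Pairwise real-to-real distances; mask the diagonal so a row
    # is never counted as its own nearest neighbor.
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=2)
    np.fill_diagonal(rr, np.inf)
    nearest_real = rr.min(axis=1)

    # Distance from each real row to its closest synthetic row.
    rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=2)
    nearest_synth = rs.min(axis=1)

    return float(np.mean(nearest_synth < nearest_real))

# Toy data (hypothetical): the first synthetic row nearly copies a real row.
real = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
synth = np.array([[0.1, 0.0], [10.0, 10.0]])
rate = identifiability_rate(real, synth)
print(rate)
```

A lower rate suggests that fewer real records have a near-duplicate in the synthetic table. The dense pairwise distance matrices keep the sketch short but scale quadratically, so a nearest-neighbor index would be the more practical choice for large tables.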

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Basic Generative Adversarial Network (GAN) architecture; (b) proposed network architecture (Privacy-preserving Fair WGAN: PF-WGAN)
  • Figure 2: Result: Comparison among different models for utility (AUC-ROC score) across the different datasets.
  • Figure 3: Result: Comparison among different models for fairness, measured by demographic parity in the generated synthetic data.
  • Figure 4: Result: Comparison among different models for privacy.