VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication

Xun Yuan, Yang Yang, Prosanta Gope, Aryan Pasikhani, Biplab Sikdar

TL;DR

The paper addresses privacy-preserving publication of synthetic data from vertically partitioned datasets by introducing VFLGAN, a vertical federated learning-based GAN that effectively learns cross-party attribute correlations. It extends to DP-VFLGAN with a novel Gaussian mechanism to achieve $(\epsilon,\delta)$-DP and provides a practical auditing framework (ASSD/ASIF) to empirically assess privacy leakage beyond worst-case DP. Empirical results on MNIST and multiple tabular datasets show VFLGAN substantially improves synthetic data quality over VertiGAN and that the DP variant maintains utility within DP budgets while reducing leakage. The work offers a practical approach for GDPR-compliant synthetic data publication with robust privacy analysis and provides a basis for future exploration of internal threat models.

Abstract

In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, good data is not a free lunch and is often hard to access due to privacy regulations like the General Data Protection Regulation (GDPR). A potential solution is to release a synthetic dataset with a similar distribution to that of the private dataset. Nevertheless, in some scenarios, the attributes needed to train an AI model belong to different parties, and they cannot share the raw data for synthetic data publication due to privacy regulations. In PETS 2023, Xue et al. proposed the first generative adversarial network-based model, VertiGAN, for vertically partitioned data publication. However, after thorough investigation, we found that VertiGAN is less effective in preserving the correlation among the attributes of different parties. This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication to address the above issues. Our experimental results show that compared with VertiGAN, VFLGAN significantly improves the quality of synthetic data. Taking the MNIST dataset as an example, the quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN w.r.t. the Fréchet Distance. We also designed a more efficient and effective Gaussian mechanism for the proposed VFLGAN to provide the synthetic dataset with a differential privacy guarantee. On the other hand, differential privacy only gives the upper bound of the worst-case privacy guarantee. This article also proposes a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.

Paper Structure (44 sections, 6 theorems, 22 equations, 13 figures, 6 tables, 5 algorithms)

Key Result

Proposition 1

(Gaussian Mechanism) Let $f: D \rightarrow R$ be an arbitrary function with sensitivity $\Delta_2 f=\max _{D, D^{\prime}}\left\|f(D)-f\left(D^{\prime}\right)\right\|_2$ for any adjacent $D, D^\prime \in \mathcal{D}$. The Gaussian Mechanism $\mathcal{M}_\sigma$, $\mathcal{M}_\sigma(\boldsymbol{x})=f(\boldsymbol{x})+\ldots$
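
To make the generic Gaussian mechanism concrete, here is a minimal Python sketch. It is not the DP-VFLGAN mechanism from the paper. It only illustrates the textbook construction: a query whose $\ell_2$ sensitivity is bounded, with Gaussian noise of standard deviation $\sigma$ added to each coordinate. The clipped-sum query, the add/remove-one-record notion of adjacency, and the names `clip_l2`, `clipped_sum`, and `gaussian_mechanism` are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def clip_l2(record: Sequence[float], clip_norm: float) -> List[float]:
    """Scale a record so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in record))
    scale = 1.0 if norm <= clip_norm else clip_norm / norm
    return [v * scale for v in record]


def clipped_sum(dataset: Sequence[Sequence[float]], clip_norm: float) -> List[float]:
    """Illustrative query f(D): coordinate-wise sum of the clipped records.

    Assumption: adjacent datasets differ by adding or removing one record,
    so the L2 sensitivity of this query is at most clip_norm.
    """
    total = [0.0] * len(dataset[0])
    for record in dataset:
        for i, v in enumerate(clip_l2(record, clip_norm)):
            total[i] += v
    return total


def gaussian_mechanism(value: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Add independent N(0, sigma^2) noise to every coordinate of f(D)."""
    return [v + rng.gauss(0.0, sigma) for v in value]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
data = [[3.0, 4.0], [0.3, 0.4], [1.0, 0.0]]
clip_norm = 1.0

exact = clipped_sum(data, clip_norm)                   # f(D)
noisy = gaussian_mechanism(exact, sigma=2.0, rng=rng)  # f(D) + noise
```

Clipping is what makes the sensitivity finite here: without it, a single record could shift the sum by an arbitrary amount, and no fixed $\sigma$ would give a meaningful guarantee. For a given sensitivity, a larger $\sigma$ gives stronger privacy at the cost of a noisier output.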

Figures (13)

  • Figure 1: This figure shows synthetic samples generated by GANs trained on vertically partitioned MNIST data (the digits are split evenly into upper and lower halves). The left figure displays samples generated by VertiGAN (Xue et al.), highlighting unrecognizable and discontinuous digits. The right figure shows samples generated by the proposed VFLGAN.
  • Figure 2: This figure shows the framework of VertiGAN.
  • Figure 3: System Model.
  • Figure 4: Framework of the proposed VFLGAN.
  • Figure 5: FD curves (lower is better) and IS curves (higher is better) on the MNIST dataset.
  • ...and 8 more figures

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Theorem 1