VT-GAN: Cooperative Tabular Data Synthesis using Vertical Federated Learning

Zilong Zhao, Han Wu, Aad Van Moorsel, Lydia Y. Chen

TL;DR

VT-GAN tackles the privacy challenge of generating synthetic tabular data from distributed sources by applying vertical federated learning to state-of-the-art tabular GANs. It partitions the generator and discriminator between a central server and multiple clients, and introduces a training-with-shuffling mechanism to securely incorporate conditional vectors for conditional GAN (CGAN) based data synthesis. Empirical results on five datasets show that, with careful partitioning, synthetic data quality is comparable to centralized baselines and holds up under imbalanced data distributions across clients; Differential Privacy (DP) can mitigate membership inference attacks, but at a notable cost to data utility. The work demonstrates the practical feasibility of privacy-preserving distributed data synthesis and provides guidance on partition strategies, privacy risk, and scalability, while noting remaining challenges such as collusion resistance and data-poisoning defenses.

Abstract

This paper presents the application of Vertical Federated Learning (VFL) to generate synthetic tabular data using Generative Adversarial Networks (GANs). VFL is a collaborative approach to training machine learning models among distinct tabular data holders, such as financial institutions, that possess disjoint features for the same group of customers. In this paper we introduce the VT-GAN framework, Vertical federated Tabular GAN, and demonstrate that VFL can be successfully used to implement GANs for distributed tabular data in a privacy-preserving manner, with performance close to that of centralized GANs that assume shared data. We make design choices with respect to the distribution of the GAN generator and discriminator models and introduce a training-with-shuffling technique so that no party can reconstruct training data from the GAN conditional vector. The paper presents (1) an implementation of VT-GAN, (2) a detailed quality evaluation of the VT-GAN-generated synthetic data, (3) an overall scalability examination of the VT-GAN framework, and (4) a security analysis of VT-GAN's robustness against Membership Inference Attacks under different settings of Differential Privacy, for a range of datasets with diverse distribution characteristics. Our results demonstrate that VT-GAN can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients or with different numbers of clients.
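
To make the vertical setting concrete, the following is a minimal sketch of how a table might be split by columns across clients and then shuffled without breaking row alignment. It is an illustration only, not the VT-GAN implementation: the abstract says only that training-with-shuffling prevents any party from reconstructing training data from the conditional vector, so the shared-seed permutation shown here is an assumption, and the names vertical_split and shuffle_in_lockstep are hypothetical.

```python
import random

def vertical_split(rows, column_groups):
    """Give each client the same rows but a disjoint subset of columns."""
    return [[[row[c] for c in cols] for row in rows] for cols in column_groups]

def shuffle_in_lockstep(client_tables, seed):
    """Apply one shared row permutation to every client's local table.

    Assumption: the seed is agreed among the clients and is not known
    to the server, so the server cannot map a row position back to the
    original record order.
    """
    n_rows = len(client_tables[0])
    perm = list(range(n_rows))
    random.Random(seed).shuffle(perm)
    return [[table[i] for i in perm] for table in client_tables]

# Toy table: 6 customers (rows) x 4 features (columns).
# Row i holds the values 4*i, 4*i + 1, 4*i + 2, 4*i + 3.
table = [[4 * i + j for j in range(4)] for i in range(6)]

# Client 0 holds columns 0-1, client 1 holds columns 2-3.
clients = vertical_split(table, [[0, 1], [2, 3]])
shuffled = shuffle_in_lockstep(clients, seed=7)

# Joining the shuffled partitions row by row recovers complete records,
# because every client applied the same permutation.
rejoined = [left + right for left, right in zip(*shuffled)]
```

Because both partitions are permuted with the same seed, each row of rejoined is still a complete original record; only the row order differs from the original table.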

Paper Structure

This paper contains 30 sections, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: Traditional VFL architecture for prediction model.
  • Figure 2: Conditional GAN (centralized).
  • Figure 3: The workflow of VT-GAN.
  • Figure 4: VT-GAN training without shuffling.
  • Figure 5: VT-GAN training-with-shuffling.
  • ...and 9 more figures