Bayesian Generative Adversarial Networks via Gaussian Approximation for Tabular Data Synthesis

Bahrul Ilmi Nasution, Mark Elliot, Richard Allmendinger

TL;DR

This paper introduces Gaussian Approximation of CTGAN (GACTGAN), which integrates the Bayesian posterior approximation technique Stochastic Weight Averaging-Gaussian (SWAG) into the CTGAN generator to synthesise tabular data while reducing computational overhead after the training phase.

Abstract

Generative Adversarial Networks (GANs) have been used in many studies to synthesise mixed tabular data. Conditional tabular GAN (CTGAN) has been the most popular variant but struggles to navigate the risk-utility trade-off effectively. Bayesian GANs have received less attention for tabular data, but have been explored with unstructured data such as images and text. The most widely used technique in Bayesian GANs is Markov Chain Monte Carlo (MCMC), but it is computationally intensive, particularly in terms of weight storage. In this paper, we introduce Gaussian Approximation of CTGAN (GACTGAN), which integrates the Bayesian posterior approximation technique Stochastic Weight Averaging-Gaussian (SWAG) into the CTGAN generator to synthesise tabular data, reducing computational overhead after the training phase. We demonstrate that GACTGAN yields better synthetic data than CTGAN, achieving better preservation of tabular structure and inferential statistics with less privacy risk. These results highlight GACTGAN as a simpler, effective implementation of Bayesian tabular synthesis.

Paper Structure (28 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: One training iteration of GACTGAN. The generator and discriminator are trained as usual. After backpropagation in the generator, however, the weights are stored to update the mean and the mean of squared weights. The mean and the new weights are used to form the deviation matrix, while the diagonal covariance is constructed from the mean and the mean of squared weights.
  • Figure 2: Heatmap visualisation of correlation difference for the CA dataset across different generative models (columns) and loss functions (rows). The difference is computed by subtracting the real data correlation from the synthetic data correlation. Each inner $3 \times 3$ grid displays the pairwise correlation differences between synthetic and real data for the dataset's continuous variables. Lighter tiles indicate minimal deviation from the real data (better preservation of multivariate dependencies), while darker blue tiles indicate larger deviations. GACTGAN variants (right columns) predominantly exhibit lighter tones, demonstrating superior preservation of correlation structures compared to baselines like CTGAN and BayesCTGAN.
  • Figure 3: Descriptive statistics of synthetic and original data: (a) and (b) show the cross-tabulation of house ownership by sex in the UK, and (c) and (d) show the distribution of estimated salary and credit score in CH, respectively. Each subplot compares the synthetic data generated by different methods (CTGAN, BayesCTGAN, ACTGAN, GACTGAN with varying parameters) to the original data, highlighting the alignment between the synthetic and original distributions across different demographics and variables. The top and bottom parts of each subfigure show the Wasserstein and vanilla losses, respectively.
  • Figure 4: R-U map of CTGAN, Bayesian GAN, and GACTGAN. The purple lines indicate the candidate solutions on the Pareto front.
  • Figure 5: Utility-Risk map of GACTGAN based on the number of posterior samples. The blue line shows the trend across the numbers of samples.
  • ...and 1 more figures
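
The weight bookkeeping described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the generator weights are flattened into a single NumPy vector, and the names `SwagTracker`, `max_rank`, `update`, `diag_cov`, and `sample` are hypothetical. The sampling step follows the standard SWAG formulation (diagonal plus low-rank Gaussian); the paper's exact scaling and snapshot schedule may differ.

```python
import numpy as np


class SwagTracker:
    """Running SWAG moments for a flattened weight vector (illustrative only)."""

    def __init__(self, dim, max_rank=5):
        self.n = 0                     # number of weight snapshots seen so far
        self.mean = np.zeros(dim)      # running mean of the weights
        self.sq_mean = np.zeros(dim)   # running mean of the squared weights
        self.max_rank = max_rank       # hypothetical cap on stored deviations
        self.deviations = []           # most recent (weights - mean) vectors

    def update(self, weights):
        """Fold one post-backpropagation weight snapshot into the moments."""
        w = np.asarray(weights, dtype=float)
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n
        # Deviation of the new weights from the updated mean.
        self.deviations.append(w - self.mean)
        if len(self.deviations) > self.max_rank:
            self.deviations.pop(0)

    def diag_cov(self):
        """Diagonal covariance from the mean and the mean of squared weights."""
        return np.clip(self.sq_mean - self.mean ** 2, 0.0, None)

    def sample(self, rng):
        """Draw one weight vector from the Gaussian approximation."""
        z1 = rng.standard_normal(self.mean.shape[0])
        theta = self.mean + np.sqrt(self.diag_cov() / 2.0) * z1
        k = len(self.deviations)
        if k > 1:
            d = np.stack(self.deviations, axis=1)  # shape (dim, k)
            z2 = rng.standard_normal(k)
            theta = theta + d @ z2 / np.sqrt(2.0 * (k - 1))
        return theta


rng = np.random.default_rng(0)
tracker = SwagTracker(dim=3)
for _ in range(20):
    # Stand-in for generator weights after an optimiser step.
    tracker.update(rng.normal(loc=1.0, scale=0.1, size=3))
sampled_weights = tracker.sample(rng)
```

In this sketch only the two running moments and at most `max_rank` deviation vectors are kept, rather than a full chain of weight samples, which is the storage saving the abstract contrasts with MCMC. After training, each call to `sample` would give one set of generator weights to load before synthesising data.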