An improved tabular data generator with VAE-GMM integration

Patricia A. Apellániz, Juan Parras, Santiago Zazo

TL;DR

This work proposes a novel approach based on Variational Autoencoders enhanced with a Bayesian Gaussian Mixture (BGM) model that shows promise as a valuable tool for synthetic tabular data generation across diverse domains, particularly in healthcare.

Abstract

The growing use of machine learning across many fields calls for robust methods to generate synthetic tabular data. Such data should preserve the key characteristics of the real data while addressing data scarcity. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data, which often contain both continuous and discrete features with non-Gaussian distributions. We therefore propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture (BGM) model within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space and allows a more accurate representation of the underlying data distribution during data generation. Our model also offers enhanced flexibility by allowing various differentiable distributions to be used for individual features, making it possible to handle both continuous and discrete data types. We validate our model on three real-world datasets with mixed data types, including two medically relevant ones, evaluating both resemblance and utility. The evaluation shows that our model significantly outperforms CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.
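
To make the idea of per-feature distributions concrete, the following minimal sketch shows how a reconstruction log-likelihood for one mixed-type record could be assembled as a sum of per-feature terms. The choice of Gaussian, Bernoulli, and categorical likelihoods, the feature names, and all parameter values are illustrative assumptions; the abstract does not state which distributions the authors use.

```python
import math


def gaussian_log_lik(x, mean, std):
    """Log-density of a continuous feature under N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def bernoulli_log_lik(x, p):
    """Log-probability of a binary feature x in {0, 1} with P(x = 1) = p."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)


def categorical_log_lik(x, probs):
    """Log-probability of category index x under class probabilities probs."""
    return math.log(probs[x])


# Hypothetical record with one continuous, one binary, and one categorical feature.
record = {"age": 63.0, "smoker": 1, "stage": 2}

# Hypothetical decoder outputs: one set of distribution parameters per feature.
decoder_params = {
    "age": (60.0, 5.0),          # mean and standard deviation
    "smoker": 0.8,               # probability of smoker = 1
    "stage": [0.1, 0.2, 0.7],    # probabilities of the three stages
}

# Assuming features are conditionally independent given z, the record's
# log-likelihood is the sum of the per-feature terms.
log_lik = (
    gaussian_log_lik(record["age"], *decoder_params["age"])
    + bernoulli_log_lik(record["smoker"], decoder_params["smoker"])
    + categorical_log_lik(record["stage"], decoder_params["stage"])
)
print(log_lik)
```

Because each term is differentiable in the decoder's outputs, a sum of this kind can serve as the reconstruction part of a VAE training objective, whatever distribution is chosen for each column.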

Paper Structure (7 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Bayesian VAE vanilla model. $z$ represents the latent variable, while $x$ denotes the observable. $p_\theta(x\vert z)$ and $q_\phi(z\vert x)$ are the generative model and the variational approximation to the unknown true posterior $p(z\vert x)$, respectively.
  • Figure 2: Proposed model architecture. It is built on a standard VAE. After training, the latent space $z$ is modeled using a GMM. This creates a new space $z_{GM}$, which serves as the basis for generating new distribution parameters and ultimately sampling new data points.
  • Figure 3: Latent space comparison. 300 samples are shown from the latent space of each of the VAE, the TVAE, and the proposed model. The BGM-modeled latent space (orange) aligns more closely with $z$ (blue) than the TVAE's latent space (green) does. This illustrates the importance of the BGM for capturing the latent space distribution, which may in turn lead to higher-quality generated samples.
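
The generation step described in the Figure 2 caption can be sketched as follows: fit a Bayesian Gaussian mixture to the latent codes of a trained VAE, sample $z_{GM}$ from it, and pass the samples to the decoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the latent codes are simulated, `decode` is a hypothetical placeholder for the trained decoder, and the use of scikit-learn's `BayesianGaussianMixture` with these hyperparameters is our choice.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Stand-in for the latent codes produced by a trained VAE encoder (simulated):
# two clusters in a 2-D latent space, which a single N(0, I) describes poorly.
z_train = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(300, 2)),
    rng.normal(loc=[2.0, 1.0], scale=0.3, size=(300, 2)),
])

# Fit the mixture after VAE training. n_components is only an upper bound:
# the Bayesian prior can shrink the weights of components it does not need.
bgm = BayesianGaussianMixture(
    n_components=5,
    covariance_type="full",
    max_iter=500,
    random_state=0,
)
bgm.fit(z_train)

# Draw z_GM from the fitted mixture instead of from the standard normal prior.
z_gm, component = bgm.sample(n_samples=300)


def decode(z):
    """Hypothetical placeholder for the trained decoder p_theta(x | z).

    A real decoder would map each latent vector to per-feature
    distribution parameters, from which new records are sampled.
    """
    return z


params = decode(z_gm)
print(z_gm.shape, np.round(bgm.weights_, 2))
```

Sampling from the fitted mixture keeps generated latent vectors in regions that the encoder actually populated, which is the behavior the latent space comparison in Figure 3 is meant to show.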