High-Quality Tabular Data Generation using Post-Selected VAE

Volodymyr Shulakov

TL;DR

PSVAE addresses privacy-preserving synthetic tabular data generation by extending a variational autoencoder with an automatic loss-balancing mechanism and a post-selection step that refines decoder outputs. The method uses Mish activations, discretizes continuous features into $\min(\sqrt{N}, 100)$ buckets, and applies inverse-frequency weighting for imbalanced categories, optimizing $L = L_{RE} + \beta L_{KL}$ with an adaptive $\beta$. Experimental results on Brain Stroke, Diabetes, and Credit Card Fraud show lower $L_1$ distances and competitive or improved correlation and F1 scores, with faster runtimes than prior methods. The work highlights the importance of loss balancing, post-selection, and mixed-data handling, and suggests vector quantization as a future enhancement.
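The bucketing rule and the inverse-frequency weighting mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that $N$ is the number of rows, that $\sqrt{N}$ is rounded down, that buckets are equal-width, and that weights are normalized to sum to 1. The function names are hypothetical.

```python
import math
from collections import Counter


def num_buckets(n_rows: int) -> int:
    """Bucket count min(sqrt(N), 100); rounding down is an assumption."""
    return max(1, min(math.isqrt(n_rows), 100))


def discretize(values, n_buckets):
    """Map each value to an equal-width bucket index (equal width is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column collapses into a single bucket.
        return [0 for _ in values]
    width = (hi - lo) / n_buckets
    # The maximum value would land one past the end, so clamp it into the last bucket.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]


def inverse_frequency_weights(labels):
    """Weight each category by 1 / frequency, normalized so the weights sum to 1."""
    counts = Counter(labels)
    total = len(labels)
    raw = {category: total / count for category, count in counts.items()}
    norm = sum(raw.values())
    return {category: weight / norm for category, weight in raw.items()}
```

Under these assumptions, a rare category receives a proportionally larger weight, so errors on it contribute more to a weighted reconstruction loss than errors on a frequent category.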

Abstract

Synthetic tabular data is becoming a necessity as concerns about data privacy intensify. Tabular data is useful for testing systems, simulating real data, analyzing the data itself, or building predictive models, but it is often unavailable because of confidentiality constraints. Previous techniques, such as TVAE (Xu et al., 2019) or OCTGAN (Kim et al., 2021), either cannot handle particularly complex datasets or are complex themselves, resulting in inferior run time performance. This paper introduces PSVAE, a simple new model that produces high-quality synthetic data in less run time. PSVAE incorporates two key ideas: loss optimization and post-selection. In addition, the model compensates for underrepresented categories and uses a modern activation function, Mish (Misra, 2019).
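For readers unfamiliar with the building blocks named here, the following sketch shows the Mish activation, $x \tanh(\mathrm{softplus}(x))$, and the weighted objective $L = L_{RE} + \beta L_{KL}$. It assumes the usual VAE setting of a diagonal Gaussian posterior compared against a standard normal prior, and it treats $\beta$ as a given number; the adaptive rule for $\beta$ and the post-selection step are not reproduced. The function names are hypothetical.

```python
import numpy as np


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way.
    return x * np.tanh(np.logaddexp(0.0, x))


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch.

    The diagonal-Gaussian form is an assumption (the standard VAE choice).
    """
    per_sample = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(np.mean(per_sample))


def total_loss(l_re, l_kl, beta):
    """Combine reconstruction and KL terms as L = L_RE + beta * L_KL."""
    return l_re + beta * l_kl
```

A larger `beta` pulls the latent distribution toward the prior at the cost of reconstruction accuracy, which is why the balance between the two terms matters.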

Paper Structure (5 sections, 3 equations, 2 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Post-Selected VAE-based Data Generator
Experiments
Conclusion

Figures (2)

Figure 1: Illustration of the synthetic data generation workflow of PSVAE
Figure 2:
