Synthesizing Tabular Data Using Selectivity Enhanced Generative Adversarial Networks
Youran Zhou, Jianzhong Qi
TL;DR
This work targets synthetic tabular data generation for E-commerce stress testing by enforcing query selectivity constraints during GAN training. It introduces SelGAN, a flexible, selectivity-enhanced framework that transforms mixed-type data, employs a pre-trained selectivity estimator, and augments the generator loss so that synthetic data satisfies the selectivity constraints. On five real-world datasets, SelGAN improves selectivity estimation accuracy by up to about 20% (lower MSE) and downstream ML utility by up to about 6% in F1 and about 20% in MSE, compared with state-of-the-art tabular GAN and VAE baselines. The approach is modular and can be applied to other base GANs, paving the way for more realistic, privacy-preserving synthetic data suitable for workflow and resource planning in large-scale transaction systems.
Abstract
As E-commerce platforms face surging transactions during major shopping events like Black Friday, stress testing with synthesized data is crucial for resource planning. Most recent studies use Generative Adversarial Networks (GANs) to generate tabular data while ensuring privacy and machine learning utility. However, these methods overlook the computational demands of processing GAN-generated data, making them unsuitable for E-commerce stress testing. This thesis introduces a novel GAN-based approach incorporating query selectivity constraints, a key factor in database transaction processing. We integrate a pre-trained deep neural network to maintain selectivity consistency between real and synthetic data. Our method, tested on five real-world datasets, outperforms three state-of-the-art GANs and a VAE model, improving selectivity estimation accuracy by up to 20% and machine learning utility by up to 6%.
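To make the idea of a selectivity constraint concrete, the following is a minimal sketch rather than the SelGAN implementation. It assumes numeric columns and conjunctive range queries, and it computes selectivity by exact counting as a stand-in for the pre-trained neural estimator described above. The function names (`selectivity`, `selectivity_penalty`, `augmented_generator_loss`), the `weight` parameter, and the toy tables are all hypothetical.

```python
import numpy as np

def selectivity(table, lows, highs):
    """Fraction of rows with lows[j] <= row[j] <= highs[j] for every column j."""
    inside = np.all((table >= lows) & (table <= highs), axis=1)
    return float(inside.mean())

def selectivity_penalty(real, synth, queries):
    """Mean squared gap between real and synthetic selectivity over a query workload."""
    gaps = [selectivity(real, lo, hi) - selectivity(synth, lo, hi) for lo, hi in queries]
    return float(np.mean(np.square(gaps)))

def augmented_generator_loss(adv_loss, real, synth, queries, weight=1.0):
    """Adversarial loss plus a weighted selectivity term (weight is an assumed hyperparameter)."""
    return adv_loss + weight * selectivity_penalty(real, synth, queries)

# Toy data: one "real" table and two synthetic candidates (all values made up).
real = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
synth_good = np.array([[1.5, 12.0], [2.5, 18.0], [3.5, 33.0], [3.9, 41.0]])
synth_bad = np.array([[9.0, 90.0]] * 4)

# Each query is a (lows, highs) pair of per-column bounds.
queries = [
    (np.array([0.0, 0.0]), np.array([2.5, 25.0])),
    (np.array([2.0, 15.0]), np.array([5.0, 50.0])),
]

print(selectivity_penalty(real, synth_good, queries))  # 0.0
print(selectivity_penalty(real, synth_bad, queries))   # 0.40625
```

The synthetic table whose rows answer the queries in the same proportions as the real table incurs no penalty, while the one concentrated outside every query range is penalized. Exact counting like this provides no useful gradient, so it only illustrates the quantity being matched; a trainable version would need a differentiable estimator in its place.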
