A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis
Minh H. Vu, Daniel Edler, Carl Wibom, Tommy Löfstedt, Beatrice Melin, Martin Rosvall
TL;DR
This work tackles synthetic tabular data generation by proposing a correlation- and mean-aware regularizer for GAN-based models, addressing the challenges of mixed-type features and inter-feature dependencies in medical data. It provides a formal loss formulation with toggleable correlation and mean terms, integrated into TVAE and applicable to multiple tabular GANs. A rigorous benchmarking framework evaluates the approach on ten real-world datasets against eight baselines, using Friedman and Nemenyi tests to assess statistical similarity, augmentation potential, and downstream task performance. Across diverse settings, the combined correlation- and mean-aware loss ($c_1m_1$) frequently improves synthetic data quality and downstream ML performance, supporting easier data sharing while acknowledging model- and dataset-specific variability.
Abstract
Advancements in science rely on data sharing. In medicine, where personal data are often involved, synthetic tabular data generated by generative adversarial networks (GANs) offer a promising avenue. However, existing GANs struggle to capture the complexities of real-world tabular data, which often contain a mix of continuous and categorical variables with potential imbalances and dependencies. We propose a novel correlation- and mean-aware loss function that acts as a regularizer for GANs and is designed to address these challenges. To ensure a rigorous evaluation, we establish a comprehensive benchmarking framework using ten real-world datasets and eight established tabular GAN baselines. The proposed loss function yields statistically significant improvements over existing methods in capturing the true data distribution, enhancing the quality of synthetic data generated with GANs. The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning (ML) tasks, ultimately paving the way for easier data sharing.
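
To make the idea of a correlation- and mean-aware regularizer more concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes the penalty is the mean squared difference between the real and synthetic feature-correlation matrices plus the mean squared difference between the per-feature means, and that the two terms can be switched on and off independently. The function names and the `use_corr` / `use_mean` flags are hypothetical; they are only meant to mirror the toggles behind a label such as $c_1m_1$.

```python
import math


def column_means(rows):
    """Per-feature means of a batch given as a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def correlation_matrix(rows):
    """Pearson correlation matrix of the features (columns) of a batch."""
    cols = list(zip(*rows))
    means = column_means(rows)
    centered = [[x - m for x in col] for col, m in zip(cols, means)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    d = len(cols)
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            denom = norms[i] * norms[j]
            if denom > 0:  # constant features keep a correlation of 0
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / denom
    return corr


def corr_mean_penalty(real, fake, use_corr=True, use_mean=True):
    """Assumed regularizer: squared gaps in correlations and in feature means.

    use_corr and use_mean are hypothetical toggles for the two terms;
    enabling both corresponds to the combined setting.
    """
    penalty = 0.0
    if use_corr:
        c_real, c_fake = correlation_matrix(real), correlation_matrix(fake)
        d = len(c_real)
        penalty += sum(
            (c_real[i][j] - c_fake[i][j]) ** 2 for i in range(d) for j in range(d)
        ) / (d * d)
    if use_mean:
        m_real, m_fake = column_means(real), column_means(fake)
        penalty += sum((a - b) ** 2 for a, b in zip(m_real, m_fake)) / len(m_real)
    return penalty


real_batch = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]      # features move together
flipped_batch = [[1.0, 6.0], [2.0, 4.0], [3.0, 2.0]]   # same means, opposite correlation
print(corr_mean_penalty(real_batch, flipped_batch))
print(corr_mean_penalty(real_batch, flipped_batch, use_corr=False))
```

In this toy batch the synthetic rows reproduce the real feature means but invert the dependency between the two features, so only the correlation term contributes to the penalty. In a GAN training loop, a term of this kind would typically be scaled by a weight and added to the generator loss; that weighting is likewise an assumption here.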
