A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis

Minh H. Vu; Daniel Edler; Carl Wibom; Tommy Löfstedt; Beatrice Melin; Martin Rosvall

TL;DR

This work tackles synthetic tabular data generation by proposing a correlation- and mean-aware regularizer for GAN-based models, addressing the challenges of mixed-type features and inter-feature dependencies in medical data. It provides a formal loss formulation with toggleable correlation and mean terms, integrated into TVAE and applicable to multiple tabular GANs. A rigorous benchmarking framework evaluates ten real-world datasets against eight baselines, using Friedman and Nemenyi tests to assess statistical similarity, augmentation potential, and downstream task performance. Across diverse settings, the combined correlation- and mean-aware loss ($c_1m_1$) frequently improves synthetic data quality and downstream ML performance, supporting easier data sharing while acknowledging model- and dataset-specific variability.
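
To make the idea of toggleable correlation and mean terms concrete, the following is a minimal sketch, not the paper's formulation. It assumes purely numeric features and squared differences between real and synthetic batch statistics. The function names and the `use_corr` / `use_mean` flags are hypothetical. With both flags enabled, it loosely corresponds to the combined $c_1m_1$ setting.

```python
import math


def column_means(rows):
    """Per-feature mean of a batch given as a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def correlation_matrix(rows):
    """Pearson correlation matrix of the features in a batch.

    Entries involving a constant feature are set to 0.0 to avoid division by zero.
    """
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    centered = [[x - m for x in c] for c, m in zip(cols, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    d = len(cols)
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            denom = norms[i] * norms[j]
            if denom > 0:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / denom
    return corr


def regularizer(real, fake, use_corr=True, use_mean=True):
    """Illustrative penalty added to a generator loss (assumed form).

    use_corr: penalize the mean squared gap between correlation matrices.
    use_mean: penalize the mean squared gap between feature means.
    """
    penalty = 0.0
    if use_corr:
        cr, cf = correlation_matrix(real), correlation_matrix(fake)
        d = len(cr)
        penalty += sum(
            (cr[i][j] - cf[i][j]) ** 2 for i in range(d) for j in range(d)
        ) / (d * d)
    if use_mean:
        mr, mf = column_means(real), column_means(fake)
        penalty += sum((a - b) ** 2 for a, b in zip(mr, mf)) / len(mr)
    return penalty


real = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
flipped = [[1.0, 6.0], [2.0, 4.0], [3.0, 2.0]]  # same means, opposite correlation
shifted = [[2.0, 2.0], [3.0, 4.0], [4.0, 6.0]]  # same correlation, shifted mean

corr_only = regularizer(real, flipped, use_corr=True, use_mean=False)
mean_only = regularizer(real, shifted, use_corr=False, use_mean=True)
```

In this toy batch, `flipped` is penalized only by the correlation term and `shifted` only by the mean term, which is the behaviour the two toggles are meant to isolate. In a real GAN, the penalty would be computed on differentiable tensors and added to the generator loss with a weighting factor; both of those details are left out here.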

Abstract

Advancements in science rely on data sharing. In medicine, where personal data are often involved, synthetic tabular data generated by generative adversarial networks (GANs) offer a promising avenue. However, existing GANs struggle to capture the complexities of real-world tabular data, which often contain a mix of continuous and categorical variables with potential imbalances and dependencies. We propose a novel correlation- and mean-aware loss function designed to address these challenges as a regularizer for GANs. To ensure a rigorous evaluation, we establish a comprehensive benchmarking framework using ten real-world datasets and eight established tabular GAN baselines. The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution, significantly enhancing the quality of synthetic data generated with GANs. The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream machine learning (ML) tasks, ultimately paving the way for easier data sharing.

Paper Structure (15 sections, 8 equations, 15 figures, 15 tables)

Introduction
Related Work
Methods
Correlation- and Mean-Aware Loss Function
Statistical Tests
Benchmarking Framework
Experiments
Datasets
Implementation Details and Training
Results and Discussion
Conclusion
Supplementary Results
Statistical Tests
Dataset-Specific Quantitative Results
Method-Specific Quantitative Results

Figures (15)

Figure 1: A standard for tabular data.
Figure 2: Proposed benchmarking framework.
Figure 3: Distributions of the statistical evaluation metrics achieved with different loss functions for the Optimal models. All performance scores range from $0$ to $1$, with higher values indicating better performance.
Figure 4: Distributions of the evaluation metrics for the classification task achieved with different loss functions for the Optimal models. All performance scores range from $0$ to $1$, with higher values indicating better performance.
Figure 5: Distributions of the augmentation evaluation metrics for the classification task achieved with different loss functions for the Optimal models. All performance scores range from $0$ to $1$, with higher values indicating better performance.
...and 10 more figures
