An evaluation framework for synthetic data generation models

Ioannis E. Livieris; Nikos Alimpertis; George Domalis; Dimitris Tsakalidis

TL;DR

The paper addresses the challenge of assessing synthetic data quality under privacy constraints by proposing a statistically grounded, multivariate evaluation framework. It ranks synthetic data generation models using a suite of tests (Wasserstein-Cramer's V, Novelty, Domain classifier, and Anomaly detection), with statistical significance assessed via the Friedman Aligned-Ranks test and the Finner post-hoc test. The framework is demonstrated on two real-world tabular datasets (Travel Review Ratings and Obesity risk), comparing multiple generators (GMM, Gaussian Copula, CTGAN, TVAE, CopulaGAN) and yielding nuanced rankings rather than a single best model. The approach is flexible and transferable to other data modalities; future work includes extensions to image data, additional evaluation tests, and weighting schemes.
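To make the ranking step concrete, the following is a minimal sketch of a Friedman Aligned-Ranks test followed by a Finner p-value adjustment, written from the textbook formulas rather than taken from the authors' repository. The function names `friedman_aligned_ranks` and `finner_adjust`, the score layout (rows are evaluation cases, columns are generators, lower is better), and the numbers in the usage example are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2, rankdata


def friedman_aligned_ranks(scores):
    """Friedman Aligned-Ranks test (illustrative sketch).

    scores: array of shape (n_cases, k_models); lower is assumed better.
    Returns the test statistic, its chi-square p-value (k - 1 degrees of
    freedom), and the mean aligned rank of each model.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    # Align each row by removing its location (the row mean over models).
    aligned = scores - scores.mean(axis=1, keepdims=True)
    # Rank all n * k aligned values jointly; ties receive average ranks.
    ranks = rankdata(aligned.ravel()).reshape(n, k)
    col_totals = ranks.sum(axis=0)  # rank total per model
    row_totals = ranks.sum(axis=1)  # rank total per evaluation case
    numerator = (k - 1) * (
        np.sum(col_totals ** 2) - (k * n ** 2 / 4.0) * (k * n + 1) ** 2
    )
    denominator = (
        k * n * (k * n + 1) * (2 * k * n + 1) / 6.0
        - np.sum(row_totals ** 2) / k
    )
    statistic = numerator / denominator
    p_value = chi2.sf(statistic, k - 1)
    return statistic, p_value, ranks.mean(axis=0)


def finner_adjust(p_values):
    """Finner step-down adjustment of m unadjusted p-values."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # For the i-th smallest p-value: 1 - (1 - p_(i)) ** (m / i).
    adj_sorted = 1.0 - (1.0 - p[order]) ** (m / np.arange(1, m + 1))
    # Enforce monotonicity over the sorted sequence and cap at 1.
    adj_sorted = np.minimum(1.0, np.maximum.accumulate(adj_sorted))
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted  # restore the caller's original order
    return adjusted


if __name__ == "__main__":
    # Invented scores for 4 evaluation cases and 3 hypothetical generators.
    demo = np.array([
        [0.10, 0.31, 0.52],
        [0.22, 0.18, 0.61],
        [0.15, 0.40, 0.47],
        [0.09, 0.27, 0.70],
    ])
    stat, p, mean_ranks = friedman_aligned_ranks(demo)
    print("statistic:", stat, "p-value:", p)
    print("mean aligned ranks:", mean_ranks)
```

In this sketch the model with the lowest mean aligned rank would be treated as the control, and `finner_adjust` would be applied to the unadjusted p-values of its pairwise comparisons against the other models; how the paper combines the individual test scores before ranking is not specified here and is left out.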

Abstract

Synthetic data has gained popularity as a cost-efficient strategy for data augmentation, both to improve the performance of machine learning models and to address privacy concerns around sensitive data. Ensuring the quality of generated synthetic data, in terms of how accurately it represents the real data, is therefore of primary importance. In this work, we present a new framework for evaluating the ability of synthetic data generation models to produce high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation framework and the ranking of the compared models. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data. The implementation code can be found at https://github.com/novelcore/synthetic_data_evaluation_framework.
