An evaluation framework for synthetic data generation models
Ioannis E. Livieris, Nikos Alimpertis, George Domalis, Dimitris Tsakalidis
TL;DR
The paper addresses the challenge of assessing synthetic data quality under privacy constraints by proposing a statistically grounded, multivariate evaluation framework. It ranks synthetic data generation models using a suite of tests—Wasserstein-Cramer's V, Novelty, Domain classifier, and Anomaly detection—with statistical significance assessed via Friedman Aligned-Ranks and Finner post-hoc tests. The framework is demonstrated on two real-world tabular datasets (Travel Review Ratings and Obesity risk), comparing multiple generators (GMM, Gaussian Copula, CTGAN, TVAE, CopulaGAN) and yielding nuanced rankings rather than a single best model. The approach is flexible and transferable to other data modalities, with future work including extensions to image data and additional evaluation tests and weighting schemes.
Abstract
The use of synthetic data has gained popularity as a cost-efficient strategy for data augmentation, both to improve the performance of machine learning models and to address privacy concerns around sensitive data. Ensuring the quality of generated synthetic data, in terms of how accurately it represents the real data, is therefore of primary importance. In this work, we present a new framework for evaluating the ability of synthetic data generation models to produce high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation and the ranking of the compared models. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data. The implementation code is available at https://github.com/novelcore/synthetic_data_evaluation_framework.
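To illustrate how a ranking based on Friedman Aligned-Ranks can be derived from per-test scores, the following is a minimal sketch, not the authors' implementation (see the repository above for that). The function name `aligned_ranks`, the input layout (one score per model per evaluation test), and the convention that lower scores are better are all assumptions made for illustration. The scores in the usage example are invented and are not results from the paper.

```python
from statistics import mean


def aligned_ranks(scores):
    """Average Friedman aligned rank per model.

    scores: dict mapping a model name to a list of scores, one per
    evaluation test (block). Assumption: lower score = better, so a
    lower average aligned rank indicates a better model.
    """
    models = list(scores)
    n_blocks = len(next(iter(scores.values())))

    # Step 1: align. Subtract each block's mean score so that values
    # from different tests become comparable.
    aligned = []  # list of (aligned value, model)
    for b in range(n_blocks):
        block_mean = mean(scores[m][b] for m in models)
        for m in models:
            aligned.append((scores[m][b] - block_mean, m))

    # Step 2: rank all aligned values jointly (1 = smallest),
    # assigning the average rank to tied values.
    aligned.sort(key=lambda pair: pair[0])
    ranks = {m: [] for m in models}
    i = 0
    while i < len(aligned):
        j = i
        while j < len(aligned) and aligned[j][0] == aligned[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for _, m in aligned[i:j]:
            ranks[m].append(avg_rank)
        i = j

    # Step 3: average each model's ranks over all blocks.
    return {m: mean(r) for m, r in ranks.items()}


# Hypothetical scores for three models on two tests (illustrative only).
example = {"A": [1, 2], "B": [2, 4], "C": [3, 6]}
print(aligned_ranks(example))  # {'A': 1.5, 'B': 3.5, 'C': 5.5}
```

The average aligned ranks would then feed the Friedman Aligned-Ranks test statistic and a post-hoc procedure such as Finner's, which this sketch does not cover.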
