SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data

Anton Danholt Lautrup; Tobias Hyrup; Arthur Zimek; Peter Schneider-Kamp

TL;DR

SynthEval addresses the lack of standardized, multi-metric evaluation for tabular synthetic data by providing an open-source, modular framework that treats numeric and categorical attributes uniformly. It offers a comprehensive library of utility and privacy metrics, configurable presets, and a multi-axis benchmark module that supports ranking of multiple synthetic datasets. A nearest-neighbour approach using Gower distance enables mixed-type data without one-hot encoding, facilitating scalable, unbiased comparisons. Through a real-world Hepatitis C dataset demonstration, the paper shows how SynthEval enables flexible benchmarking, model tuning for privacy-utility trade-offs, and clear identification of strengths and weaknesses across generative models, advancing reproducible evaluation of synthetic tabular data.
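The mixed-type nearest-neighbour idea mentioned above can be illustrated with a minimal sketch of Gower distance. This is not SynthEval's own implementation: the function names `gower_distances` and `nearest_synthetic_neighbour` and the toy data are hypothetical, and the sketch assumes numeric and categorical columns are passed as separate arrays. Numeric columns contribute a range-normalised absolute difference, categorical columns contribute 0 for a match and 1 for a mismatch, and the result is averaged over all columns, so no one-hot encoding is needed.

```python
import numpy as np


def gower_distances(real_num, real_cat, syn_num, syn_cat):
    """Pairwise Gower distance between rows of a real and a synthetic table.

    Hypothetical helper for illustration only.
    real_num / syn_num: numeric columns, shapes (n, p_num) and (m, p_num)
    real_cat / syn_cat: categorical labels, shapes (n, p_cat) and (m, p_cat)
    Returns an (n, m) matrix with values in [0, 1].
    """
    real_num = np.asarray(real_num, dtype=float)
    syn_num = np.asarray(syn_num, dtype=float)
    real_cat = np.asarray(real_cat)
    syn_cat = np.asarray(syn_cat)

    # Range of each numeric column over both tables; avoid division by zero
    both = np.vstack([real_num, syn_num])
    ranges = both.max(axis=0) - both.min(axis=0)
    ranges[ranges == 0] = 1.0

    # Numeric part: range-normalised absolute difference, summed over columns
    diff = np.abs(real_num[:, None, :] - syn_num[None, :, :])
    num_part = (diff / ranges).sum(axis=2)

    # Categorical part: 1 per mismatching column, 0 per matching column
    cat_part = (real_cat[:, None, :] != syn_cat[None, :, :]).sum(axis=2)

    n_features = real_num.shape[1] + real_cat.shape[1]
    return (num_part + cat_part) / n_features


def nearest_synthetic_neighbour(real_num, real_cat, syn_num, syn_cat):
    """For each real row, index of and distance to its closest synthetic row."""
    d = gower_distances(real_num, real_cat, syn_num, syn_cat)
    return d.argmin(axis=1), d.min(axis=1)


# Toy data (made up): two numeric and two categorical columns
real_num = [[30, 1.0], [50, 3.0]]
real_cat = [["m", "yes"], ["f", "no"]]
syn_num = [[30, 1.0], [40, 3.0]]
syn_cat = [["m", "yes"], ["f", "yes"]]

nn_index, nn_dist = nearest_synthetic_neighbour(real_num, real_cat, syn_num, syn_cat)
```

In this toy case the first real record has an exact synthetic copy (distance 0), which is the kind of signal a distance-to-closest-record privacy check would look for.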

Abstract

With the growing demand for synthetic data to address contemporary issues in machine learning, such as data scarcity, data fairness, and data privacy, having robust tools for assessing the utility and potential privacy risks of such data becomes crucial. SynthEval, a novel open-source evaluation framework, distinguishes itself from existing tools by treating categorical and numerical attributes with equal care, without assuming any special kind of preprocessing steps. This makes it applicable to virtually any synthetic dataset of tabular records. Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity. SynthEval integrates a wide selection of metrics that can be used independently or in highly customisable benchmark configurations, and can easily be extended with additional metrics. In this paper, we describe SynthEval and illustrate its versatility with examples. The framework facilitates better benchmarking and more consistent comparisons of model capabilities.

Paper Structure (38 sections, 3 equations, 9 figures, 3 tables)

Background and Motivation
Related Work
Methods and Technical Solutions
Library of Metrics
Heterogeneous Data
Pre-configurations
Extensibility
Benchmark Module
Ranking Strategies
Application of SynthEval
An Example of Dataset Benchmarking
Preparation
Generating Synthetic Data
Model Tuning and Selection
Evaluation and Comparison
...and 23 more sections

Figures (9)

Figure 1: Average Relative Error vs. Attribute Type Proportion. The figure shows how SynthEval, SynthCity, and TableEvaluator each deviate from the baseline (all columns used) when the categorical and numerical columns are subsampled in various mixtures, with 10 repetitions, on the same dataset (the Egyptian Hepatitis C dataset, see Table \ref{tab:data}, with a synthetic version generated using synthpop \cite{Nowok2016}). The metric used was the "Similarity score" in TableEvaluator and the sum of results from a selection of metrics in both SynthEval and SynthCity. Specifically: corr_diff, mi_diff, ks_test, h_dist, nnaa, eps_risk, and dcr for SynthEval, and close_values_probability, chi_squared_test, feature_corr, inv_kl_divergence, ks_test, nearest_syn_neighbor_distance, jensenshannon_dist, max_mean_discrepancy, and identifiability_score for SynthCity (the last four were taken as one minus their value for the summation). SynthEval deviates by only about $1\%$ in all mixtures, whereas TableEvaluator is only somewhat consistent in the intermediate mixtures, at around $7$ to $8\%$ error, and fluctuates more at the extremes. SynthCity is less erratic but shows a steady decrease in error when fewer categorical columns are used.
Figure 2: Sketch of the SynthEval Framework. The diagram shows the two primary workflows contained within SynthEval: single dataset evaluation and multi-dataset benchmarking. The standard evaluation module allows the generation of detailed evaluation reports on a wide selection of metrics, using a preset configuration and/or a manual selection of metrics (including custom metrics). The benchmark module enables the evaluation of multiple synthetic datasets simultaneously and returns the results in a joint table. Furthermore, the benchmark module ranks the results according to a specified ranking strategy, facilitating the identification of standout datasets. Finally, the individual metrics can also be accessed without entering the framework (not shown in the figure), requiring real data and synthetic data as inputs. In this configuration, the metrics can access the preprocessing utilities and handle this step themselves.
Figure 3: Mixed Correlation and Mutual Information Matrix Difference for the Optimised BN. Both plots are produced by SynthEval. Left: The mixed correlation difference map of the real and synthetic data. Although the BN model had one of the worst correlation difference coefficients in Table \ref{tab:results}, the map shows that this is only because a few variable interplays have been misrepresented. Right: The mutual information difference map of the real and synthetic data. Some of the same misrepresentations of variable relationships seen on the left are also found with this approach.
Figure 4: Kolmogorov-Smirnov test, significantly dissimilar variables of the optimised BN dataset. This figure shows how the values in the 12 variables that the KS test identified as significantly dissimilar are distributed. As is evident, with these hyperparameters the optimised BN generative model tends to balance the categorical attributes and to fill the gaps between outliers and the main population for numericals.
Figure 5: Outline of Model Benchmark Study Design. A selection of generative models is applied to a large collection of benchmark datasets. The resulting synthetic datasets are ranked internally on utility and privacy, and the results across all the benchmark datasets are aggregated to select the model that best solves the problem specification. It may also be worthwhile to look into subclusters of the benchmark datasets, as some models may perform better or worse in particular niches, e.g., large or small datasets, or datasets with many binary variables.
...and 4 more figures
