Benchmarking the Fidelity and Utility of Synthetic Relational Data

Valter Hudovernik, Martin Jurkovič, Erik Štrumbelj

TL;DR

This work combines the best practices and a novel robust detection approach into a benchmarking tool and uses it to compare six methods, including two commercial tools, and concludes that no method is able to synthesize a dataset that is indistinguishable from original data.

Abstract

Synthesizing relational data has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reason, benchmarking methods for synthesizing relational data introduces new challenges. Our work is motivated by a lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational data synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine the best practices and a novel robust detection approach into a benchmarking tool and use it to compare six methods, including two commercial tools. While some methods are better than others, no method is able to synthesize a dataset that is indistinguishable from original data. For utility, we typically observe moderate correlation between real and synthetic data for both model predictive performance and feature importance.

Paper Structure (41 sections, 7 figures, 10 tables, 2 algorithms)

Figures (7)

  • Figure 1: Examples of marginal distributions on the Rossmann Dataset. Deep learning-based methods generally synthesise both categorical and continuous marginal distributions well enough to pass the eye test.
  • Figure 2: Maximum mean discrepancy (a) and pairwise correlation difference (b) on the Rossmann dataset. The dotted line indicates the 95% bootstrapped confidence interval of the metric on original data. Most methods model the parent table (store) better since the tests find more differences for the child table (historical). This example also illustrates the importance of interpreting a metric in the context of its uncertainty on original data.
  • Figure 3: Discrimination accuracy for DD and DD with aggregation. The results are for the parent tables. The red dashed line marks the expected 50% accuracy for perfectly generated data.
  • Figure 4: Feature importance for DD with aggregation using XGBoost. Results are for the best performing methods (lowest accuracy of DD). The added features that incorporate relational information (red) are the most important for discriminating between real and synthetic data.
  • Figure 5: Partial dependence plots. Results are for the 1st and 4th most important features from Figure 4 (panel b). With ideally generated synthetic data, no feature could discriminate between synthetic and original data, and every partial dependence plot would be a horizontal line at 50% probability. We can observe that (a) the synthetic data have too many unique actor cast numbers (higher probability of being synthetic when the feature value is larger than 4) and (b) the mean movie ratings in the original data vary more than in the synthetic data, where they are more concentrated around 3.5.
  • ...and 2 more figures
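
The Figure 2 caption stresses reading a metric against its uncertainty on original data. The following is a minimal sketch of that idea, not the paper's implementation. It assumes the pairwise correlation difference is the Frobenius norm of the difference between two Pearson correlation matrices, and that the reference interval comes from comparing the original table with bootstrap resamples of itself. The function names, the toy two-column data, and the bootstrap scheme are all illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / math.sqrt(vx * vy)


def pairwise_correlation_difference(a, b):
    """Frobenius norm of the difference between two correlation matrices.

    `a` and `b` are lists of rows sharing the same numeric columns.
    """
    cols_a, cols_b = list(zip(*a)), list(zip(*b))
    k = len(cols_a)
    total = 0.0
    for i in range(k):
        for j in range(k):
            diff = pearson(cols_a[i], cols_a[j]) - pearson(cols_b[i], cols_b[j])
            total += diff ** 2
    return math.sqrt(total)


def reference_interval(original, n_boot=200, level=0.95, seed=0):
    """Percentile interval of the metric between the original table and
    bootstrap resamples of itself (an assumed scheme, for illustration)."""
    rng = random.Random(seed)
    n = len(original)
    scores = []
    for _ in range(n_boot):
        resample = [original[rng.randrange(n)] for _ in range(n)]
        scores.append(pairwise_correlation_difference(original, resample))
    scores.sort()
    lo = scores[int((1 - level) / 2 * (n_boot - 1))]
    hi = scores[int((1 + level) / 2 * (n_boot - 1))]
    return lo, hi


def make_rows(n, rho, rng):
    """Toy table with two standard-normal columns correlated at `rho`."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        rows.append((x, y))
    return rows


rng = random.Random(42)
original = make_rows(300, 0.8, rng)   # stands in for a real table
synthetic = make_rows(300, 0.0, rng)  # a "generator" that loses the correlation

lo, hi = reference_interval(original)
score = pairwise_correlation_difference(original, synthetic)
print(f"reference interval: [{lo:.3f}, {hi:.3f}], synthetic score: {score:.3f}")
```

A synthetic score above the upper bound of the reference interval suggests a difference beyond what resampling noise in the original data would explain. A score inside the interval is not evidence of a difference under this metric.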