Structured Evaluation of Synthetic Tabular Data

Scott Cheng-Hsin Yang; Baxter Eaves; Michael Schmidt; Ken Swanson; Patrick Shafto

TL;DR

The paper introduces a Structured Evaluation Framework for synthetic tabular data, unifying disparate evaluation metrics under the objective that synthetic samples should be drawn from the same joint distribution as the real data, $Q=P$ with $S \sim Q$. It decomposes distributions into a spectrum of substructures (marginals, pairwise, leave-one-out conditionals, full joint, missingness) and links both model-free and model-based metrics, including a PCC-based surrogate, to these substructures. Through experiments on eight synthesizers across three datasets, the authors show that methods explicitly modeling tabular structure, such as SynthPop and PCC, deliver superior performance, particularly on smaller datasets, and that PCC-based metrics provide robust, coherent evaluation aligned with model-based surrogates. The framework offers practical guidance for metric selection, baseline design, and future metric development, with open-source implementations to facilitate adoption and extension.
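One way to read the spectrum of substructures is as a chain of necessary conditions implied by the objective $Q=P$. The notation below is an assumption introduced for illustration (columns indexed by $i$ and $j$, with $x_{-i}$ denoting all columns except $i$); the paper's own equations may be written differently.

$$
Q = P \;\Longrightarrow\;
\begin{cases}
Q(x_i) = P(x_i) & \text{(marginals)} \\
Q(x_i, x_j) = P(x_i, x_j) & \text{(pairwise)} \\
Q(x_i \mid x_{-i}) = P(x_i \mid x_{-i}) & \text{(leave-one-out conditionals)}
\end{cases}
$$

The implication runs in one direction only: agreement on marginals or pairs does not by itself guarantee agreement on the full joint, which is why a set of metrics confined to the low end of the spectrum cannot be complete.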

Abstract

Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of these metrics. To address this issue, we propose an evaluation framework with a single mathematical objective: the synthetic data should be drawn from the same distribution as the observed data. Through various structural decompositions of the objective, this framework allows us to reason, for the first time, about the completeness of any set of metrics, and it unifies existing metrics, including those that stem from fidelity considerations, downstream applications, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.
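The idea of scoring a synthesizer at different points along the spectrum can be sketched with a small model-free example. This is a minimal illustration, not the paper's implementation: total variation distance is an assumed stand-in for the actual metrics, and the function names `total_variation`, `marginal_scores`, and `pairwise_scores`, as well as the toy rows, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def total_variation(real_values, synth_values):
    """Total variation distance between two empirical categorical distributions."""
    p, q = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    support = set(p) | set(q)
    # Counter returns 0 for unseen values, so missing categories are handled.
    return 0.5 * sum(abs(p[v] / n_p - q[v] / n_q) for v in support)


def marginal_scores(real_rows, synth_rows, columns):
    """Distance between real and synthetic data for each single column."""
    return {
        c: total_variation([r[c] for r in real_rows], [s[c] for s in synth_rows])
        for c in columns
    }


def pairwise_scores(real_rows, synth_rows, columns):
    """Distance between real and synthetic data for each pair of columns."""
    return {
        (a, b): total_variation(
            [(r[a], r[b]) for r in real_rows],
            [(s[a], s[b]) for s in synth_rows],
        )
        for a, b in combinations(columns, 2)
    }


# Toy data (hypothetical): the columns are independent in the real rows
# but perfectly correlated in the synthetic rows.
real = [
    {"sex": "F", "smoker": "yes"},
    {"sex": "F", "smoker": "no"},
    {"sex": "M", "smoker": "yes"},
    {"sex": "M", "smoker": "no"},
]
synth = [
    {"sex": "F", "smoker": "yes"},
    {"sex": "F", "smoker": "yes"},
    {"sex": "M", "smoker": "no"},
    {"sex": "M", "smoker": "no"},
]

columns = ["sex", "smoker"]
print(marginal_scores(real, synth, columns))
print(pairwise_scores(real, synth, columns))
```

In this toy case every marginal distance is zero while the pairwise distance is not, so a metric set that stops at marginals would rate the synthetic rows as perfect. That gap is the kind of incompleteness the structured framework is meant to expose.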

Paper Structure (25 sections, 3 equations, 7 figures, 2 tables)

Introduction
Related Work
Structured Evaluation Framework
The spectrum of structure
Baselines
Model-based / surrogate metrics
Experiment
Evaluation Results
Conclusion
Metrics
Marginal
Pairwise
LOO
Full joint
Missing
...and 10 more sections

Figures (7)

Figure 1: (A) A modern taxonomy of the evaluation metrics. Some of the naming conventions follow dankar2022multi. (B) A structured framework of evaluation metrics. The metrics are positioned along a spectrum of structure, depending on the structure of the distribution they target. This spectrum applies to both model-free and model-based metrics. The bottom row shows where the metrics depicted in (A) are repositioned in the new framework.
Figure 2: Structured evaluation on datasets of different sizes. The scores shown are the model-free and PCC-based scores. The error bars are combined assuming independence. For the census data, we omitted DDPM and GReaT for their poor quality and computational cost, respectively.
Figure S1: Model-free and PCC-based evaluation on the student dataset. (A) Model-free evaluation of the synthesizers; (B) PCC-based evaluation of the synthesizers; (C) model-free evaluation of the baselines; and (D) PCC-based evaluation of the baselines. The x-axis shows the metric groups by substructure. The y-axis is the average metric score of the metrics in the group. Error bars are standard errors gathered across synthetic datasets.
Figure S2: Model-free and PCC-based evaluation on the expedia dataset. See caption of Figure S1.
Figure S3: Model-free and PCC-based evaluation on the census dataset. See caption of Figure S1.
...and 2 more figures
