Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework

Andrey Sidorenko, Michael Platzer, Mario Scriminaci, Paul Tiwald

TL;DR

The paper addresses the challenge of evaluating synthetic tabular data by balancing fidelity and privacy. It introduces a holdout-based framework with three metric families (Accuracy, Centroid Similarity, and Distances) that jointly assess low- and high-dimensional fidelity, embedding-based distributional similarity, and novelty. By handling mixed-type and sequential/contextual data, and by providing open-source tooling (mostlyai-qa), the approach enables reproducible benchmarking and interpretable quality diagnostics. The framework supports comparisons across synthesizers and clarifies the trade-offs between utility and privacy, giving researchers and practitioners a standardized way to evaluate synthetic data pipelines. On holdout-based benchmarks, $(1,1)$ serves as the north-star reference.

Abstract

Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at https://github.com/mostly-ai/mostlyai-qa.
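The holdout-based use of nearest-neighbor distances mentioned in the abstract can be illustrated with a small sketch. This is not the mostlyai-qa API: the function names `nearest_distances` and `dcr_share`, the use of plain Euclidean distance on numeric arrays, and the equally sized training and holdout sets are all assumptions made for illustration. The idea is to check, for each synthetic record, whether its closest real record comes from the training set or from a holdout set the generator never saw. A share near 0.5 suggests the synthetic data is no closer to training than to unseen data, while a share near 1.0 hints at memorization.

```python
import numpy as np

def nearest_distances(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def dcr_share(synthetic, training, holdout):
    """Fraction of synthetic rows that lie closer to training than to holdout.

    Ties count as half, so identical distances do not bias the share.
    Hypothetical helper for illustration only.
    """
    d_train = nearest_distances(synthetic, training)
    d_hold = nearest_distances(synthetic, holdout)
    closer = (d_train < d_hold).sum()
    ties = (d_train == d_hold).sum()
    return (closer + 0.5 * ties) / len(synthetic)

# Toy data: training and holdout are drawn from the same distribution.
rng = np.random.default_rng(0)
training = rng.normal(size=(500, 4))
holdout = rng.normal(size=(500, 4))

# A "well-behaved" generator: fresh samples from the same distribution.
fresh = rng.normal(size=(500, 4))

# A "memorizing" generator: training rows with negligible noise added.
copied = training + rng.normal(scale=1e-3, size=training.shape)

share_fresh = dcr_share(fresh, training, holdout)
share_copied = dcr_share(copied, training, holdout)
print(f"fresh samples:  share closer to training = {share_fresh:.2f}")
print(f"copied samples: share closer to training = {share_copied:.2f}")
```

In this toy setup the fresh samples should land close to the 0.5 reference and the near-copies close to 1.0. Real mixed-type tables would first need an encoding or embedding step before distances are meaningful, which this sketch leaves out.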

Paper Structure

This paper contains 9 sections, 13 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An example of metrics summary generated by the framework.
  • Figure 2: An example of univariate distributions and their accuracies generated by the framework.
  • Figure 3: Bivariate distributions and their accuracies generated by the framework.
  • Figure 4: An example of coherence distributions and their accuracies generated by the framework.
  • Figure 5: Similarity within PCA-projected embedding space generated by the framework.
  • ...and 2 more figures