
Evaluating Synthetic Tabular Data Generated To Augment Small Sample Datasets

Javier Marin

TL;DR

This work tackles the challenge of evaluating synthetic tabular data generated from very small samples. It compares traditional global metrics (pMSE, MMD, cluster analysis) with topology-based summaries from persistence diagrams, and introduces a normalized Bottleneck distance metric $M_B$ intended to bound topological similarity on a [0,1] scale. Across four low-sample datasets, the study reveals substantial inconsistencies between distributional metrics and topological measures, and it highlights instability in the proposed $M_B$ normalization, suggesting that no single metric reliably captures both distributional and structural similarity. The results argue for a multi-faceted evaluation framework and lay groundwork for integrating topological insights into synthetic data validation and potentially into GAN training, particularly for small datasets.

Abstract

This work proposes a method to evaluate synthetic tabular data generated to augment small sample datasets. While data augmentation techniques can increase sample counts for machine learning applications, traditional validation approaches fail when applied to extremely limited sample sizes. Our experiments across four datasets reveal significant inconsistencies between global metrics and topological measures, with statistical tests producing unreliable significance values due to insufficient sample sizes. We demonstrate that common metrics such as propensity scoring and MMD often suggest similarity where fundamental topological differences exist. Our proposed metric, based on a normalized Bottleneck distance, provides complementary insights but suffers from high variability across experimental runs and occasionally yields values exceeding its theoretical bounds, showing inherent instability in topological approaches for very small datasets. These findings highlight the critical need for multi-faceted evaluation methodologies when validating synthetic data generated from limited samples, as no single metric reliably captures both distributional and structural similarity.

Paper Structure (19 sections, 23 equations, 2 figures, 8 tables, 1 algorithm)

Figures (2)

  • Figure 1: Model convergence comparison using a WGAN for small and large training samples. Wasserstein GAN model trained with batch size $M = 10$, $n_{critic} = 1$, clipping parameter $c = 0.01$, $\alpha = 10^{-5}$, and RMSprop optimizer. Left plot: dataset with 9 samples. Right plot: dataset with 1055 samples.
  • Figure 2: The Rips complex on a point cloud $VR(X, \epsilon)$. a) A set of point cloud data $X$ b) Vietoris-Rips complex for $X$, $VR(X, \epsilon)$ c) $\mathbb{R}^2$ persistence simplicial complex as a union of points, intervals, triangles, and higher-dimensional analogues d) Betti numbers $\beta_0$ and $\beta_1$ for $H_0$ and $H_1$.
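
To make the Figure 2 caption concrete, the following is a minimal sketch of how $\beta_0$ (the number of connected components) could be counted for the 1-skeleton of $VR(X, \epsilon)$. It is not the paper's implementation. The function name `betti0_rips`, the toy point cloud, and the convention that two points are joined when their Euclidean distance is at most $\epsilon$ (some texts use $2\epsilon$) are all assumptions made for illustration.

```python
import numpy as np

def betti0_rips(points, eps):
    """Count connected components of the Rips graph at scale eps.

    Assumed convention: an edge joins two points when their
    Euclidean distance is <= eps. Expects a 2-D array (n_points, n_dims).
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))  # union-find forest, one root per component

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Full pairwise Euclidean distance matrix (fine for small samples).
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if dists[i, j] <= eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj  # merge the two components
    return len({find(i) for i in range(n)})

# Hypothetical toy cloud: a tight group of three points and a distant pair.
cloud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
for eps in (0.5, 1.1, 10.0):
    print(eps, betti0_rips(cloud, eps))
```

Sweeping `eps` from small to large shows components merging, which is the $H_0$ information a persistence diagram records. Counting $\beta_1$ requires the higher-dimensional simplices and is better left to a dedicated library.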

Theorems & Definitions (17)

  • Definition 2.1
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Definition 3.6
  • Definition 3.7
  • Definition 3.8
  • Definition 3.9
  • ...and 7 more