Targeted synthetic data generation for tabular data via hardness characterization

Tommaso Ferracci, Leonie Tabea Goldmann, Anton Hinel, Francesco Sanna Passino

TL;DR

Problem: improving generalization in tabular binary classification under data limitations by using synthetic data.
Approach: a two-step pipeline that first uses KNN Shapley values for hardness characterization to identify the hardest training points, then trains synthetic data generators (TVAE/CTGAN) only on those points and augments the training set with their output.
Findings: KNN Shapley-based hardness detection is competitive with state-of-the-art methods and far cheaper to compute; targeted augmentation yields larger out-of-sample gains than non-targeted augmentation, as demonstrated on the Amex dataset, in simulations, and on UC Irvine benchmarks.
Significance: a scalable, model-agnostic framework that links data valuation to targeted data generation, enabling more data-efficient learning in tabular domains; reproducibility resources are provided.

Abstract

Data augmentation via synthetic data generation has been shown to be effective in improving model performance and robustness when data are scarce or of low quality. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a simple, computationally efficient augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate empirically, via benchmarks on real data, that Shapley-based data valuation methods perform comparably to learning-based methods in hardness characterization tasks, while offering significant computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on a number of tabular datasets. Our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
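
The first step of the pipeline, scoring training points and keeping the hardest ones, can be illustrated with a minimal sketch. It uses the closed-form recursion for unweighted KNN Shapley values from Jia et al. (2019). It is not the authors' implementation: the function names `knn_shapley` and `hardest_indices`, the Euclidean distance, the choice `k=5`, the use of a held-out validation set, the toy Gaussian data, and the convention that the lowest-valued points are the "hardest" are all assumptions made for illustration. The generator step (TVAE/CTGAN) is deliberately left out.

```python
import numpy as np


def knn_shapley(X_train, y_train, X_val, y_val, k=5):
    """Average KNN Shapley value of each training point over a validation set.

    Hypothetical helper: applies the exact recursion for the unweighted
    KNN classifier utility, one validation point at a time.
    """
    n = len(X_train)
    values = np.zeros(n)
    for x, y in zip(X_val, y_val):
        # Rank training points from nearest to farthest (Euclidean, assumed).
        dist = np.linalg.norm(X_train - x, axis=1)
        order = np.argsort(dist, kind="stable")
        match = (y_train[order] == y).astype(float)

        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n  # farthest point
        # Walk inwards; 0-based position i corresponds to rank i + 1.
        for i in range(n - 2, -1, -1):
            rank = i + 1
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(X_val)


def hardest_indices(values, fraction=0.1):
    """Indices of the lowest-valued training points (assumed to be the hardest)."""
    n_hard = max(1, int(round(fraction * len(values))))
    return np.argsort(values)[:n_hard]


# Toy data (assumption): two overlapping Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

values = knn_shapley(X_train, y_train, X_val, y_val, k=5)
hard_idx = hardest_indices(values, fraction=0.1)

# These rows would be the training set for the synthetic data generator.
X_hard, y_hard = X_train[hard_idx], y_train[hard_idx]
```

In the targeted setting described above, a generator would then be fitted on `X_hard` and `y_hard` only, and its samples appended to the original training data. A useful sanity check on the scores is the Shapley efficiency property: the values sum to the average KNN utility on the validation set.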

Paper Structure

This paper contains 22 sections, 3 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: Visual representation of the proposed targeted synthetic data generation pipeline.
  • Figure 2: Results for the experiment in Section \ref{sec:toy_example}, based on a mixture of two normals with unit variance and means $-1$ and $1$, and training data $\{(-1,0),(1,1),(x_{\mathrm{train}},0)\}$.
  • Figure 3: Gini scores under different synthetic data augmentation regimes on the bivariate Gaussian simulated data, with scatterplots of observations and 5% hardest data points.
  • Figure 3: TVAE, hard 10%: tuning setup.
  • Figure 4: Distribution of the 100NN data Shapley scores for different Data-IQ tags.
  • ...and 15 more figures