Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations

Yueyang Shen, Agus Sudjianto, Arun Prakash R, Anwesha Bhattacharyya, Maorong Rao, Yaqun Wang, Joel Vaughan, Nengfeng Zhou

TL;DR

A minimalist approach to synthetic tabular data generation is proposed and applied to robustness testing. The approach is simple, interpretable end to end, requires no extra tuning, and provides unique benefits.

Abstract

We propose and study a minimalist approach to synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with a contingent clustering step or log transformation to handle nonlinearity) and an XGBoost decoder, which is state of the art for regression and classification on structured data. We study and contrast the method with (variational) autoencoders in several low-dimensional toy scenarios to build the necessary intuition. The framework is then applied to high-dimensional simulated credit scoring data that parallels real-life financial applications. To demonstrate a practical use case, we apply the method to robustness testing. The case study results suggest that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, interpretable end to end, requires no extra tuning, and provides unique benefits.
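The encoder/decoder structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's `SparsePCA` as the encoder, uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so the example needs only one library, and runs on hypothetical noisy half-circle toy data. The latent jitter used to produce synthetic rows, and all hyperparameters, are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy half circle, theta in [0, pi].
theta = rng.uniform(0.0, np.pi, size=500)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += rng.normal(scale=0.02, size=X.shape)

# Encoder: unsupervised SparsePCA down to a 1-D latent code.
# alpha is an illustrative choice, not a value from the paper.
encoder = SparsePCA(n_components=1, alpha=0.1, random_state=0)
Z = encoder.fit_transform(X)

# Decoder: boosted trees mapping latent codes back to the original features.
# GradientBoostingRegressor is a stand-in for XGBoost (an assumption made
# here to avoid an extra dependency); one model is fitted per output column.
decoder = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
decoder.fit(Z, X)

# Reconstruction error on the training data.
X_rec = decoder.predict(Z)
recon_mse = float(np.mean((X - X_rec) ** 2))

# Synthetic rows: jitter the latent codes slightly, then decode.
# The jitter scale is an illustrative assumption.
Z_new = Z + rng.normal(scale=0.05 * Z.std(), size=Z.shape)
X_syn = decoder.predict(Z_new)

print(f"reconstruction MSE: {recon_mse:.5f}")
print("first synthetic rows:\n", X_syn[:3])
```

Because every step is either a sparse linear projection or a tree ensemble on a low-dimensional latent code, each stage can be inspected directly, which is the sense in which the abstract describes the pipeline as interpretable.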

Paper Structure

This paper contains 33 sections, 13 equations, 17 figures, 2 tables, 1 algorithm.

Figures (17)

  • Figure 1: The workflow and landscape of latent representation learning for synthetic data generation.
  • Figure 2: The robustness analysis pipeline, designed to keep perturbation computationally tractable and to facilitate robustness analysis.
  • Figure 3: Schematic for the half-circle example. The red dashed line $f(z)$ represents the underlying ground truth. The black scatter points are the observed data. The blue distribution represents the distributional model for $p(x\lvert z)$. The orange line illustrates the mechanism of SparsePCA, which projects the half-circle data onto the $x$-axis, and the green dashed line represents the XGBoost reconstruction. The cyan line $\hat{f}(z)$ is the fitted mapping from the latent space to the original data space.
  • Figure 4: (Left) A failure mode for autoencoders: $\theta$ is not sampled uniformly from 0 to $\pi$. In the example shown, the region $\theta\in(\frac{\pi}{3},\frac{2\pi}{3})$ is sparsely sampled. In this case, autoencoders underperform the proposed method. (Right) Possible low-dimensional failure modes of the pipeline, likely caused by the linear subspace encoder learned from PCA. As the circle and cuboid examples illustrate, the discarded singular dimension contains significant information in both cases. In the compactified circle example, the $y$ direction carries the same information as $x$, and discarding either $x$ or $y$ destroys the structure of the data in reconstruction. In the cuboid (surface) example, the smallest dimension, $x$, is projected out, which leads to the misreconstruction of four facets. The left and right sides collapse onto one plane, and the lower facets collapse into a line.
  • Figure 5: The flow-based landscape of latent representation learning for synthetic data generation.
  • ...and 12 more figures