A supervised generative optimization approach for tabular data

Shinpei Nakamura-Sakai, Fadi Hamad, Saheed Obitayo, Vamsi K. Potluru

TL;DR

The paper tackles synthetic data generation for tabular data with a focus on downstream-task performance, addressing the limitations of unsupervised methods. It introduces SC-GOAT, a two-phase framework that (i) supervises individual synthesizers via task-aware hyperparameter tuning and (ii) learns a mixture of synthesizers through meta-learning to optimize downstream loss. Through experiments on Adult and credit-card datasets, SC-GOAT generally outperforms state-of-the-art baselines, demonstrating the value of task-aligned data synthesis and mixture modeling. The work suggests promising practical impact for privacy-preserving data augmentation and downstream predictive tasks, with future directions including privacy fidelity assessments and broader applicability to augmentation scenarios.

Abstract

Synthetic data generation has emerged as a crucial topic for financial institutions, driven by factors such as privacy protection and data augmentation. Many algorithms have been proposed for synthetic data generation, but reaching a consensus on which method to use for a specific data set and use case remains challenging. Moreover, the majority of existing approaches are "unsupervised" in the sense that they do not take the downstream task into account. To address these issues, this work presents a novel synthetic data generation framework. The framework integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions.
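
The mixture idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two toy synthesizers (`good_synth`, `noisy_synth`), the one-feature threshold classifier, and the use of plain random search over mixture weights are all assumptions chosen to keep the example self-contained. The sketch only shows the general shape of the loop: sample synthetic rows from a weighted mixture of synthesizers, fit a downstream model on them, score it on held-out real data, and keep the weights with the lowest downstream loss.

```python
import random


def good_synth(rng):
    """Hypothetical synthesizer whose feature carries label signal."""
    y = rng.randint(0, 1)
    return (rng.gauss(2.0 * y, 1.0), y)


def noisy_synth(rng):
    """Hypothetical synthesizer whose feature is independent of the label."""
    y = rng.randint(0, 1)
    return (rng.gauss(0.0, 1.0), y)


def sample_mixture(synthesizers, weights, n, rng):
    """Draw n rows; each row comes from synthesizer k with probability weights[k]."""
    picks = rng.choices(range(len(synthesizers)), weights=weights, k=n)
    return [synthesizers[k](rng) for k in picks]


def fit_threshold(rows):
    """Toy downstream model: predict 1 when x exceeds the midpoint of the class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        return 0.0
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def error_rate(threshold, rows):
    """Downstream loss: fraction of rows misclassified by the threshold rule."""
    return sum((x > threshold) != (y == 1) for x, y in rows) / len(rows)


def random_simplex_point(k, rng):
    """Random non-negative weights that sum to 1 (normalized exponential draws)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def search_mixture(synthesizers, validation, n_rows=500, n_trials=30, seed=0):
    """Random search (a stand-in optimizer) for weights minimizing downstream loss."""
    rng = random.Random(seed)
    best_weights, best_loss = None, float("inf")
    for _ in range(n_trials):
        weights = random_simplex_point(len(synthesizers), rng)
        synthetic = sample_mixture(synthesizers, weights, n_rows, rng)
        loss = error_rate(fit_threshold(synthetic), validation)
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights, best_loss


# Held-out "real" data, here simulated from the informative distribution.
validation = [good_synth(random.Random(i)) for i in range(200)]
weights, loss = search_mixture([good_synth, noisy_synth], validation)
print("mixture weights:", weights, "validation error:", loss)
```

In a realistic setting the toy synthesizers would be replaced by fitted tabular generators and the threshold rule by the actual downstream model and metric; the search strategy could likewise be swapped for a more sample-efficient optimizer.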

Paper Structure

This paper contains 18 sections, 5 equations, 1 figure, 5 tables, and 2 algorithms.

Figures (1)

  • Figure 1: Average downstream test AUC over 10 experiments, using XGBoost fitted on the data generated by each model in the untuned setup.