Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class

Annie D'souza, Swetha M, Sunita Sarawagi

TL;DR

This work tackles imbalanced classification on tabular data by showing that state-of-the-art deep generative models produce substantially poorer minority samples than majority samples when trained on imbalanced data. It introduces ORD, a three-way labeling strategy that identifies an overlap region $D_{01}$, trains conditional generators on the ternary labels, and trains classifiers on balanced synthetic data comprising minority $D_1$ and clear majority $D_{00}$ while discarding $D_{01}$. ORD improves both the quality of synthetic minority samples and classifier performance, outperforming multiple baselines across eight real datasets and five generators/classifiers, with statistical significance. The approach is orthogonal to the generator type and scales to diverse tabular settings, offering a practical route to more accurate imbalanced classification in real-world applications.

Abstract

Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is to augment the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, higher-capacity deep generative models have recently shown greater promise. However, handling imbalance in class distribution when building a deep generative model is also a challenging problem that has not been studied as extensively as imbalanced classifier training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique that converts the binary class labels to ternary class labels by introducing a class for the region where the minority and majority distributions overlap. We show that this pre-processing of the training set alone significantly improves the quality of generated data across several state-of-the-art diffusion and GAN-based models. When training the classifier on synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.
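The binary-to-ternary relabeling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the k-nearest-neighbour scorer, the function names `knn_minority_prob` and `ternary_labels`, the `threshold` value, and the rule "a majority point joins $D_{01}$ when a model that never saw it gives it a high minority probability" are all assumptions made for illustration.

```python
import random

def knn_minority_prob(train, x, k=5):
    """Fraction of the k nearest training points (squared Euclidean) labeled 1."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)

def ternary_labels(X, y, n_folds=5, threshold=0.3, seed=0):
    """Relabel binary data as 'D1', 'D00', or 'D01' using out-of-fold scores.

    Assumption (not from the paper): a majority point is placed in the
    overlap class D01 when its out-of-fold minority probability is at
    least `threshold`; all minority points keep the label D1.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    labels = [None] * len(X)
    for fold in folds:
        held_out = set(fold)
        # Score each held-out point with a model fitted on the other folds.
        train = [(X[i], y[i]) for i in idx if i not in held_out]
        for i in fold:
            if y[i] == 1:
                labels[i] = "D1"
            else:
                p = knn_minority_prob(train, X[i])
                labels[i] = "D01" if p >= threshold else "D00"
    return labels

# Toy imbalanced 1D data: 200 majority points and 20 minority points.
rng = random.Random(1)
X = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
X += [(rng.gauss(2.0, 1.0),) for _ in range(20)]
y = [0] * 200 + [1] * 20

labels = ternary_labels(X, y)
print({name: labels.count(name) for name in ("D00", "D01", "D1")})
```

The ternary labels would then replace the binary labels when training a class-conditional generator; the generator itself is outside the scope of this sketch.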

Paper Structure

This paper contains 53 sections, 5 equations, 14 figures, 20 tables, 1 algorithm.

Figures (14)

  • Figure 1: The ORD method and its application to synthetic data generation. The synthetic data is then used to train a better classifier than is obtained without ORD. Steps in the overall algorithm: 1. Use k-fold training to identify the overlap in the validation set; this uncertain region is labelled as a third class, $D_{01}$. 2. Generate synthetic data from the three-class dataset instead of the binary one. The synthetic data quality is much better because the generative models reproduce the distribution more faithfully. 3. Train the final classifier with equal proportions of majority and minority synthetic data while discarding the overlap region $D_{01}$, which makes the decision boundary easier to learn.
  • Figure 2: CTabSyn for class-conditional tabular data generation. We needed to make only a small change (highlighted in green) to the existing TabSyn model to improve generation quality under imbalanced class distributions and to benefit from our finer-grained class labels. Conditional diffusion is implemented by adding the true target embedding as input to the denoiser for efficient sampling.
  • Figure 3: Visualisation of synthetic data for ORD on 2D datasets. The first row shows data synthesized with ORD for different synthesizers, with the overlap class $D_{01}$ clearly indicated. The second row zooms in on the sampled minority only, with wrong generations from generators without ORD marked in pink. The third row shows the same data but with ORD-enhanced generators. The columns correspond to different generators. Particularly for the diffusion models (last two columns), ORD yields far fewer errors in generated examples than the baseline diffusion models, which are already of much higher quality than the earlier GAN-based generators in columns 2 and 3.
  • Figure 4: MLE on the Adult dataset with varying thresholds during ORD
  • Figure 5: MLE on the cardio dataset with varying thresholds during ORD
  • ...and 9 more figures
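
Step 3 of the Figure 1 caption (train on equal proportions of synthetic minority and clear-majority data, never using the overlap class) can be sketched as follows. This is only an illustration under stated assumptions: `sample_conditional` is a hypothetical stand-in for a trained class-conditional generator, and its Gaussian centers are invented toy values, not results from the paper.

```python
import random

def sample_conditional(label, n, rng):
    """Hypothetical stand-in for a trained class-conditional generator.

    A real pipeline would sample from a conditional diffusion or GAN model;
    here each ternary label simply maps to a toy 1D Gaussian.
    """
    centers = {"D00": 0.0, "D01": 1.5, "D1": 3.0}  # invented toy values
    return [(rng.gauss(centers[label], 0.5),) for _ in range(n)]

def build_classifier_training_set(n_per_class, seed=0):
    """Balanced binary training set built from synthetic D1 and D00 only."""
    rng = random.Random(seed)
    # Minority samples keep binary label 1.
    rows = [(x, 1) for x in sample_conditional("D1", n_per_class, rng)]
    # Clear-majority samples map back to binary label 0.
    rows += [(x, 0) for x in sample_conditional("D00", n_per_class, rng)]
    # The overlap class D01 is never requested, so it is discarded by design.
    rng.shuffle(rows)
    return rows

train_rows = build_classifier_training_set(n_per_class=100)
n_pos = sum(label for _, label in train_rows)
print(len(train_rows), n_pos)
```

Because the generator is conditioned on the ternary label, balancing reduces to requesting the same number of samples for $D_1$ and $D_{00}$; any binary classifier can then be fitted on `train_rows`.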