Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study

Emmanouil Panagiotou, Arjun Roy, Eirini Ntoutsi

TL;DR

This study tackles bias arising from class and group imbalance in tabular data by evaluating model-independent generative data augmentation methods. It compares SDV-GC, CTGAN, TVAE, CART, and SMOTE-NC across four real-world datasets using four sampling strategies, with XGBoost as the downstream classifier. Key findings show that the non-parametric CART method often delivers the best fairness-utility trade-offs and requires relatively little augmentation, while class-ratio sampling can boost fairness with fewer synthetic samples. The work demonstrates the potential of generative tabular data to mitigate bias, provides practical guidance on sampling strategies, and makes the codebase publicly available for reproducibility and further research.

Abstract

Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, few methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.
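To make the notion of coinciding class and group imbalance concrete, the following minimal sketch counts every (group, class) subgroup in a toy dataset and reports how many extra samples each would need to match the largest subgroup. It is an illustration only and not one of the paper's four sampling strategies: the records, the column names `sex` and `income`, and the helper `subgroup_deficits` are all hypothetical.

```python
from collections import Counter


def subgroup_deficits(rows, group_key, class_key):
    """Return, for each observed (group, class) subgroup, the number of
    samples it lacks relative to the largest subgroup.

    Subgroups that never occur in `rows` are not listed.
    """
    counts = Counter((row[group_key], row[class_key]) for row in rows)
    target = max(counts.values())
    return {subgroup: target - n for subgroup, n in counts.items()}


# Hypothetical toy data: the positive class is rarer overall (class imbalance),
# women are rarer overall (group imbalance), and the two effects compound
# in the (female, >50K) subgroup.
rows = (
    [{"sex": "male", "income": "<=50K"}] * 10
    + [{"sex": "male", "income": ">50K"}] * 6
    + [{"sex": "female", "income": "<=50K"}] * 5
    + [{"sex": "female", "income": ">50K"}] * 1
)

deficits = subgroup_deficits(rows, "sex", "income")
for subgroup, missing in sorted(deficits.items()):
    print(subgroup, "needs", missing, "additional samples")
```

In an augmentation setting, deficits of this kind could be filled with synthetic rows from any generator rather than with duplicated or interpolated real rows; how the paper's strategies actually set their targets is not specified in this section.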

Paper Structure

This paper contains 17 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Distributions of class and group imbalance for each real dataset (first column), along with the final augmented dataset for each sampling strategy.
  • Figure 2: Sex, race, and class subgroup percentage distributions of the adult dataset.