Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study
Emmanouil Panagiotou, Arjun Roy, Eirini Ntoutsi
TL;DR
This study tackles bias arising from class and group imbalance in tabular data by evaluating model-independent generative data augmentation methods. It compares SDV-GC, CTGAN, TVAE, CART, and SMOTE-NC across four real-world datasets using four sampling strategies, with XGBoost as the downstream classifier. Key findings show that the non-parametric CART method often delivers the best fairness-utility trade-offs and requires relatively little augmentation, while class-ratio sampling can boost fairness with fewer synthetic samples. The work demonstrates the potential of generative tabular data to mitigate bias, provides practical guidance on sampling strategies, and makes the codebase publicly available for reproducibility and further research.
Abstract
Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target) and group imbalance (in protected attributes like sex or race) can undermine both ML utility and fairness. Although class and group imbalances commonly coincide in real-world tabular datasets, few methods address this scenario. While most methods use oversampling techniques, like interpolation, to mitigate imbalances, recent advancements in synthetic tabular data generation offer promise but have not been adequately explored for this purpose. To this end, this paper conducts a comparative analysis to address class and group imbalances using state-of-the-art models for synthetic tabular data generation and various sampling strategies. Experimental results on four datasets demonstrate the effectiveness of generative models for bias mitigation, creating opportunities for further exploration in this direction.
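To make the distinction between class imbalance and combined class-and-group imbalance concrete, the following is a minimal sketch and not the paper's implementation. It counts how many synthetic rows each subgroup would need to match the largest subgroup, first by class label alone and then by the combination of protected attribute and class label. The helper name `synthetic_counts_needed`, the toy records, and the column names `sex` and `label` are all hypothetical, and the "match the largest subgroup" target is an assumed balancing rule that may differ from the sampling strategies evaluated in the study.

```python
from collections import Counter


def synthetic_counts_needed(rows, keys):
    """Count the rows each subgroup lacks relative to the largest subgroup.

    Hypothetical helper: `keys` selects the columns that define a subgroup,
    and the balancing target (size of the largest subgroup) is an assumption.
    """
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}


# Toy records (illustrative only): 20 rows, imbalanced in both class and group.
rows = (
    [{"sex": "M", "label": 1}] * 6
    + [{"sex": "M", "label": 0}] * 8
    + [{"sex": "F", "label": 1}] * 2
    + [{"sex": "F", "label": 0}] * 4
)

# Balance the class label only.
by_class = synthetic_counts_needed(rows, ["label"])

# Balance every (protected attribute, class label) subgroup.
by_class_and_group = synthetic_counts_needed(rows, ["sex", "label"])

print(by_class)
print(by_class_and_group)
```

In this toy example, balancing by class alone asks for 4 synthetic rows, while balancing every subgroup asks for 12, which illustrates why the choice of sampling strategy affects how much augmentation is required. In a full pipeline, a generator fitted on the real data would supply the missing rows before the downstream classifier is trained.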
