Rebalancing the Scales: A Systematic Mapping Study of Generative Adversarial Networks (GANs) in Addressing Data Imbalance

Pankaj Yadav, Gulshan Sihag, Vivek Vijay

TL;DR

This paper presents a systematic mapping of GAN-based sampling techniques for imbalanced data, surveying 3041 papers and narrowing them to 100 key studies across healthcare, finance, cybersecurity, and mixed domains. It introduces three categorization mappings (application domains, GAN techniques, and GAN variants) and analyzes data-level strategies, performance metrics, and publication venues. The findings highlight GAN-based oversampling as an effective preprocessing approach, with CTGAN and CGAN showing strong performance on structured/tabular data, and a clear uptick in hybrid methods in recent years. A notable gap is that none of the reviewed studies explicitly explores hybridized GAN frameworks with diffusion models or reinforcement learning, which points to future research directions. Overall, the work offers a roadmap for researchers and practitioners applying GANs to imbalanced data and indicates where further methodological advances are most needed.

Abstract

Machine learning algorithms are used in diverse domains, many of which face significant challenges due to data imbalance. Studies have explored various approaches to address the issue, such as data preprocessing, cost-sensitive learning, and ensemble methods. Generative Adversarial Networks (GANs) have shown immense potential as a data preprocessing technique that generates good-quality synthetic data. This study employs a systematic mapping methodology to analyze 3041 papers on GAN-based sampling techniques for imbalanced data, sourced from four digital libraries. A filtering process identified 100 key studies spanning domains such as healthcare, finance, and cybersecurity. Through comprehensive quantitative analysis, this research introduces three categorization mappings: application domains, GAN techniques, and GAN variants used to handle the imbalanced nature of the data. GAN-based over-sampling emerges as an effective preprocessing method. Advanced architectures and tailored frameworks have helped GANs improve further on imbalanced data. GAN variants such as vanilla GAN, CTGAN, and CGAN show great adaptability in structured imbalanced data cases. Interest in GANs for imbalanced data has grown tremendously, peaking in recent years, with journals and conferences playing crucial roles in disseminating foundational theories and practical applications. Despite these advances, none of the reviewed studies explicitly explores hybridized GAN frameworks with diffusion models or reinforcement learning techniques. This gap points to a future research direction: developing innovative approaches for effectively handling data imbalance.

Paper Structure

This paper contains 17 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An overview of the workflow of imbalanced data and GANs
  • Figure 2: PRISMA
  • Figure 3: Categorization of variants introduced or analyzed in the examined articles by their corresponding keys
  • Figure 4: Categorization of studies and publications introduced or analyzed in the examined articles