Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification

Sukumar Kishanthan, Asela Hevapathige

TL;DR

This paper introduces AutoSMOTE, a deep learning framework that integrates synthetic minority oversampling into the training loop by learning a set of discrete decision criteria to govern data generation. The approach treats oversampling as an end-to-end process, using differentiable approximations (via MLPs and Gumbel-Softmax) to map data points to synthetic samples through multiple decision criteria (participation, $k$-NN, and aggregation function). The authors provide theoretical analysis based on universal approximation and VC-dimension to justify the design choices and compare two variants, AutoSMOTE$_{self}$ and AutoSMOTE$_{cohort}$, with AutoSMOTE$_{cohort}$ often delivering the best generalization and empirical performance. Extensive experiments across eight imbalanced datasets show superior performance against traditional and deep-learning baselines in terms of precision, recall, and F1, while maintaining reasonable training efficiency. The work lays a foundation for more interpretable and flexible oversampling strategies and suggests avenues for extending discrete decision criteria and grouping strategies in imbalanced learning.
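The Gumbel-Softmax relaxation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the candidate aggregation functions (mean, max, min), the helper names `gumbel_softmax` and `soft_aggregate`, and the temperature value are all assumptions chosen for illustration, and the logits are fixed to zeros where a trained model would presumably produce them with an MLP.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return a relaxed one-hot sample over the last axis of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
    # clipping keeps both logarithms finite.
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1.0 - 1e-12)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical candidate aggregation functions over a neighbourhood
# (rows are neighbouring points); the paper's actual choices may differ.
AGGREGATORS = [
    lambda nbrs: nbrs.mean(axis=0),
    lambda nbrs: nbrs.max(axis=0),
    lambda nbrs: nbrs.min(axis=0),
]

def soft_aggregate(neighbours, logits, tau=0.5, rng=None):
    """Blend the candidate aggregations with relaxed one-hot weights."""
    weights = gumbel_softmax(logits, tau=tau, rng=rng)               # shape (3,)
    candidates = np.stack([agg(neighbours) for agg in AGGREGATORS])  # shape (3, d)
    return weights @ candidates, weights

rng = np.random.default_rng(0)
neighbours = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
logits = np.zeros(3)  # assumption: a trained MLP would output these scores
synthetic, weights = soft_aggregate(neighbours, logits, rng=rng)
print("selection weights:", weights)
print("synthetic sample:", synthetic)
```

Lowering `tau` pushes the weights toward a hard one-hot choice, while the weighted sum stays differentiable with respect to the logits, which is what allows a discrete decision of this kind to be trained jointly with a classifier.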

Abstract

Despite extensive research spanning several decades, class imbalance is still considered a profound difficulty for both machine learning and deep learning models. While data oversampling is the foremost technique to address this issue, traditional sampling techniques are often decoupled from the training phase of the predictive model, resulting in suboptimal representations. To address this, we propose a novel learning framework that can generate synthetic data instances in a data-driven manner. The proposed framework formulates the oversampling process as a composition of discrete decision criteria, thereby enhancing the representation power of the model's learning process. Extensive experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.

Paper Structure

This paper contains 36 sections, 2 theorems, 22 equations, 5 figures, 2 tables.

Key Result

Theorem 1

(Universal Approximation Theorem) For $G \subset \mathbb{R}^n$, we define $R(G)$ as the set of all continuous functions from $G$ to $\mathbb{R}$: $R(G) = \{ f : G \to \mathbb{R} \mid f \text{ is continuous} \}$. Then, for any $f \in R(G)$ and for any $\epsilon > 0$, there exists a multi-layer percep

Figures (5)

  • Figure 1: AutoSMOTE$_{self}$
  • Figure 2: AutoSMOTE$_{cohort}$
  • Figure 4: Training Error Comparison of AutoSMOTE with MLP-Oversampler
  • Figure 5: Testing Performance Comparison of AutoSMOTE with MLP-Oversampler
  • Figure 6: Training Time Comparison for Oversampling Algorithms

Theorems & Definitions (6)

  • Definition 1: Decision criterion
  • Definition 2
  • Theorem 1
  • Definition 3: VC Dimension
  • Definition 4
  • Theorem 2