Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification
Sukumar Kishanthan, Asela Hevapathige
TL;DR
This paper introduces AutoSMOTE, a deep learning framework that integrates synthetic minority oversampling into the training loop by learning a set of discrete decision criteria that govern data generation. The approach treats oversampling as an end-to-end process: differentiable approximations (MLPs and Gumbel-Softmax) map data points to synthetic samples through three decision criteria (participation, $k$-NN, and aggregation function). The authors provide theoretical analysis based on universal approximation and VC-dimension to justify the design choices and compare two variants, AutoSMOTE$_{self}$ and AutoSMOTE$_{cohort}$, with AutoSMOTE$_{cohort}$ often delivering the best generalization and empirical performance. Experiments across eight imbalanced datasets show higher precision, recall, and F1 than traditional and deep-learning baselines, while maintaining reasonable training efficiency. The work lays a foundation for more interpretable and flexible oversampling strategies and suggests avenues for extending discrete decision criteria and grouping strategies in imbalanced learning.
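To illustrate how a discrete decision can be kept differentiable, the sketch below applies the standard Gumbel-Softmax relaxation to a single categorical choice. It is not the authors' implementation: the function name `gumbel_softmax`, the candidate values in `k_candidates`, and the fixed `logits` (which in a learned model would presumably come from an MLP) are all assumptions made for this example.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot vector over len(logits) discrete options.

    Lower temperatures (tau) push the output closer to a hard one-hot choice.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise: g = -log(-log(u)).
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical setup: choose a neighborhood size k from three candidates.
rng = random.Random(0)
k_candidates = [3, 5, 7]      # illustrative values, not taken from the paper
logits = [0.2, 1.5, -0.3]     # placeholder scores; an MLP would produce these

weights = gumbel_softmax(logits, tau=0.5, rng=rng)

# Soft (differentiable) expectation over candidates, and the hard argmax choice.
soft_k = sum(w * k for w, k in zip(weights, k_candidates))
hard_k = k_candidates[max(range(len(weights)), key=weights.__getitem__)]
```

The soft weights are what would let gradients flow to whatever produces the logits, while the argmax gives the discrete decision actually used to generate a sample. The same pattern could in principle be applied to each criterion (participation, $k$-NN, aggregation function), although how the paper wires them together is not specified here.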
Abstract
Despite decades of research, class imbalance remains a major challenge for both machine learning and deep learning models. While data oversampling is the foremost technique for addressing this issue, traditional sampling techniques are often decoupled from the training phase of the predictive model, resulting in suboptimal representations. To address this, we propose a novel learning framework that generates synthetic data instances in a data-driven manner. The proposed framework formulates the oversampling process as a composition of discrete decision criteria, thereby enhancing the representation power of the model's learning process. Extensive experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.
