Restoring balance: principled under/oversampling of data for optimal classification
Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson
TL;DR
This work develops a principled, high-dimensional framework for understanding how linear classifiers learn under class imbalance, by applying the replica method to discrete multi-state inputs. It derives exact asymptotic expressions for generalization metrics, namely accuracy (ACC), balanced accuracy (BA), and area under the ROC curve (AUC), in terms of class sizes and the first and second moments of the data, and shows that mixed under/oversampling can optimally restore balance depending on data statistics. The findings reveal a balance-to-performance trade-off driven by the margin parameter and data geometry, with BA serving as a robust indicator of balanced performance and AUC largely insensitive to imbalance. The authors validate their theory on real datasets and extend the approach with practical sampling strategies, including RBM-based likelihood-informed sampling and deep-network experiments, highlighting potential pathways for improving imbalanced classification in practice.
Abstract
Class imbalance in real-world data is a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data according to class abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, the first and second moments of the data, and the performance metrics considered. We show that mixed strategies involving both under- and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures, and with sampling strategies based on unsupervised probabilistic models.
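To make the idea of a mixed strategy concrete, the following is a minimal sketch, not the authors' procedure. It assumes a common target size per class chosen between the minority and majority class sizes: larger classes are randomly undersampled and smaller ones are oversampled by random duplication. The names `mixed_resample`, `balanced_accuracy`, and `n_per_class` are hypothetical and introduced only for illustration.

```python
import numpy as np

def mixed_resample(X, y, n_per_class, rng=None):
    """Resample every class to exactly n_per_class examples.

    Assumption for illustration: classes with at least n_per_class
    examples are undersampled without replacement; smaller classes keep
    all their examples and are padded with random duplicates.
    """
    rng = np.random.default_rng(rng)
    selected = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        if len(members) >= n_per_class:
            # Undersampling: keep a random subset of the abundant class.
            chosen = rng.choice(members, size=n_per_class, replace=False)
        else:
            # Oversampling: keep all originals, then add random duplicates.
            extra = rng.choice(members, size=n_per_class - len(members), replace=True)
            chosen = np.concatenate([members, extra])
        selected.append(chosen)
    # Shuffle so that classes are interleaved in the resampled set.
    idx = rng.permutation(np.concatenate(selected))
    return X[idx], y[idx]

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy (BA): the mean of the per-class recalls."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy data: 90 majority examples (label 0) and 10 minority examples (label 1).
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Mixed strategy: a target of 30 per class lies between 10 and 90.
X_bal, y_bal = mixed_resample(X, y, n_per_class=30, rng=0)

# A classifier that always predicts the majority class has high ACC
# on the original data, but its BA exposes the failure on the minority.
always_majority = np.zeros_like(y)
acc = float(np.mean(always_majority == y))
ba = balanced_accuracy(y, always_majority)
```

On this toy data the resampled set holds 30 examples of each class, and the always-majority predictor scores an ACC of 0.9 but a BA of 0.5, which illustrates why BA is the more informative metric under imbalance. How the target size should depend on the data statistics is precisely what the paper's theory addresses; the value 30 here is arbitrary.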
