Restoring balance: principled under/oversampling of data for optimal classification

Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson

TL;DR

This work develops a principled, high-dimensional framework to understand learning with class imbalance for linear classifiers by applying the replica method to discrete multi-state inputs. It derives exact asymptotic expressions for generalization metrics (e.g., ACC, BA, AUC) in terms of class sizes and the first/second moments of the data, and shows that mixed under/oversampling can optimally restore balance depending on data statistics. The findings reveal a balance-to-performance trade-off driven by the margin parameter and data geometry, with BA serving as a robust indicator of balanced performance and AUC largely insensitive to imbalance. The authors validate their theory on real datasets and extend the approach with practical sampling strategies, including RBM-based likelihood-informed sampling and deep-network experiments, highlighting potential pathways for improving imbalanced classification in practice.

Abstract

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
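
The mixed under/oversampling idea described above can be illustrated with a minimal sketch. It assumes plain random resampling (random removal of majority examples and random duplication of minority examples) and labels in {+1, -1}; the function names `restore_balance`, `accuracy` and `balanced_accuracy`, the common class size `target_size`, and the synthetic Gaussian data are hypothetical choices made for illustration, not the authors' implementation. The second half contrasts accuracy (ACC) with balanced accuracy (BA, taken here as the mean of the true-positive and true-negative rates) on an imbalanced test set.

```python
import numpy as np


def restore_balance(X_pos, X_neg, target_size, rng):
    """Resample both classes to `target_size` examples each (assumed scheme).

    A class larger than `target_size` is undersampled without replacement;
    a smaller class keeps all its examples and is padded with random duplicates.
    """
    def resample(X):
        n = len(X)
        if target_size <= n:
            # Undersampling: keep a random subset of the class.
            idx = rng.choice(n, size=target_size, replace=False)
        else:
            # Oversampling: keep every original, then add random duplicates.
            extra = rng.choice(n, size=target_size - n, replace=True)
            idx = np.concatenate([np.arange(n), extra])
        return X[idx]

    return resample(X_pos), resample(X_neg)


def accuracy(y_true, y_pred):
    """Fraction of correctly classified examples."""
    return np.mean(y_true == y_pred)


def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == -1] == -1)
    return 0.5 * (tpr + tnr)


rng = np.random.default_rng(0)

# Synthetic imbalanced data: 20 positives versus 200 negatives (P << N).
X_pos = rng.normal(loc=0.5, size=(20, 5))
X_neg = rng.normal(loc=-0.5, size=(200, 5))

# Mixed strategy: oversample positives and undersample negatives to 60 each.
X_pos_bal, X_neg_bal = restore_balance(X_pos, X_neg, target_size=60, rng=rng)

# A classifier that always predicts the majority class has high ACC
# but a BA of only one half.
y_true = np.concatenate([np.ones(20), -np.ones(200)])
y_trivial = -np.ones_like(y_true)
print(accuracy(y_true, y_trivial), balanced_accuracy(y_true, y_trivial))
```

Varying `target_size` between the minority and majority class sizes moves from pure undersampling to pure oversampling, which is the kind of mixing the abstract refers to.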

Paper Structure (46 sections, 1 theorem, 76 equations, 9 figures, 3 tables)

This paper contains 46 sections, 1 theorem, 76 equations, 9 figures, 3 tables.

Key Result

Proposition 2.1

Consider the ERM problem in Eq. eq:minimization-Jb under the assumption of training data distribution eq:Data. Let the fractions $\varphi^{\pm}$ be the composition of the test set and a performance metric of choice, with $\Delta^{\pm}$ defined as in Eq. eq:Deltas over test data points and $g$ a generic test loss function (see tab:metrics for examples). In the asymptotic regime $L, P, N \to \infty

Figures (9)

  • Figure 1: Illustration of our restoring balance procedure. Classifiers trained on datasets with severe imbalance (blue star) generally show poor generalization performance. Restoring balance by mixing under and oversampling improves classification performance (red stars across the line $P=N$). Here $P,N$ indicate the sizes of the positive and negative classes, initially with $P\ll N$.
  • Figure 2: Analytical results derived within our framework. a) Different metrics on synthetic data, evaluated on a test set with the same composition as the training set. Here $L=100$, $\kappa =2$, $\alpha^{-}=5$ with $C, \delta$ sampled randomly. b) Analytical predictions for benchmark datasets (MNIST, FashionMNIST and CelebA). Dots show numerical simulations, averaged over $10$ trials for $\kappa =0.5$. c) Balanced accuracy curves as a function of $\rho^-$. The margin $\kappa$ of the algorithm controls the balance-to-performance trade-off. Dots correspond to numerical simulations with $\alpha^+ = 2$, $Q=2$, $L=100$, position-independent $\vb{M}$, normally-distributed $\bm{\delta}$, and diagonal covariance $C$ averaged over $50$ trials.
  • Figure 3: Optimal mixing strategy. We report theoretical predictions for the BA metric as a function of the mixing under/oversampling percentage in the training set. Depending on the initial training set composition $(\rho^+, \rho^-)$, one can select the optimal strategy to restore balance. Here $L=100$, $\kappa =0.5$, with $C$ and $\bm{\delta}$ sampled randomly.
  • Figure 4: Numerical investigation of improved sampling techniques and the deep classifier ResNet-50. a) Mixed sampling strategies to obtain a balanced training set for classification with a linear SVM on binary MNIST, as a function of the new sample size $P^{\prime}$. Like random sampling techniques, higher-level methods also lead to an increase in performance. b) Geometrical interpretation of the effect of restoring balance (gold line) on the decision boundary of a linear SVM compared to imbalanced ERM (purple line). Data points are the MNIST test set. c) We visualize test data classification in the last feature layer of the network through tSNE, for imbalanced and balanced training sets (top and bottom, respectively). The network trained on balanced data achieves improved performance, separating the two classes better.
  • Figure 5: a) Behaviors of the order parameters $q$, $r$, $b$ for the data statistics in fig:fig_theory (panel a). The bias $b$ becomes steep around $\rho^{+} \sim 0.5$, and this effect is more pronounced for larger values of $\kappa$. b) Behaviors of the additional metrics defined in this work (see tab:app_metrics and tab:metrics), for synthetic data as in fig:fig_theory (panel a).
  • ...and 4 more figures

Theorems & Definitions (1)

  • Proposition 2.1