Sharp error bounds for imbalanced classification: how many examples in the minority class?
Anass Aghbalou, François Portier, Anne Sabourin
TL;DR
This work addresses learning under severe class imbalance by focusing on the balanced risk $\mathcal{R}_p(g)$, which reweights minority and majority errors. Writing $n$ for the full sample size and $p$ for the minority class probability, it develops non-asymptotic bounds in the relative rarity regime $p\to 0$, showing that estimation errors scale with the effective minority sample size $np$ rather than $n$, and proves fast rates of order $O(1/(np))$ under a Bernstein condition. The authors derive a deviation inequality for balanced risks over VC-type function classes and apply it to both constrained ERM and balanced $k$-NN, obtaining fast rates for ERM and consistency for $k$-NN when $kp\to\infty$. Numerical experiments on synthetic data corroborate the theory, illustrating the learning frontier defined by $np$ and the practical benefits of class weighting for imbalanced classification tasks.
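To make the balanced risk concrete, here is a minimal sketch of an empirical version, assuming it is the average of the two per-class error rates with labels in {0, 1} and 1 as the minority class. The function name `balanced_risk` and the toy data are illustrative assumptions, not taken from the paper, whose exact normalization may differ.

```python
def balanced_risk(y_true, y_pred):
    """Empirical balanced risk: mean of the per-class error rates.

    Assumes binary labels in {0, 1}, with 1 denoting the minority class.
    """
    pos_errors = [yp != 1 for yt, yp in zip(y_true, y_pred) if yt == 1]
    neg_errors = [yp != 0 for yt, yp in zip(y_true, y_pred) if yt == 0]
    if not pos_errors or not neg_errors:
        raise ValueError("both classes must be present in y_true")
    minority_error = sum(pos_errors) / len(pos_errors)
    majority_error = sum(neg_errors) / len(neg_errors)
    return 0.5 * (minority_error + majority_error)


# Toy sample: n = 100 with 5 minority examples (empirical p = 0.05).
y_true = [1] * 5 + [0] * 95
y_trivial = [0] * 100  # classifier that always predicts the majority class

# Unweighted error rate of the trivial classifier: 0.05.
plain_error = sum(yt != yp for yt, yp in zip(y_true, y_trivial)) / len(y_true)

# Number of minority examples, the empirical counterpart of n * p.
n_minority = sum(y_true)

print(plain_error, balanced_risk(y_true, y_trivial), n_minority)
```

The trivial classifier looks accurate under the unweighted error but has balanced risk 0.5, no better than random guessing. Its minority error rate is estimated from only 5 points, which is why the bounds depend on $np$ rather than $n$.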
Abstract
When dealing with imbalanced classification data, reweighting the loss function is a standard procedure that balances the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge of the imbalanced classification framework: the size of one class is negligible relative to the full sample size, and the risk function must be rescaled by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non-asymptotic fast-rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest-neighbor estimates. Our findings provide a clearer understanding of the benefits of class weighting in realistic settings, opening new avenues for further research in this field.
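As an illustration of class weighting in a nearest-neighbor rule, the sketch below weights minority votes by $1/\hat p$ and majority votes by $1/(1-\hat p)$, where $\hat p$ is the empirical minority proportion. This is equivalent to predicting the minority class whenever the fraction of minority neighbors is at least $\hat p$. This rule is an assumed reading of "balanced nearest neighbors"; the paper's exact estimator may differ, and `balanced_knn_predict` is a hypothetical name.

```python
import math


def balanced_knn_predict(X_train, y_train, x, k):
    """Class-weighted k-NN vote (illustrative sketch, not the paper's code).

    Assumes labels in {0, 1} with 1 the minority class, and points given
    as tuples of floats of equal length.
    """
    n = len(y_train)
    p_hat = sum(y_train) / n  # empirical minority proportion
    if not 0 < p_hat < 1:
        raise ValueError("both classes must be present in y_train")
    # Indices of the k training points closest to x (Euclidean distance).
    order = sorted(range(n), key=lambda i: math.dist(X_train[i], x))
    neighbours = order[:k]
    eta_hat = sum(y_train[i] for i in neighbours) / k  # minority fraction
    # Weighted vote; equivalent to the threshold rule eta_hat >= p_hat.
    score_minority = eta_hat / p_hat
    score_majority = (1 - eta_hat) / (1 - p_hat)
    return 1 if score_minority >= score_majority else 0


# Toy 1-D data: 2 minority points and 18 majority points (p_hat = 0.1).
X_train = [(0.0,), (5.0,), (0.1,), (0.2,), (0.3,)]
X_train += [(1.0 + 0.1 * j,) for j in range(15)]
y_train = [1, 1, 0, 0, 0] + [0] * 15

near_minority = balanced_knn_predict(X_train, y_train, (0.0,), k=4)
deep_majority = balanced_knn_predict(X_train, y_train, (2.0,), k=4)
print(near_minority, deep_majority)
```

At the query point 0.0, only one of the four neighbors is a minority point, so an unweighted majority vote would predict the majority class, while the weighted vote predicts the minority class because 0.25 exceeds $\hat p = 0.1$.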
