Sharp error bounds for imbalanced classification: how many examples in the minority class?

Anass Aghbalou; François Portier; Anne Sabourin

TL;DR

This work addresses the challenge of learning under severe class imbalance by focusing on the balanced risk $\mathcal{R}_p(g)$, which reweights minority and majority errors. It develops non-asymptotic bounds in the relative rarity regime $p\to 0$, showing that estimation errors scale with the effective minority sample size $np$ rather than $n$, and proves fast-rate results $O(1/(np))$ under a Bernstein condition. The authors establish a first deviation inequality for balanced risks over VC-type function classes and apply it to both constrained ERM and balanced $k$-NN, establishing consistency when $kp\to\infty$ and fast rates for ERM. Numerical experiments on synthetic data corroborate the theory, illustrating the learning frontier defined by $np$ and the practical benefits of class weighting for imbalanced classification tasks. These results advance theoretical guarantees in imbalanced learning and pave the way for robust, fast-converging methods in highly skewed domains.
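To make the role of the balanced risk concrete, here is a minimal sketch that computes an empirical balanced (arithmetic-mean) risk and compares it with the plain misclassification rate. The function name `balanced_risk`, the label coding in {0, 1}, and the 1/2 normalization are assumptions made for illustration; the paper's $\mathcal{R}_p(g)$ may use a different label convention or constant factor.

```python
import numpy as np

def balanced_risk(y_true, y_pred):
    """Empirical balanced risk: average of the two per-class error rates.

    Assumes labels in {0, 1}, with 1 denoting the minority class
    (an illustrative convention, not taken from the paper).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = ~pos
    if not pos.any() or not neg.any():
        raise ValueError("both classes must be present in y_true")
    false_negative_rate = np.mean(y_pred[pos] != 1)  # error on the minority class
    false_positive_rate = np.mean(y_pred[neg] != 0)  # error on the majority class
    return 0.5 * (false_negative_rate + false_positive_rate)

# Deterministic toy sample: n = 10_000 points, 100 of them in the minority class.
n, n_minority = 10_000, 100
y = np.zeros(n, dtype=int)
y[:n_minority] = 1

# A classifier that always predicts the majority class.
always_majority = np.zeros(n, dtype=int)

plain_error = float(np.mean(always_majority != y))
bal_risk = float(balanced_risk(y, always_majority))
print(plain_error, bal_risk)
```

The trivial classifier looks nearly perfect under the plain error but is no better than chance under the balanced risk, which is why the minority count (here 100, playing the role of $np$) is the quantity that matters.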

Abstract

When dealing with imbalanced classification data, reweighting the loss function is a standard procedure that balances the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge of the imbalanced classification framework: the negligible size of one class relative to the full sample size, and the resulting need to rescale the risk function by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non-asymptotic fast-rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest neighbors estimates. Our findings provide a clearer understanding of the benefits of class-weighting in realistic settings, opening new avenues for further research in this field.
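As an illustration of the second contribution, the sketch below shows one common form of a balanced nearest-neighbors rule: the local $k$-NN estimate of the minority probability is compared after dividing each class by its empirical frequency. The function name `balanced_knn_predict`, the {0, 1} label coding, and this particular decision rule are assumptions for illustration; the estimator analyzed in the paper may differ in its details.

```python
import numpy as np

def balanced_knn_predict(X_train, y_train, X_test, k):
    """Class-weighted k-NN rule (illustrative sketch, labels in {0, 1}).

    Predicts 1 when eta_hat(x) / p_hat >= (1 - eta_hat(x)) / (1 - p_hat),
    where eta_hat is the fraction of minority labels among the k nearest
    neighbors and p_hat is the empirical minority frequency.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    p_hat = y_train.mean()
    if not 0.0 < p_hat < 1.0:
        raise ValueError("both classes must be present in y_train")
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k nearest training points for each test point.
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    eta_hat = y_train[nn_idx].mean(axis=1)
    return (eta_hat / p_hat >= (1.0 - eta_hat) / (1.0 - p_hat)).astype(int)

# Toy 1-D sample: 18 majority points at 0..17 and 2 minority points at 20, 21.
X_train = np.concatenate([np.arange(18.0), [20.0, 21.0]]).reshape(-1, 1)
y_train = np.array([0] * 18 + [1] * 2)
X_test = np.array([[20.5], [5.0]])

pred = balanced_knn_predict(X_train, y_train, X_test, k=5)
print(pred)
```

At the first test point only 2 of the 5 neighbors are minority examples, so an unweighted majority vote would predict the majority class; the reweighted rule predicts the minority class because 2/5 is well above the empirical minority frequency of 0.1.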

Paper Structure (23 sections, 20 theorems, 93 equations, 8 figures)

Introduction
Definition and notation
Motivating examples
Standard learning rates under relative rarity
A First Deviation Inequality for Balanced Risks
Balanced $k$-Nearest Neighbor
Fast rates under relative rarity
Numerical illustration
Balanced $k$-Nearest Neighbors
Balanced ERM
Conclusion
Appendix
Auxiliary results
Standard rates proofs
Proof of Theorem \ref{theo:VC-standard-rate}
...and 8 more sections

Key Result

Theorem 3.2

Let $\mathcal{F}$ be of VC-type with constant envelope $U$ and parameters $(v,A)$. For any $n$ and $\delta$ such that [...], we have with probability $1-2\delta$, [...] for some universal explicit constant $K>0$.

Figures (8)

Figure 1: AM risk of the balanced $k$-NN (heatmap).
Figure 2: Excess balanced risk (log-scale) of logistic regression as a function of $n$, when $p=p_n\to 0$. Orange line: curve $1/np$. Blue area: inter-quantile range $[0.1,0.9]$.
Figure 3: Balanced accuracy heat map for the Breast dataset.
Figure 4: Balanced accuracy heat map for the Ionosphere dataset.
Figure 5: Balanced accuracy heat map for the Pima dataset.
...and 3 more figures

Theorems & Definitions (33)

Definition 3.1
Theorem 3.2
Remark 3.1
Corollary 3.3
Remark 3.2
Theorem 3.4
Corollary 3.5
Lemma 4.1: Sufficient conditions for $\mathcal{H}$ to satisfy a Bernstein-condition
Example 4.1
Lemma 4.2: VC-property of $\mathcal{H}$
...and 23 more
