Sharp error bounds for imbalanced classification: how many examples in the minority class?

Anass Aghbalou, François Portier, Anne Sabourin

TL;DR

This work addresses the challenge of learning under severe class imbalance by focusing on the balanced risk $\mathcal{R}_p(g)$, which reweights minority and majority errors. It develops non-asymptotic bounds in the relative rarity regime $p\to 0$, showing that estimation errors scale with the effective minority sample size $np$ rather than $n$, and proves fast-rate results $O(1/(np))$ under a Bernstein condition. The authors establish a first deviation inequality for balanced risks over VC-type function classes and apply it to both constrained ERM and balanced $k$-NN, establishing consistency when $kp\to\infty$ and fast rates for ERM. Numerical experiments on synthetic data corroborate the theory, illustrating the learning frontier defined by $np$ and the practical benefits of class weighting for imbalanced classification tasks. These results advance theoretical guarantees in imbalanced learning and pave the way for robust, fast-converging methods in highly skewed domains.
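To make the reweighting concrete, here is a minimal sketch of an empirical balanced risk, assuming labels in {0, 1} with 1 as the minority class and assuming the common "average of per-class error rates" form. The normalization the paper uses for $\mathcal{R}_p(g)$ may differ by a constant factor, and the function name `balanced_empirical_risk` is illustrative, not taken from the paper.

```python
def balanced_empirical_risk(y_true, y_pred):
    """Average of the two per-class error rates.

    Assumption: labels are 0 (majority) and 1 (minority); this is one
    common form of a balanced risk, not necessarily the paper's exact one.
    """
    # Errors restricted to the minority class (false negatives).
    pos_errors = [yp != 1 for yt, yp in zip(y_true, y_pred) if yt == 1]
    # Errors restricted to the majority class (false positives).
    neg_errors = [yp != 0 for yt, yp in zip(y_true, y_pred) if yt == 0]
    if not pos_errors or not neg_errors:
        raise ValueError("both classes must be present in y_true")
    fnr = sum(pos_errors) / len(pos_errors)
    fpr = sum(neg_errors) / len(neg_errors)
    return 0.5 * (fnr + fpr)


# A classifier that always predicts the majority class has a small plain
# error rate (here 2/10) but a balanced risk of 0.5.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * len(y_true)
plain_error = sum(a != b for a, b in zip(y_true, always_majority)) / len(y_true)
balanced_error = balanced_empirical_risk(y_true, always_majority)
```

The minority error rate in this sketch is an average over roughly $np$ observations only, which is consistent with the summary's point that estimation error is driven by $np$ rather than $n$.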

Abstract

When dealing with imbalanced classification data, reweighting the loss function is a standard procedure for balancing the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge of the imbalanced classification framework: the negligible size of one class relative to the full sample size, and the resulting need to rescale the risk function by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non-asymptotic fast rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest neighbors estimates. Our findings provide a clearer understanding of the benefits of class-weighting in realistic settings, opening new avenues for further research in this field.
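As an illustration of what a balanced nearest neighbors rule can look like, the sketch below reweights the $k$-NN vote by the inverse empirical class frequencies. This is an assumed, commonly used form rather than the authors' exact estimator; the function name `balanced_knn_predict`, the Euclidean metric, and the tie-breaking toward the minority class are all choices made for this example.

```python
import math


def balanced_knn_predict(X_train, y_train, x, k):
    """Class-weighted k-NN vote (labels 0/1, with 1 the minority class).

    Assumption: each class's share of the k neighbors is divided by its
    empirical class frequency, which amounts to predicting 1 when the
    neighbor fraction of positives is at least the overall fraction p_hat.
    """
    n = len(y_train)
    p_hat = sum(1 for y in y_train if y == 1) / n
    if p_hat in (0.0, 1.0):
        raise ValueError("both classes must be present in y_train")
    # Indices of the k training points closest to x (Euclidean distance).
    order = sorted(range(n), key=lambda i: math.dist(X_train[i], x))
    neighbors = order[:k]
    pos_frac = sum(1 for i in neighbors if y_train[i] == 1) / k
    # Reweighted vote; ties go to the minority class (an arbitrary choice).
    return 1 if pos_frac / p_hat >= (1 - pos_frac) / (1 - p_hat) else 0


# One positive among ten points: a plain majority vote with k = 3 would
# predict 0 near the positive, while the reweighted vote predicts 1.
X_train = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,),
           (0.5,), (0.6,), (0.7,), (1.0,), (0.9,)]
y_train = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
near_positive = balanced_knn_predict(X_train, y_train, (0.95,), k=3)
far_from_positive = balanced_knn_predict(X_train, y_train, (0.05,), k=3)
```

With $k$ neighbors, only about $kp$ of them are expected to come from the minority class, which gives some intuition for the condition $kp\to\infty$ quoted in the TL;DR.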

Paper Structure (23 sections, 20 theorems, 93 equations, 8 figures)

Key Result

Theorem 3.2

Let $\mathcal{F}$ be of VC-type with constant envelope $U$ and parameters $(v,A)$. For any $n$ and $\delta$ such that [...], we have with probability $1-2\delta$, [...] for some universal explicit constant $K>0$.

Figures (8)

  • Figure 1: AM risk of the balanced $k$-NN (heatmap).
  • Figure 2: Excess balanced risk (log-scale) of logistic regression as a function of $n$, when $p=p_n\to 0$. Orange line: curve $1/np$. Blue area: inter-quantile range $[0.1,0.9]$.
  • Figure 3: Balanced accuracy heat map for the Breast dataset.
  • Figure 4: Balanced accuracy heat map for the Ionosphere dataset.
  • Figure 5: Balanced accuracy heat map for the Pima dataset.
  • ...and 3 more figures

Theorems & Definitions (33)

  • Definition 3.1
  • Theorem 3.2
  • Remark 3.1
  • Corollary 3.3
  • Remark 3.2
  • Theorem 3.4
  • Corollary 3.5
  • Lemma 4.1: Sufficient conditions for $\mathcal{H}$ to satisfy a Bernstein-condition
  • Example 4.1
  • Lemma 4.2: VC-property of $\mathcal{H}$
  • ...and 23 more