Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong

TL;DR

This work tackles the pervasive problem of class imbalance in binary and multi-class learning by introducing a principled theoretical framework based on a class-imbalanced margin loss and strong $\mathscr{H}$-consistency guarantees. It defines the $(\rho_{+},\rho_{-})$-margin loss, develops margin-based generalization bounds using class-sensitive Rademacher complexity, and presents IMMAX, an Imbalanced Margin Maximization algorithm that extends to neural networks and general hypothesis sets. The approach is extended to multi-class settings with vector margins and corresponding risk bounds, and it is shown that common resampling and cost-sensitive methods lack Bayes-consistency under the standard misclassification loss. Empirically, IMMAX consistently outperforms a range of baselines on long-tailed and step-imbalanced CIFAR-10/100 and Tiny ImageNet, validating the theoretical guarantees and practical impact for robust, principled handling of imbalance.

Abstract

Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes-consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $\mathscr{H}$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
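
To make the idea of a class-dependent confidence margin concrete, here is a minimal sketch in Python. It assumes the loss takes the form of the standard ramp (ρ-margin) loss with the margin chosen by the label: ρ₊ for positive examples and ρ₋ for negative ones. The function names and the sample values are hypothetical, and this form is an illustration of the general idea, not necessarily the paper's exact Definition 3.1.

```python
def ramp_loss(u, rho):
    """Standard rho-margin (ramp) loss: 1 if u <= 0, 0 if u >= rho, linear in between."""
    return min(1.0, max(0.0, 1.0 - u / rho))


def class_margin_loss(score, label, rho_pos, rho_neg):
    """Ramp loss with a label-dependent margin (assumed form, for illustration only).

    score: real-valued prediction h(x); label: +1 or -1.
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    rho = rho_pos if label == 1 else rho_neg
    return ramp_loss(label * score, rho)


def empirical_margin_loss(scores, labels, rho_pos, rho_neg):
    """Average of the per-example class-dependent margin losses over a sample."""
    losses = [class_margin_loss(s, y, rho_pos, rho_neg) for s, y in zip(scores, labels)]
    return sum(losses) / len(losses)


# Hypothetical scores and labels; the positive class is given the larger margin.
scores = [0.5, 0.5, -0.5, -2.0]
labels = [1, -1, -1, 1]
per_example = [class_margin_loss(s, y, rho_pos=2.0, rho_neg=1.0) for s, y in zip(scores, labels)]
average = empirical_margin_loss(scores, labels, rho_pos=2.0, rho_neg=1.0)
```

With a larger ρ₊, a positive example is still penalized when it is classified correctly but with low confidence. This pushes the decision boundary away from that class.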

Paper Structure

This paper contains 35 sections, 27 theorems, 103 equations, 1 figure, 3 tables.

Key Result

Lemma 3.1

The class-imbalanced $(\rho_{+}, \rho_{-})$-margin loss function can be equivalently expressed as follows:

Figures (1)

  • Figure 1: Solutions in the separable case. Left: Empirical data with negative (blue) and positive (orange) points. The black line is the SVM solution, the red dashed line is the solution of Cao et al. (2019), and the blue dashed line is ours. Right: Full data distribution, showing that our solution achieves the lowest generalization error.

Theorems & Definitions (45)

  • Definition 3.1: Class-imbalanced margin loss function
  • Lemma 3.1
  • Theorem 3.2: $\mathscr{H}$-consistency bound for class-imbalanced margin loss
  • Definition 3.3: $(\rho_{+}, \rho_{-})$-class-sensitive Rademacher complexity
  • Theorem 3.4: Margin bound for imbalanced binary classification
  • Theorem 4.1
  • Definition 5.1: Multi-class class-imbalanced margin loss
  • Lemma 5.1
  • Theorem 5.2: $\mathscr{H}$-consistency bound for multi-class class-imbalanced margin loss
  • Definition 5.3: ${\boldsymbol \rho}$-class-sensitive Rademacher complexity
  • ...and 35 more