Learning Confidence Bounds for Classification with Imbalanced Data

Matt Clifford, Jonathan Erskine, Alexander Hepburn, Raúl Santos-Rodríguez, Dario Garcia-Garcia

TL;DR

This work tackles class imbalance by embedding class-specific confidence bounds into the bias term of a pre-trained binary classifier, grounded in concentration-inequality theory. It derives principled bounds and optimizes class-wise confidence levels to ensure non-overlapping class supports in a projected feature space, with a bias term adjusted at their intersection; slack variables are introduced to handle non-separable data. Empirical results on synthetic and real imbalanced datasets show the method often yields superior minority-class performance (G-Mean and F1) while maintaining broad applicability across classifiers, compared with SMOTE, thresholding, Bayes risk, and weighted approaches. The approach offers a theoretically principled, post-training adjustment that can be implemented with existing models and open-source tooling, with noted limitations when the learned representation poorly separates classes.

Abstract

Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques have been commonly employed to address this issue, yet they suffer from inherent limitations stemming from their simplistic approach, such as loss of information and additional biases, respectively. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. We focus on understanding the uncertainty in a class-dependent manner, as captured by confidence bounds that we directly embed into the learning process. By incorporating class-dependent estimates, our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes. We empirically show how our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.

Paper Structure (27 sections, 2 theorems, 14 equations, 6 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

Let $X_1,...,X_k$ be i.i.d. from a continuous distribution. Then $P(X_{a_1}<...<X_{a_k})= 1/k!$ for any permutation $a_1,...,a_k$ of $1,...,k$.
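
Theorem 1 can be checked numerically. The following is a minimal sketch, not taken from the paper: it draws $k$ i.i.d. uniform samples many times (uniform is just one convenient continuous distribution), records which ordering occurs, and compares the empirical frequencies with $1/k!$. The function name `ordering_frequencies` and the choices of `k`, `trials`, and `seed` are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def ordering_frequencies(k, trials, seed=0):
    """Estimate how often each ordering of k i.i.d. continuous samples occurs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        xs = [rng.random() for _ in range(k)]
        # Indices (a_1, ..., a_k), zero-based, such that X_{a_1} < ... < X_{a_k}.
        order = tuple(sorted(range(k), key=xs.__getitem__))
        counts[order] += 1
    return {perm: c / trials for perm, c in counts.items()}


k = 3
freqs = ordering_frequencies(k=k, trials=60000)
expected = 1 / math.factorial(k)

for perm, freq in sorted(freqs.items()):
    print(perm, round(freq, 4), "vs", round(expected, 4))
```

Each of the $k!$ orderings should appear with roughly the same frequency, since i.i.d. samples are exchangeable and ties occur with probability zero for a continuous distribution.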

Figures (6)

  • Figure 1: Training data of two normally distributed classes with class means $\{-\boldsymbol{\mu}, \boldsymbol{\mu}\}$, where $\boldsymbol{\mu} = \left[{55}\right]$, and unit covariance, together with the classifier $f(x) = \operatorname{sgn}(\langle \phi(x), w\rangle + b)$ (see Section \ref{sec:background} for classifier definitions). The decision boundary of the classifier (solid line) should be shifted towards the dashed line by adjusting the original bias term $b$ to $b'$, since it is highly uncertain that the minority class is well defined in the training data.
  • Figure 2: Left: 2D class imbalanced training dataset with samples $S_1$ and $S_2$ of each class. Right: $S_1$ and $S_2$ in the projected space from $\langle \phi(x), w\rangle$. Empirical means of each sample are shown with a cross. Distances $\hat{\bar{R}}_i$ and $\hat{D}$ are shown with solid bars. $\hat{R}_i$ are shown with dashed bars from the upper bound of Eq. \ref{R_hat_upper_bound}, where both $\delta_i = 0.\dot{9}$.
  • Figure 3: Logistic Regression trained on synthetic data described in Section \ref{sec:synthetic}. Training data points are blue and red circles for the majority and minority classes respectively. Decision boundary is shown with a white line.
  • Figure 4: Projected space from Logistic Regression of the training data. Training data points are blue and red circles for the majority and minority classes respectively. $\hat{\bar{R}}_1$ and $\hat{\bar{R}}_2$ are overlapping since the classifier has not been able to linearly separate the training data in the projected space.
  • Figure 5: Projected space from Logistic Regression of the training data. Training data points where $\xi_n=0$ after optimising $\delta_i$ are shown with blue and red circles for the majority and minority classes respectively. The point where $\hat{\bar{R}}_1$ and $\hat{\bar{R}}_2$ meet under the optimised $\delta_i$ is taken as the new decision boundary for classification.
  • ...and 1 more figures
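
The effect of replacing the bias $b$ with $b'$ in $f(x) = sgn(\langle \phi(x), w\rangle + b)$, as pictured in Figure 1, can be illustrated with a small sketch. This is not the paper's procedure: the projected scores are synthetic, and both bias values are hand-picked assumptions rather than values derived from the confidence bounds. It only shows that shifting the bias moves the decision boundary in the projected space and trades majority-class recall for minority-class recall.

```python
import random


def predict(z, b):
    """Classify a projected score z = <phi(x), w> as sgn(z + b)."""
    return 1 if z + b >= 0 else -1


def recall(scores, label, b):
    """Fraction of scores from one class that are assigned that class label."""
    return sum(predict(z, b) == label for z in scores) / len(scores)


rng = random.Random(0)
# Hypothetical 1D projected scores: a large majority class (label -1)
# and a small minority class (label +1).
majority = [rng.gauss(-2.0, 1.0) for _ in range(1000)]
minority = [rng.gauss(1.0, 1.0) for _ in range(20)]

b = -0.5          # hypothetical original bias: boundary at z = 0.5
b_shifted = 0.5   # hypothetical adjusted bias b': boundary at z = -0.5

for name, bias in [("b", b), ("b'", b_shifted)]:
    print(
        name,
        "minority recall:", round(recall(minority, 1, bias), 3),
        "majority recall:", round(recall(majority, -1, bias), 3),
    )
```

Because the boundary sits at $z = -b$, increasing the bias moves it towards the majority class, so minority recall cannot decrease and majority recall cannot increase. How far to move it is what the paper's optimised class-wise confidence levels are meant to decide.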

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2: Shawe-Taylor & Cristianini, 2004