Learning Confidence Bounds for Classification with Imbalanced Data
Matt Clifford, Jonathan Erskine, Alexander Hepburn, Raúl Santos-Rodríguez, Dario Garcia-Garcia
TL;DR
This work tackles class imbalance by embedding class-specific confidence bounds, derived from concentration inequalities, into the bias term of a pre-trained binary classifier. It derives principled bounds and optimizes class-wise confidence levels so that the class supports do not overlap in a projected feature space; the bias term is then adjusted at the intersection of those supports, and slack variables are introduced to handle non-separable data. Empirical results on synthetic and real imbalanced datasets show that, compared with SMOTE, thresholding, Bayes risk, and weighted approaches, the method often yields superior minority-class performance (G-Mean and F1) while remaining applicable across a broad range of classifiers. The approach is a theoretically principled, post-training adjustment that can be applied to existing models with open-source tooling. Its main noted limitation arises when the learned representation separates the classes poorly.
Abstract
Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques are commonly employed to address this issue, yet their simplicity brings inherent limitations: undersampling discards information, while oversampling introduces additional biases. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of these traditional solutions. We focus on understanding uncertainty in a class-dependent manner, as captured by confidence bounds that we embed directly into the learning process. By incorporating class-dependent estimates, our method adapts to the varying degrees of imbalance across classes, resulting in more robust and reliable classification outcomes. We show empirically that our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.
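To make the idea of class-dependent confidence bounds concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the classifier's scores are one-dimensional projections, uses Hoeffding's inequality as a stand-in concentration bound, and approximates the score range by the empirical range. The function names `hoeffding_radius` and `adjusted_threshold` are hypothetical, and the optimization of confidence levels and the slack variables described in the paper are not reproduced here.

```python
import numpy as np


def hoeffding_radius(n, delta, score_range):
    """Hoeffding half-width for the mean of n scores bounded in an interval
    of length score_range, holding with probability at least 1 - delta."""
    return score_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))


def adjusted_threshold(scores, labels, delta0=0.05, delta1=0.05):
    """Illustrative threshold on 1-D classifier scores (hypothetical helper).

    Each class mean gets a confidence radius that grows as the class gets
    smaller. The threshold is placed between the two class means at the
    point where the two intervals, inflated by a common factor, would meet.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    s0, s1 = scores[labels == 0], scores[labels == 1]

    # Assumption: the empirical range stands in for the true score range.
    score_range = scores.max() - scores.min()
    r0 = hoeffding_radius(len(s0), delta0, score_range)
    r1 = hoeffding_radius(len(s1), delta1, score_range)

    mu0, mu1 = s0.mean(), s1.mean()
    # The class with the wider (less certain) bound receives more room.
    return mu0 + (mu1 - mu0) * r0 / (r0 + r1)


# Toy example: 100 majority scores at -1 and 4 minority scores at +1.
toy_scores = np.array([-1.0] * 100 + [1.0] * 4)
toy_labels = np.array([0] * 100 + [1] * 4)
toy_threshold = adjusted_threshold(toy_scores, toy_labels)
```

In this toy example the threshold moves from the midpoint of the class means (0) towards the majority class, because the minority-class mean is estimated from far fewer samples and so has a wider bound. For a linear score f(x) = w·x + b, subtracting the returned threshold from b moves the decision boundary to that score.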
