Robust performance metrics for imbalanced classification problems
Hajo Holzmann, Bernhard Klar
TL;DR
The paper shows that traditional binary classification metrics such as the F-score, the Jaccard similarity coefficient, and Matthews' correlation coefficient (MCC) are not robust under extreme class imbalance: as the minority-class proportion $\pi$ tends to $0$, the Bayes-optimal classifier under each of these metrics neglects the minority class. It formalizes Bayes decision rules as density-ratio thresholding and shows that the optimal threshold $\delta^*$ depends on the metric, so that the true positive rate (recall) of the optimal rule collapses in imbalanced settings. To address this, robust variants of the F-score and MCC are proposed, with tunable parameters that bound $\delta^*$ and keep the true positive rate away from $0$ when the minority class is scarce. The work links these metrics to ROC and precision-recall analyses, supports the findings with numerical simulations and a credit-default data set, and gives practical guidance on combining robust metrics with ROC and precision-recall curves for classifier selection in imbalanced problems.
Abstract
We show that established performance metrics in binary classification, such as the F-score, the Jaccard similarity coefficient or Matthews' correlation coefficient (MCC), are not robust to class imbalance in the sense that if the proportion of the minority class tends to $0$, the true positive rate (TPR) of the Bayes classifier under these metrics tends to $0$ as well. Thus, in imbalanced classification problems, these metrics favour classifiers which ignore the minority class. To alleviate this issue we introduce robust modifications of the F-score and the MCC for which, even in strongly imbalanced settings, the TPR is bounded away from $0$. We numerically illustrate the behaviour of the various performance metrics in simulations as well as on a credit default data set. We also discuss connections to the ROC and precision-recall curves and give recommendations on how to combine their usage with performance metrics.
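The non-robustness claim can be illustrated with a small numerical sketch. The setup below is an assumption made for illustration, not the paper's simulation design: the majority class is $N(0,1)$, the minority class is $N(1,1)$ with proportion $\pi$, and the classifier predicts the minority class when $x > t$. For equal-variance Gaussians the density ratio is increasing in $x$, so thresholding $x$ is equivalent to thresholding the density ratio. The function names `upper_tail`, `f1_score` and `f1_optimal_tpr` are hypothetical helpers, and the F-score is taken with $\beta = 1$.

```python
import math


def upper_tail(x):
    """Return P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def f1_score(t, pi, shift=1.0):
    """Population F1 of the rule 'predict minority if x > t'.

    Assumed model: majority ~ N(0, 1), minority ~ N(shift, 1),
    minority proportion pi.
    """
    tpr = upper_tail(t - shift)   # true positive rate (recall)
    fpr = upper_tail(t)           # false positive rate
    tp = pi * tpr                 # joint probability of a true positive
    fp = (1.0 - pi) * fpr         # joint probability of a false positive
    fn = pi * (1.0 - tpr)         # joint probability of a false negative
    return 2.0 * tp / (2.0 * tp + fp + fn)


def f1_optimal_tpr(pi, shift=1.0):
    """Grid-search the F1-maximising threshold; return (threshold, TPR)."""
    grid = [-5.0 + 0.01 * k for k in range(1301)]  # thresholds in [-5, 8]
    best_t = max(grid, key=lambda t: f1_score(t, pi, shift))
    return best_t, upper_tail(best_t - shift)


if __name__ == "__main__":
    for pi in (0.5, 0.1, 0.01, 0.001):
        t, tpr = f1_optimal_tpr(pi)
        print(f"pi={pi:<6} optimal threshold={t:5.2f}  TPR at optimum={tpr:.3f}")
```

In this toy model the F1-maximising threshold moves further into the tail as $\pi$ decreases, so the true positive rate at the optimum shrinks, in line with the behaviour described in the abstract. The sketch does not implement the robust modifications, whose definitions are given in the paper itself.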
