Robust performance metrics for imbalanced classification problems

Hajo Holzmann, Bernhard Klar

TL;DR

The paper investigates how traditional binary classification metrics such as F-score, Jaccard, and MCC fail to be robust under extreme class imbalance, causing Bayes-optimal classifiers to neglect the minority class as $\pi \to 0$. It formalizes Bayes decision rules as density-ratio thresholding and shows that the optimal threshold $\delta^*$ depends on the metric, leading to non-robust recall in imbalanced settings. To address this, robust variants of the F-score and MCC are proposed, incorporating tunable parameters to bound $\delta^*$ and preserve meaningful true-positive rates when the minority class is scarce. The work links these metrics to ROC and precision-recall analyses, supports findings with numerical simulations and a credit-default dataset, and provides practical guidance for integrating robust metrics with ROC/precision-recall visualizations for classifier selection in imbalanced problems.

Abstract

We show that established performance metrics in binary classification, such as the F-score, the Jaccard similarity coefficient or Matthews' correlation coefficient (MCC), are not robust to class imbalance in the sense that if the proportion of the minority class tends to $0$, the true positive rate (TPR) of the Bayes classifier under these metrics tends to $0$ as well. Thus, in imbalanced classification problems, these metrics favour classifiers which ignore the minority class. To alleviate this issue we introduce robust modifications of the F-score and the MCC for which, even in strongly imbalanced settings, the TPR is bounded away from $0$. We numerically illustrate the behaviour of the various performance metrics in simulations as well as on a credit default data set. We also discuss connections to the ROC and precision-recall curves and give recommendations on how to combine their usage with performance metrics.
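The non-robustness described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a toy model with class 0 drawn from N(0, 1), class 1 from N(mu, 1) with mu = 2, and minority-class proportion pi. In this model the density ratio is increasing in x, so rules of the form "predict 1 if x > t" coincide with density-ratio thresholding. The function names, the grid, and the choice of mu are illustrative assumptions.

```python
import math

def tail(z):
    """Standard normal upper-tail probability P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def confusion(pi, t, mu=2.0):
    """Population confusion-matrix cells for the rule 'predict 1 if x > t'.

    Assumed toy model: class 0 ~ N(0, 1), class 1 ~ N(mu, 1), P(Y = 1) = pi.
    """
    tpr, fpr = tail(t - mu), tail(t)
    tp, fn = pi * tpr, pi * (1.0 - tpr)
    fp, tn = (1.0 - pi) * fpr, (1.0 - pi) * (1.0 - fpr)
    return tp, fp, fn, tn

def f1(tp, fp, fn, tn):
    """F1-score written in terms of confusion-matrix cells."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthews' correlation coefficient; returns 0 if a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def tpr_at_optimum(metric, pi, mu=2.0):
    """TPR of the threshold rule that maximises `metric` over a grid of t."""
    grid = [-5.0 + 0.005 * k for k in range(2601)]  # t from -5 to 8
    best_t = max(grid, key=lambda t: metric(*confusion(pi, t, mu)))
    return tail(best_t - mu)

priors = [0.5, 0.01, 0.0001]
tpr_f1 = {pi: tpr_at_optimum(f1, pi) for pi in priors}
tpr_mcc = {pi: tpr_at_optimum(mcc, pi) for pi in priors}

for pi in priors:
    print(f"pi={pi}: TPR at F1 optimum={tpr_f1[pi]:.3f}, "
          f"TPR at MCC optimum={tpr_mcc[pi]:.3f}")
```

In this toy setting the TPR at the metric-optimal threshold shrinks as pi decreases, which is the behaviour the abstract attributes to the F-score and the MCC. The Jaccard coefficient is a monotone transform of the F1-score, so it shares the F1-optimal threshold.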

Paper Structure

This paper contains 18 sections, 1 theorem, 51 equations, 9 figures, 21 tables.

Key Result

Theorem 3

Under the paper's assumption (labelled ass:boundedderv in the source), any Bayes classifier is a density-ratio thresholding rule with threshold $\delta^*$, where $\delta^*$ is determined by a fixed-point equation.
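A density-ratio thresholding rule of the kind the theorem refers to can be sketched as follows. This is an illustration under stated assumptions rather than the paper's construction: the ratio is taken as f1/f0 (minority-class density over majority-class density), the inequality is strict, and the threshold `delta` is supplied by hand instead of being solved from the fixed-point equation. The two Gaussian densities with unequal variances are a hypothetical QDA-like example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def density_ratio_classifier(f1, f0, delta):
    """Return the rule 'predict 1 if f1(x) / f0(x) > delta'.

    The comparison is written as f1(x) > delta * f0(x) to avoid
    dividing by a density that is numerically zero.
    """
    def classify(x):
        return 1 if f1(x) > delta * f0(x) else 0
    return classify

# Hypothetical densities: class 0 ~ N(0, 1), class 1 ~ N(0, 2^2).
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 0.0, 2.0)

rule = density_ratio_classifier(f1, f0, delta=1.0)
print([rule(x) for x in (-2.0, 0.0, 1.0, 2.0)])  # prints [1, 0, 0, 1]
```

Here the ratio equals 0.5 * exp(3x^2 / 8), so with delta = 1 the positive region is |x| > sqrt(8 ln 2 / 3), roughly 1.36. A larger delta shrinks this region, which is how a metric-dependent $\delta^*$ that grows as $\pi \to 0$ lowers the true positive rate.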

Figures (9)

  • Figure 1: Plots of the optimal threshold $\delta^\ast$ as a function of $\pi$ for the different performance metrics used in Example 4 (LDA).
  • Figure 2: Plots of the optimal threshold $\delta^\ast$ as a function of $\pi$ for the different performance metrics used in Example 5 (QDA).
  • Figure 3: Plots of the optimal threshold $\delta^\ast$ as a function of $\pi$ for the $\text{F}_{\text{rb}}$-score proposed in the paper, with different choices of the parameters.
  • Figure 4: Plots of the optimal threshold $\delta^\ast$ as a function of $\pi$ for the $\text{MCC}_{\text{rb}}$ proposed in the paper, using different choices of the parameter $d$.
  • Figure 5: Population ROC curve for Example 4 (LDA) in black, solid line. Plot of recall against 1-precision in color with different line styles for varying $\pi$. Circles show the corresponding MCC optimal points; triangles show the points optimal with respect to the robust MCC with $d=0.15$.
  • ...and 4 more figures
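
The contrast drawn in Figure 5 between the ROC curve and the plot of recall against 1-precision can be reproduced in a small sketch. It assumes the same kind of toy Gaussian model as an LDA-style example (class 0 ~ N(0, 1), class 1 ~ N(mu, 1) with mu = 2, rule "predict 1 if x > t"); these choices and the function names are illustrative and not taken from the paper.

```python
import math

def tail(z):
    """Standard normal upper-tail probability P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def roc_point(t, mu=2.0):
    """(FPR, TPR) of the rule 'predict 1 if x > t'; does not involve pi."""
    return tail(t), tail(t - mu)

def precision(pi, t, mu=2.0):
    """Population precision of the same rule when P(Y = 1) = pi."""
    fpr, tpr = roc_point(t, mu)
    return pi * tpr / (pi * tpr + (1.0 - pi) * fpr)

t = 1.0
fpr, tpr = roc_point(t)
for pi in (0.5, 0.1, 0.01):
    # Recall (TPR) is the same for every pi; 1 - precision is not.
    print(f"pi={pi}: FPR={fpr:.4f}, recall={tpr:.4f}, "
          f"1-precision={1.0 - precision(pi, t):.4f}")
```

The ROC point for a fixed threshold is the same for every $\pi$, whereas precision at that threshold falls as $\pi$ decreases. This is why a single population ROC curve is accompanied by one recall versus 1-precision curve per value of $\pi$.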

Theorems & Definitions (8)

  • Example 1: Approximately balanced case
  • Example 2: Imbalanced case
  • Theorem 3
  • Example 4: LDA
  • Example 5: QDA
  • Proof of Theorem 3
  • Example 6
  • Example 7: Sensitivity and specificity for LDA and QDA