
AnyLoss: Transforming Classification Metrics into Loss Functions

Doheon Han, Nuno Moniz, Nitesh V Chawla

TL;DR

This work addresses the problem that confusion-matrix-based evaluation metrics are not differentiable and therefore cannot be optimized directly during training. It introduces AnyLoss, a general framework that uses a probability-amplification function $A(p) = \frac{1}{1+e^{-L(p-0.5)}}$, where $L$ controls how sharply probabilities are pushed toward 0 or 1, to build a differentiable confusion matrix, and it derives gradient-based losses for metrics such as accuracy, $F_\beta$, geometric mean, and balanced accuracy. Experiments on 102 diverse datasets and four imbalanced datasets show broad applicability, competitive learning speed, and better handling of imbalance than baselines and a state-of-the-art surrogate. The approach offers a practical path to optimizing task-specific metrics directly, potentially reducing hyperparameter tuning and improving real-world performance; multi-class extensions and automatic selection of $L$ are left to future work.
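As a rough illustration of the idea summarized above, the sketch below amplifies class probabilities with $A(p)$, accumulates a "soft" confusion matrix, and turns $F_\beta$ into a loss. Only the formula for $A$ comes from the text. The function names (`amplify`, `soft_confusion`, `f_beta_loss`), the default `L=20.0`, the product form of the confusion-matrix entries, and the choice of `1 - score` as the loss are assumptions made for this example, not the authors' reference implementation.

```python
import math

def amplify(p, L=20.0):
    """A(p) = 1 / (1 + exp(-L * (p - 0.5))); L=20 is an illustrative choice."""
    return 1.0 / (1.0 + math.exp(-L * (p - 0.5)))

def soft_confusion(y_true, probs, L=20.0):
    """Accumulate differentiable TP, FN, FP, TN from labels in {0, 1}
    and amplified probabilities (assumed product form)."""
    tp = fn = fp = tn = 0.0
    for y, p in zip(y_true, probs):
        yh = amplify(p, L)           # close to 0 or 1, but still smooth in p
        tp += y * yh
        fn += y * (1.0 - yh)
        fp += (1 - y) * yh
        tn += (1 - y) * (1.0 - yh)
    return tp, fn, fp, tn

def f_beta_loss(y_true, probs, beta=1.0, L=20.0):
    """Loss defined here as 1 - F_beta, computed on the soft confusion matrix."""
    tp, fn, fp, _ = soft_confusion(y_true, probs, L)
    b2 = beta * beta
    score = (1.0 + b2) * tp / ((1.0 + b2) * tp + b2 * fn + fp)
    return 1.0 - score

if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    probabilities = [0.9, 0.1, 0.8, 0.3]
    print(soft_confusion(labels, probabilities))
    print(f_beta_loss(labels, probabilities))
```

Because every entry is a sum of smooth functions of the probabilities, any metric written in terms of TP, FN, FP, and TN stays differentiable; in practice the same arithmetic would be written with an autodiff framework's tensors so that gradients flow back to the network weights.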

Abstract

Many evaluation metrics can be used to assess the performance of models in binary classification tasks. However, most of them are derived from a confusion matrix in a non-differentiable form, making it very difficult to construct a differentiable loss function that optimizes them directly. The lack of a solution to this problem not only hinders our ability to solve difficult tasks, such as imbalanced learning, but also forces the use of computationally expensive hyperparameter searches in model selection. In this paper, we propose a general-purpose approach that transforms any confusion matrix-based metric into a loss function, AnyLoss, that can be used in optimization processes. To this end, we use an approximation function to represent the confusion matrix in a differentiable form, which enables any confusion matrix-based metric to be used directly as a loss function. We describe the mechanism of the approximation function to establish its operability, and we show the differentiability of our loss functions by providing their derivatives. We conduct extensive experiments with diverse neural networks on many datasets and demonstrate that the approach applies generally to any confusion matrix-based metric. In particular, our method performs strongly on imbalanced datasets, and its competitive learning speed, compared with multiple baseline models, underscores its efficiency.

Paper Structure (21 sections, 23 equations, 5 figures, 18 tables)


Figures (5)

  • Figure 1: Our method in the multi-layer perceptron (MLP) structure. Input X and weights W generate the net value Z, and the sigmoid function $\sigma$ transforms the net value into the class probability P. The approximation function $A$ generates YH by amplifying the probability P. The confusion matrix is constructed in a differentiable form from the ground truth Y and YH. Consequently, our loss function, AnyLoss, which can target any confusion matrix-based metric, is available in a differentiable form.
  • Figure 2: The approximation function $A(p_i)$ with two different $L$ values. On the left, with a smaller $L$, $A(p_i = 0.1)$ should be smaller than 0.1 but comes out as 0.12. On the right, with a larger $L$, $A(p_i = 0.9)$ converges to 1.0.
  • Figure 3: Learning curves with four different $L$ values. The model does not learn with a small $L$ and learns better with a larger $L$, but it stops learning if $L$ is too large.
  • Figure 4: The winning probability of AnyLoss against baseline models, shown as stacked bar graphs with losing at the bottom, drawing in the middle, and winning at the top. M and B denote MSE and BCE, respectively. In the SLP, AnyLoss mostly has a larger winning probability against the baseline models, and in some cases a larger drawing probability is observed. In the MLP, the results are similar to those in the SLP, but more cases with a larger drawing probability are observed. No red segments appear, meaning there are no cases in which AnyLoss loses.
  • Figure 5: Learning curves of BCE vs. AnyLoss. The slopes are similar on dataset 1, and AnyLoss shows a steeper slope on the other datasets, meaning it learns faster and needs fewer epochs to reach its optimum.
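The behavior described in the captions of Figures 2 and 3 can be checked numerically. The sketch below evaluates $A(p)$ and its derivative $\frac{dA}{dp} = L \, A(p) \, (1 - A(p))$ for a few values of $L$. The specific values 5, 20, and 200 are assumptions chosen for illustration (the captions do not state which values were plotted); $L = 5$ is used because it gives $A(0.1) \approx 0.12$, matching the number quoted for Figure 2. The helper names are hypothetical.

```python
import math

def amplify(p, L):
    """A(p) = 1 / (1 + exp(-L * (p - 0.5)))."""
    return 1.0 / (1.0 + math.exp(-L * (p - 0.5)))

def amplify_grad(p, L):
    """Derivative of A with respect to p: L * A(p) * (1 - A(p))."""
    a = amplify(p, L)
    return L * a * (1.0 - a)

if __name__ == "__main__":
    # Illustrative L values only; they are not taken from the paper's figures.
    for L in (5.0, 20.0, 200.0):
        for p in (0.1, 0.9):
            print(f"L={L:>5}: A({p})={amplify(p, L):.4f}, "
                  f"dA/dp={amplify_grad(p, L):.3e}")
```

With a small $L$ the output for $p = 0.1$ lands above 0.1, so the probability is not amplified toward 0 at all. With a very large $L$ the derivative away from $p = 0.5$ is effectively zero, which is one plausible reading of why the learning curve in Figure 3 stalls when $L$ is too large.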