Precision and Recall Reject Curves for Classification

Lydia Fischer, Patricia Wollstadt

TL;DR

The paper addresses the challenge of evaluating classifiers with reject options when precision or recall is the preferred performance metric, especially for imbalanced data. It introduces precision-reject curves (PRCs) and recall-reject curves (RRCs) and validates them using prototype-based classifiers (LVQ variants) with certainty measures (Conf, RelSim) and a Bayes baseline. Across artificial data, benchmark datasets, and medical data, PRCs and RRCs provide more meaningful and reliable insights than accuracy-reject curves (ARCs), particularly at higher acceptance rates, where ARCs can mislead. The work offers practical tools for deploying reliable, high-certainty predictions in safety-critical and imbalanced-domain applications, with future work targeting multi-class extensions and additional evaluation metrics.

Abstract

For some classification scenarios, it is desirable to use only those classification instances that a trained model associates with a high certainty. To obtain such high-certainty instances, previous work has proposed accuracy-reject curves. Reject curves make it possible to evaluate and compare the performance of different certainty measures over a range of thresholds for accepting or rejecting classifications. However, accuracy may not be the most suitable evaluation metric for all applications, and precision or recall may be preferable instead. This is the case, for example, for data with imbalanced class distributions. We therefore propose reject curves that evaluate precision and recall: the recall-reject curve and the precision-reject curve. Using prototype-based classifiers from learning vector quantization, we first validate the proposed curves on artificial benchmark data against the accuracy-reject curve as a baseline. We then show on imbalanced benchmarks and medical, real-world data that for these scenarios, the proposed precision- and recall-reject curves yield more accurate insights into classifier performance than accuracy-reject curves.
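
To make the idea of a reject curve concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary problem with positive class label 1, a per-sample certainty score in which larger means more certain, and that accuracy, precision, and recall are computed only on the accepted points (those with certainty at or above the threshold). The function name `reject_curves` and its return layout are hypothetical.

```python
def reject_curves(y_true, y_pred, certainty, thresholds):
    """Sketch: accuracy, precision and recall on accepted points per threshold.

    Assumptions (not taken from the paper): binary labels with positive
    class 1, and all metrics are evaluated on accepted points only.
    """
    n_total = len(y_true)
    curves = []
    for theta in thresholds:
        # A point is accepted if its certainty reaches the threshold.
        accepted = [i for i in range(n_total) if certainty[i] >= theta]
        point = {"threshold": theta, "acceptance_rate": len(accepted) / n_total,
                 "accuracy": None, "precision": None, "recall": None}
        if accepted:
            tp = sum(1 for i in accepted if y_pred[i] == 1 and y_true[i] == 1)
            fp = sum(1 for i in accepted if y_pred[i] == 1 and y_true[i] != 1)
            fn = sum(1 for i in accepted if y_pred[i] != 1 and y_true[i] == 1)
            correct = sum(1 for i in accepted if y_pred[i] == y_true[i])
            point["accuracy"] = correct / len(accepted)
            # Metrics stay None when their denominator is zero.
            if tp + fp > 0:
                point["precision"] = tp / (tp + fp)
            if tp + fn > 0:
                point["recall"] = tp / (tp + fn)
        curves.append(point)
    return curves


# Toy data: six samples, two of them misclassified with low certainty.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
certainty = [0.9, 0.2, 0.8, 0.3, 0.7, 0.6]
curves = reject_curves(y_true, y_pred, certainty, thresholds=[0.0, 0.5])
```

Plotting `accuracy`, `precision`, or `recall` against `acceptance_rate` over a sweep of thresholds gives the accuracy-, precision-, and recall-reject curves under these assumptions. In the toy data, raising the threshold from 0.0 to 0.5 rejects the two low-certainty errors, so all three metrics rise on the remaining points.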

Paper Structure (7 sections, 11 equations, 3 figures)

Figures (3)

  • Figure 1: Averaged reject curves for the different models on the artificial Gaussian data (mean over models from different runs). The solid lines represent the optimal classification performance of a Bayesian classifier. The PRCs and RRCs based on RelSim or Conf perform similarly to the optimal ARCs [FischerHW14, fischer2016optimal] in the important regime of at least $80\,\%$ accepted data points.
  • Figure 2: Results for different LVQ models on benchmark data. The ARCs [FischerHW14, fischer2016optimal] serve as a comparison. The PRCs and RRCs based on RelSim or Conf perform differently across the given set-ups, which helps the user choose a suitable reject threshold for the application scenario at hand.
  • Figure 3: For the Adrenal data, the averaged ARC [fischer2016optimal] and PRC perform similarly in the important regime of at least $80\,\%$ accepted data points, while the RRC has a different shape. RelSim is used as the certainty measure.