
Stacked Confusion Reject Plots (SCORE)

Stephan Hasler, Lydia Fischer

TL;DR

The paper addresses the interpretability of reject options in classification by introducing Stacked Confusion Reject Plots (SCORE). Instead of plotting a single global metric, SCORE stacks the confusion counts $\mathrm{confusion}_{c_t,c_p}$ for each pair of true class $c_t$ and predicted class $c_p$ and shows how they change as the acceptance rate varies, which makes class-imbalance effects and misclassification risks visible. The method supports variants in ordering, alignment, and normalization, and generalizes to $C>2$ classes by condensing the errors per true class. Example plots on artificial Gaussian data and a public Python package suggest practical relevance for high-stakes domains where trustworthy rejection decisions are critical.

Abstract

Machine learning is increasingly applied in critical application areas such as health and driver assistance. To minimize the risk of wrong decisions in such applications, it is necessary to consider the certainty of a classification and to reject uncertain samples. An established tool for this is the reject curve, which visualizes the trade-off between the number of rejected samples and a classification performance metric. We argue that common reject curves are too abstract and hard for non-experts to interpret. We propose Stacked Confusion Reject Plots (SCORE), which offer a more intuitive understanding of the data used and of the classifier's behavior. We present example plots on artificial Gaussian data to document the different options of SCORE and provide the code as a Python package.
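To make the notion of a reject curve concrete, the following is a minimal sketch of an accuracy-reject curve on artificial one-dimensional Gaussian data. It is not taken from the authors' package: the function name `accuracy_reject_curve`, the decision boundary at zero, and the use of the distance to that boundary as the certainty measure are assumptions chosen for illustration only.

```python
import random


def accuracy_reject_curve(certainties, correct):
    """Return (reject_rate, accuracy) pairs, rejecting the least certain samples first.

    Hypothetical helper for illustration; not part of the SCORE package.
    """
    n = len(certainties)
    # Indices sorted from most to least certain.
    order = sorted(range(n), key=lambda i: certainties[i], reverse=True)
    curve = []
    hits = 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        # Accept the k most certain samples and reject the remaining n - k.
        curve.append(((n - k) / n, hits / k))
    # Reverse so that the reject rate increases along the curve.
    return curve[::-1]


# Artificial data: two unit-variance Gaussians centered at -1 and +1 (assumed setup).
rng = random.Random(0)
samples = [(rng.gauss(-1.0, 1.0), 0) for _ in range(500)]
samples += [(rng.gauss(1.0, 1.0), 1) for _ in range(500)]

# Assumed classifier: predict class 1 for x > 0; certainty is the distance to the boundary.
predicted = [int(x > 0) for x, _ in samples]
certainty = [abs(x) for x, _ in samples]
correct = [int(p == t) for p, (_, t) in zip(predicted, samples)]

curve = accuracy_reject_curve(certainty, correct)
for reject_rate, accuracy in curve[::200]:
    print(f"reject rate {reject_rate:.2f} -> accuracy {accuracy:.3f}")
```

Each point on the curve reduces the classifier's behavior to a single number, which is the abstraction the abstract argues is hard for non-experts to read.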

Paper Structure (5 sections, 2 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Standard reject plots vs. a variant of SCORE for the 2-class setting. a) ACR [nadeem2009accuracy], PRC & RRC [Fischer_wsom24] might be hard to interpret. b) Stacked confusions give more insight into the class distribution and the behavior of the classifier.
  • Figure 2: Further variants of SCORE for the 2-class setting. Ordering, alignment, and normalization are used to highlight different aspects of the confusion stack.
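
The quantities behind such a stacked confusion plot can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the API of the authors' Python package: `confusion_stack` and `normalize_by_accepted` are hypothetical names, and dividing by the number of accepted samples is only one possible normalization.

```python
def confusion_stack(true_labels, predicted_labels, certainties, num_classes, thresholds):
    """Count accepted (true, predicted) pairs and rejected samples per true class.

    Hypothetical helper for illustration; a sample is accepted at a threshold
    when its certainty is at least that threshold.
    """
    rows = []
    for threshold in thresholds:
        accepted = {(t, p): 0 for t in range(num_classes) for p in range(num_classes)}
        rejected = [0] * num_classes
        for t, p, c in zip(true_labels, predicted_labels, certainties):
            if c >= threshold:
                accepted[(t, p)] += 1
            else:
                rejected[t] += 1
        rows.append({"threshold": threshold, "accepted": accepted, "rejected": rejected})
    return rows


def normalize_by_accepted(row):
    """One assumed normalization: express each confusion count as a share of accepted samples."""
    total = sum(row["accepted"].values())
    if total == 0:
        return {pair: 0.0 for pair in row["accepted"]}
    return {pair: count / total for pair, count in row["accepted"].items()}


# Tiny 2-class example with hand-picked certainties.
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]
certainties = [0.9, 0.2, 0.8, 0.4]

stack = confusion_stack(true_labels, predicted_labels, certainties, 2, [0.0, 0.5])
for row in stack:
    print(row["threshold"], row["accepted"], row["rejected"], normalize_by_accepted(row))
```

Plotting the entries of each row as stacked segments over the threshold (or the resulting acceptance rate) shows which classes are rejected and which confusions remain; the order of the segments and the choice of normalization are the kind of options the variants in Figure 2 refer to.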