Overcoming Common Flaws in the Evaluation of Selective Classification Systems

Jeremias Traub; Till J. Bungert; Carsten T. Lüth; Michael Baumgartner; Klaus H. Maier-Hein; Lena Maier-Hein; Paul F Jaeger

TL;DR

This work defines 5 requirements for multi-threshold metrics in selective classification, covering task alignment, interpretability, and flexibility, and shows how current approaches fail to meet them. It proposes the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures.

Abstract

Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning-based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve ($\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes method rankings on 5 out of the 6 data sets.
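To make the difference between the two risk notions concrete, the following is a minimal sketch, not the authors' implementation. It assumes that Generalized Risk at a given coverage equals Selective Risk multiplied by that coverage (as described in the caption of Figure 5), and it approximates both areas by a plain average over all rejection thresholds. The function names `risk_coverage_curves` and `aurc_and_augrc` and the five-prediction toy data are hypothetical.

```python
import numpy as np


def risk_coverage_curves(confidence, is_failure):
    """Return coverage, Selective Risk, and Generalized Risk per threshold.

    Predictions are ranked by descending confidence. At the k-th threshold
    the k most confident predictions are accepted and the rest rejected.
    """
    confidence = np.asarray(confidence, dtype=float)
    is_failure = np.asarray(is_failure, dtype=float)
    order = np.argsort(-confidence, kind="stable")
    failures_sorted = is_failure[order]

    n = len(failures_sorted)
    accepted = np.arange(1, n + 1)            # number of accepted predictions
    cum_failures = np.cumsum(failures_sorted)  # failures among accepted ones

    coverage = accepted / n
    selective_risk = cum_failures / accepted   # failures per accepted case
    generalized_risk = cum_failures / n        # failures per case overall
    return coverage, selective_risk, generalized_risk


def aurc_and_augrc(confidence, is_failure):
    """Approximate AURC and AUGRC as the mean risk over all thresholds.

    Assumption: a simple Riemann-sum approximation, chosen for clarity.
    """
    _, selective_risk, generalized_risk = risk_coverage_curves(
        confidence, is_failure
    )
    return selective_risk.mean(), generalized_risk.mean()


# Toy data (hypothetical): five predictions with a single failure.
confidence = [0.9, 0.8, 0.7, 0.6, 0.5]
failure_ranked_first = [1, 0, 0, 0, 0]  # the failure has the highest confidence
failure_ranked_last = [0, 0, 0, 0, 1]   # the failure has the lowest confidence

aurc_first, augrc_first = aurc_and_augrc(confidence, failure_ranked_first)
aurc_last, augrc_last = aurc_and_augrc(confidence, failure_ranked_last)
print(f"failure ranked first: AURC={aurc_first:.4f}, AUGRC={augrc_first:.4f}")
print(f"failure ranked last:  AURC={aurc_last:.4f}, AUGRC={augrc_last:.4f}")
```

In this toy setting, moving the single failure from the last to the first ranking position raises the approximate AUGRC by a factor of five (from 0.04 to 0.2), in proportion to the number of thresholds at which the failure goes undetected. The approximate AURC rises by more than a factor of ten, which illustrates the extra weight that Selective Risk places on high-confidence failures.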


Paper Structure (21 sections, 11 equations, 7 figures, 4 tables)


Introduction
Refined Task Formulation
Evaluating SC systems in applied settings
Evaluating SC systems for method development and benchmarking
Requirements for Selective Classification multi-threshold metrics
Current multi-threshold metrics in SC do not fulfill requirements R1-R5
Area under the Generalized Risk Coverage Curve
Empirical Study
Comparing Method Rankings of AUGRC and AURC
Showcasing Implications of AURC Shortcomings on Real-world Data
Conclusion
Appendix
Technical Details
AURC Failure Contribution
Relationship between AUGRC and failure AUROC
...and 6 more sections

Figures (7)

Figure 1: The AUGRC metric based on Generalized Risk overcomes common flaws in current evaluation of Selective Classification (SC). a) Refined task definition for SC. Analogously to standard classification, we distinguish between holistic evaluation for method development and benchmarking using multi-threshold metrics versus evaluation of specific application scenarios at pre-determined working points. The current most prevalent multi-threshold metric in SC, $\mathrm{AURC}$, is based on Selective Risk, a concept for working point evaluation that is not suitable for aggregation over rejection thresholds (red arrow). To fill this gap, we formulate the new concept of Generalized Risk and a corresponding metric, $\mathrm{AUGRC}$ (green arrow). b) We formalize our perspective on SC evaluation by identifying five key requirements for multi-threshold metrics and analyze how previous metrics fail to fulfill them. Abbreviations: CSF, Confidence Scoring Function; AU(G)RC, Area Under the (Generalized) Risk Coverage curve.
Figure 2: The proposed AUGRC metric resolves shortcomings of AURC. All figures are based on rankings of predictions according to descending associated confidence scores induced by a CSF. All $\mathrm{AURC}$, $\mathrm{e\text{-}AURC}$, and $\mathrm{AUGRC}$ values are scaled by $\times1000$. a) shows the contribution of an individual failure case to the $\mathrm{AURC}$ and $\mathrm{AUGRC}$ metrics depending on its ranking position (for technical details, see the section "AURC Failure Contribution"). While $\mathrm{AUGRC}$ reflects the intuitive behavior of weighting failure cases in proportion to their ranking position, the $\mathrm{AURC}$ puts excessive weight on high-confidence failures. b-d) Toy example of three CSFs ranking 20 predictions to show how $\mathrm{AUGRC}$ resolves the broken monotonicity requirement (R2) of $\mathrm{AURC}$. Although CSF-1 and CSF-2 have equal $\mathrm{AUROC_f}$ and equal $\mathrm{acc}$, the $\mathrm{AURC}$ improves from CSF-1 to CSF-2. The $\mathrm{AURC}$ even improves for CSF-3, which features lower $\mathrm{AUROC_f}$ and lower $\mathrm{acc}$ compared to CSF-1. e-f) The corresponding risk-coverage curves reveal that the non-intuitive behavior of $\mathrm{AURC}$ is due to the excessive effect of the high-confidence failure of CSF-1 on the Selective Risk, which is resolved in the Generalized Risk.
Figure 3: Substantial differences in method rankings for AUGRC and AURC. On 5 out of 6 datasets, the top-3 CSFs (out of 13 compared methods) change when employing the proposed $\mathrm{AUGRC}$ instead of $\mathrm{AURC}$. This demonstrates the practical relevance of the $\mathrm{AUGRC}$ metric for Selective Classification evaluation. CSFs are color-coded and sorted from top (best) to bottom (worst) by average rank based on $500$ bootstrap samples from the test dataset to ensure ranking stability. Ranking changes are reflected in changes in the color sequence and highlighted by red arrows. We assess the stability of the method rankings for each metric individually using one-sided Wilcoxon signed-rank tests based on the bootstrap samples at $5\%$ significance level with adjustment for multiple testing according to Holm. Adjacent to each ranking, we present the resulting significance maps for the pairwise CSF comparisons. These maps can be interpreted as follows: At each grid position $(x,y)$, filled entries indicate that metric values of CSF $y$ are ranked significantly better than those from CSF $x$ (across bootstrap samples), cross-marks indicate no significant superiority. An ideal ranking exhibits only filled entries above the diagonal.
Figure 4: The conceptual shortcomings of AURC affect method assessment in practice. We illustrate the practical effects of the excessive weight on high-confidence failures in $\mathrm{AURC}$ by comparing the performance of two CSFs, DG-Res and MCD-PE, on the CIFAR-10 test dataset. (a) shows the coverage curves based on Selective Risk and Generalized Risk for both CSFs. The $\mathrm{AURC}$ violates the monotonicity requirement (R2) in practice, favoring MCD-PE despite a lower classification performance and ranking quality compared to DG-Res. (b) displays the images associated with the top-$k$ high-confidence failures. For DG-Res, the four failures correspond to the first four peaks in the Selective Risk curve, up to $\text{coverage}\approx0.27$ (the total number of failures is 446). Only a few high-confidence failures significantly increase the $\mathrm{AURC}$. For both CSFs, the images associated with high-confidence failures exhibit high label ambiguity or are incorrectly labeled, indicating that the $\mathrm{AURC}$ may amplify the influence of label noise in practice. $\mathrm{AURC}$ and $\mathrm{AUGRC}$ values are scaled by $\times 1000$.
Figure 5: Visualization of the relationship between AUGRC and AUROC$_\text{f}$. (a) The Selective Risk curve can be transformed into the Generalized Risk curve via multiplication by the respective coverages. The resulting curve is monotonically increasing and bounded by the diagonal; decreasing Selective Risk corresponds to a plateau in Generalized Risk. The $\mathrm{AUGRC}$ corresponds to the $\mathrm{AUGRC}$ of an optimal CSF (shaded red) plus the re-scaled $\mathrm{AUROC_f}$ (shaded in green). The $\mathrm{AUROC_f}$ corresponds to the fraction of the area (parallelogram) enclosed by the green dashed line that lies above the Generalized Risk curve. (b) $\mathrm{AUGRC}$ (color-coded) and its negative gradients (arrows) in the Accuracy-$\mathrm{AUROC_f}$ space.
...and 2 more figures
