From Misclassifications to Outliers: Joint Reliability Assessment in Classification

Yang Li, Youyang Sha, Yinzhi Wang, Timothy Hospedales, Xi Shen, Shell Xu Hu, Xuanlong Yu

TL;DR

A unified evaluation framework that integrates OOD detection and failure prediction is proposed, quantified by new metrics DS-F1 and DS-AURC, where DS denotes double scoring functions, and a new benchmark for trustworthy classification is established.

Abstract

Building reliable classifiers is a fundamental challenge for deploying machine learning in real-world applications. A reliable system should not only detect out-of-distribution (OOD) inputs but also anticipate in-distribution (ID) errors by assigning low confidence to potentially misclassified samples. Yet, most prior work treats OOD detection and failure prediction as separate problems, overlooking their close connection. We argue that reliability requires evaluating them jointly. To this end, we propose a unified evaluation framework that integrates OOD detection and failure prediction, quantified by our new metrics DS-F1 and DS-AURC, where DS denotes double scoring functions. Experiments on the OpenOOD benchmark show that double scoring functions yield classifiers that are substantially more reliable than traditional single scoring approaches. Our analysis further reveals that OOD-based approaches provide notable gains under simple or far-OOD shifts, but only marginal benefits under more challenging near-OOD conditions. Beyond evaluation, we extend the reliable classifier SURE and introduce SURE+, a new approach that significantly improves reliability across diverse scenarios. Together, our framework, metrics, and method establish a new benchmark for trustworthy classification and offer practical guidance for deploying robust models in real-world settings. The source code is publicly available at https://github.com/Intellindust-AI-Lab/SUREPlus.

Paper Structure (24 sections, 2 theorems, 32 equations, 2 figures, 7 tables, 2 algorithms)

Key Result

Lemma 1

For any function $g: \mathcal{X} \to \mathbb{R}$ and any $x_0 \in \mathcal{X}$, it holds that

Figures (2)

  • Figure 1: Joint evaluation of ID and OOD. a) Two ResNet-18 models were trained on CIFAR-100, one with cross-entropy (CE) and one with CutMix. On the ID metrics (the first two columns, on CIFAR-100), CE outperforms CutMix, while on OOD detection (the third column, on MNIST), CutMix is better. Given this conflict, previous evaluation systems would struggle to give a sensible ranking of the two methods. Our new metrics (DS-F1 and DS-AURC, the last two columns) confirm that CE works better than CutMix in a joint ID and OOD evaluation. Panel b) gives an intuitive explanation: the F1 score surface of CE lies above the surface of CutMix for most $(\tau_{\rm ID}, \tau_{\rm OOD})$ pairs, including at their best F1 scores.
  • Figure 2: Example of DS-AURC. Selective risk versus coverage for double scoring is shown in green. DS-AURC uses the best risk in each coverage bin (orange, in Eq. \ref{eq:dual-aurc-prob}), resulting in a lower (better) AURC. Results with ResNet-18 on CIFAR-100 (ID) and CIFAR-10 (OOD), using VIM [wang2022vim] for the OOD score and MSP [hendrycks2016baseline] for the ID score.
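
The two captions can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's reference implementation (the exact definitions are in the paper's equations, which are not reproduced here). It assumes a sample is accepted only when both scores clear their thresholds, that the "positive" class is the set of correctly classified ID samples, that DS-F1 is the best F1 over a grid of $(\tau_{\rm ID}, \tau_{\rm OOD})$ pairs, and that DS-AURC averages the lowest risk found in each coverage bin. The function names `ds_grid`, `ds_f1`, `ds_aurc`, the quantile-based threshold grid, and the bin count are all hypothetical choices.

```python
import numpy as np


def ds_grid(s_id, s_ood, correct, n_thresholds=50):
    """Count accepted samples and accepted-and-correct samples per threshold pair.

    s_id, s_ood: per-sample confidence scores (higher means "keep the sample").
    correct: True only for ID samples the classifier got right; misclassified
             ID samples and all OOD samples are False (assumed convention).
    """
    s_id = np.asarray(s_id, dtype=float)
    s_ood = np.asarray(s_ood, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Assumed grid: thresholds placed at evenly spaced quantiles of each score.
    qs = np.linspace(0.0, 1.0, n_thresholds)
    taus_id = np.quantile(s_id, qs)
    taus_ood = np.quantile(s_ood, qs)
    # accept[i, j, n] is True when sample n passes both tau_id[i] and tau_ood[j].
    accept = (s_id[None, None, :] >= taus_id[:, None, None]) & (
        s_ood[None, None, :] >= taus_ood[None, :, None]
    )
    tp = (accept & correct).sum(axis=-1)
    n_acc = accept.sum(axis=-1)
    return tp, n_acc, int(correct.sum()), correct.size


def ds_f1(s_id, s_ood, correct, n_thresholds=50):
    """Best F1 over the threshold grid (the peak of the F1 surface)."""
    tp, n_acc, n_pos, _ = ds_grid(s_id, s_ood, correct, n_thresholds)
    # F1 = 2TP / (2TP + FP + FN) = 2TP / (n_accepted + n_positive).
    f1 = 2.0 * tp / np.maximum(n_acc + n_pos, 1)
    return float(f1.max())


def ds_aurc(s_id, s_ood, correct, n_thresholds=50, n_bins=20):
    """Mean of the lowest selective risk found in each non-empty coverage bin."""
    tp, n_acc, _, n = ds_grid(s_id, s_ood, correct, n_thresholds)
    keep = n_acc.ravel() > 0  # risk is undefined when nothing is accepted
    coverage = (n_acc.ravel() / n)[keep]
    risk = (1.0 - tp.ravel() / np.maximum(n_acc.ravel(), 1))[keep]
    bins = np.minimum((coverage * n_bins).astype(int), n_bins - 1)
    best = [risk[bins == b].min() for b in range(n_bins) if np.any(bins == b)]
    return float(np.mean(best))


# Toy data: two correct ID samples, one misclassified ID sample (low ID score),
# and one OOD sample that looks confident to the ID score but not the OOD score.
s_id = np.array([0.90, 0.80, 0.20, 0.95])
s_ood = np.array([0.90, 0.90, 0.80, 0.10])
correct = np.array([True, True, False, False])
print(ds_f1(s_id, s_ood, correct), ds_aurc(s_id, s_ood, correct))
```

In the toy data neither score alone can separate the two correct samples from both failure types, while some threshold pair can, which is the situation the double scoring setup is meant to capture.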

Theorems & Definitions (4)

  • Lemma 1: Property of minima
  • Lemma 2: Property of maxima under set inclusion
  • Proof A.1
  • Proof A.2