Selective Classification Under Distribution Shifts

Hengyue Liang; Le Peng; Ju Sun

TL;DR

Generalized selective classification, an SC framework that accounts for distribution shifts, is proposed. It covers label-shifted and covariate-shifted samples in addition to typical in-distribution samples, and is the first of its kind in the SC literature.

Abstract

In selective classification (SC), a classifier abstains from making predictions that are likely to be wrong, so as to avoid excessive errors. For deploying imperfect classifiers in high-stakes scenarios (whether the imperfection stems from intrinsic statistical noise in the data, robustness issues of the classifier, or other causes), SC appears to be an attractive and necessary path to follow. Despite decades of research in SC, most previous SC methods still focus only on the ideal statistical setting, i.e., the data distribution at deployment is the same as that of training, although practical data can come from the wild. To bridge this gap, we propose an SC framework that takes distribution shifts into account, termed generalized selective classification. It covers label-shifted (or out-of-distribution) and covariate-shifted samples in addition to typical in-distribution samples, and is the first of its kind in the SC literature. We focus on non-training-based confidence-score functions for generalized SC on deep learning (DL) classifiers, and propose two novel margin-based score functions. Through extensive analysis and experiments, we show that our proposed score functions are more effective and reliable than the existing ones for generalized SC on a variety of classification tasks and DL classifiers. Code is available at https://github.com/sun-umn/sc_with_distshift.
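
To make the abstain-or-predict mechanism concrete, the following is a minimal sketch of confidence-based SC using a margin on raw logits. It is an illustration under stated assumptions, not the authors' implementation: the score here is simply the gap between the two largest raw logits, and the names `margin_score`, `selective_predict`, the threshold value, and the use of `-1` to mark abstention are all hypothetical choices made for this example. The exact definitions of the two proposed margin-based scores are given in the paper and its code repository.

```python
import numpy as np

def margin_score(logits):
    """Gap between the largest and second-largest raw logit in each row.

    Assumption: this is a generic logit margin used for illustration only.
    """
    top2 = np.sort(logits, axis=1)[:, -2:]  # two largest per row, ascending
    return top2[:, 1] - top2[:, 0]

def selective_predict(logits, threshold):
    """Predict the arg-max class, or abstain (-1) when the score is too low."""
    preds = np.argmax(logits, axis=1)
    scores = margin_score(logits)
    return np.where(scores >= threshold, preds, -1)

# Hypothetical logits for three samples and three classes.
logits = np.array([[4.0, 1.0, 0.5],
                   [2.0, 1.9, 0.1],
                   [0.2, 0.1, 3.0]])
decisions = selective_predict(logits, threshold=1.0)
print(decisions)
```

In this toy input the second sample has two nearly tied logits, so it falls below the threshold and is rejected, while the other two samples are classified. In practice the threshold would be chosen to hit a target coverage or risk level on held-out data.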

Paper Structure (41 sections, 2 theorems, 35 equations, 10 figures, 8 tables, 2 algorithms)

Introduction
Technical background and related work
Selective classification (SC)
Prior work in SC
Training-based scores
Manually designed (non-training-based) scores
SC under distribution shifts: generalized SC
Out-of-distribution (OOD) detection as a weak form of generalized SC
Other related concepts
Prior work on SC with distribution shifts
Evaluation of generalized SC
A few words on implementing the confidence-based SC algorithm in practice
Our method---margins as confidence scores for generalized SC
Scale sensitivity of SR-based scores
A quick numerical experiment
...and 26 more sections

Key Result

Lemma 3.1

Consider the raw logits $\boldsymbol z$, and without loss of generality assume that they are sorted in descending order without any ties, i.e., $z^{(1)} > z^{(2)} > \cdots$. We have that as $\lambda \to \infty$, where $\sim$ means asymptotic equivalence. In particular, all the asymptotic functions increase monotonically with respect to $z^{(1)} - z^{(2)}$.
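
The role of the top-two logit gap under scaling can be checked numerically for the maximum softmax response. The sketch below relies only on an elementary softmax property, which is not necessarily the exact expression stated in the lemma: with no ties, $1 - \max_i \mathrm{softmax}(\lambda \boldsymbol z)_i$ behaves like $e^{-\lambda (z^{(1)} - z^{(2)})}$ as $\lambda \to \infty$. The logit values, the scale factors, and the helper name `sr_max` are assumptions made for illustration.

```python
import numpy as np

def sr_max(logits):
    """Maximum softmax response, computed with the usual max-shift for stability."""
    shifted = logits - np.max(logits)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return np.max(probs)

# Hypothetical logits, already sorted in descending order with no ties.
z = np.array([2.0, 1.5, 0.0])
gap = z[0] - z[1]

ratios = {}
for lam in [1.0, 4.0, 16.0, 32.0]:
    residual = 1.0 - sr_max(lam * z)               # distance of the score from 1
    ratios[lam] = residual / np.exp(-lam * gap)    # tends to 1 if the asymptotic holds
    print(lam, residual, ratios[lam])
```

As the scale grows, the score saturates toward 1 and the printed ratio approaches 1, so for large scales the ranking of samples by this score is governed by the gap $z^{(1)} - z^{(2)}$ alone.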

Figures (10)

Figure 1: Visualization of the normalized AURC-$\alpha$---the area in blue divided by the coverage value $\alpha$.
Figure 2: RC curves for (b) $SR_{\text{max}}$, (c) $SR_{\text{doctor}}$, and (d) $SR_{\text{ent}}$, calculated from raw logits scaled by factors of $0.1$, $1.0$, $2.0$, and $4.0$, respectively, where the logits come from the optimal $4$-class linear classifier on the data shown in (a). The RC curves for $RL_{\text{conf-M}}$ and $s_{\text{post}}$ are also plotted for reference, where $RL_{\text{conf-M}}$ is one of our proposed confidence-score functions.
Figure 3: Further analysis of the numerical example from the earlier subsection on SR-based scores. Case 1, Case 2, and Case 3 correspond to the original dataset of that example, the dataset after small perturbations, and the dataset after substantial perturbations, respectively. Here, (a-)'s are the RC curves achieved by different selection scores; (b-)'s are visualizations of the samples (one color per class), decision boundaries (dashed blue line), and the rejected samples (black crosses) at coverage $0.8$ by $RL_{\text{geo-M}}$; (c-)'s visualize the rejected samples (black crosses) at coverage $0.8$ by $SR_{\text{max}}$; and (d-)'s present the histograms of the robustness radius of the samples selected by each score function.
Figure 4: RC curves of different confidence-score functions on the model EVA for ImageNet. (a)-(d) are RC curves evaluated using samples from (a) In-D samples only, (b) In-D and covariate-shifted samples only, (c) In-D and label-shifted samples only, and (d) all samples, respectively. We group the curves by whether they are originally proposed for SC setups (solid lines) or for OOD detection (dashed lines).
Figure 5: RC curves of different confidence-score functions on the model FLYP for iWildCam and the model LISA for Amazon. (a) and (c) are RC curves evaluated using In-D samples only; (b) and (d) are RC curves evaluated using both In-D and covariate-shifted samples.
...and 5 more figures
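
The risk-coverage (RC) curves and the normalized AURC-$\alpha$ that appear in the captions above can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: it assumes a 0/1 error per sample, builds the RC curve by accepting samples in order of decreasing confidence, and approximates the area up to coverage $\alpha$ with a simple Riemann sum of step $1/n$ before dividing by $\alpha$. The function names `rc_curve` and `normalized_aurc` and the toy data are hypothetical.

```python
import numpy as np

def rc_curve(scores, errors):
    """Coverage levels k/n and the selective risk on the k most confident samples."""
    order = np.argsort(-scores, kind="stable")   # most confident first
    sorted_errors = errors[order].astype(float)
    k = np.arange(1, len(scores) + 1)
    coverage = k / len(scores)
    risk = np.cumsum(sorted_errors) / k          # mean error among accepted samples
    return coverage, risk

def normalized_aurc(scores, errors, alpha=1.0):
    """Riemann-sum area under the RC curve on [0, alpha], divided by alpha."""
    coverage, risk = rc_curve(scores, errors)
    mask = coverage <= alpha + 1e-12             # tolerate floating-point round-off
    area = np.sum(risk[mask]) / len(scores)
    return area / alpha

# Toy data: the two most confident predictions are right, the other two are wrong.
scores = np.array([0.9, 0.8, 0.3, 0.1])
errors = np.array([0, 0, 1, 1])
coverage, risk = rc_curve(scores, errors)
print(coverage, risk)
print(normalized_aurc(scores, errors, alpha=1.0))
print(normalized_aurc(scores, errors, alpha=0.5))
```

A lower value indicates a better score function: a score that ranks all correct predictions above all incorrect ones keeps the selective risk at zero until coverage exceeds the classifier's accuracy.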

Theorems & Definitions (3)

Lemma 3.1
Lemma B.1
proof
