Predictor-Rejector Multi-Class Abstention: Theoretical Analysis and Algorithms

Anqi Mao; Mehryar Mohri; Yutao Zhong

TL;DR

The paper addresses multi-class abstention by developing a predictor-rejector framework with novel surrogate losses that come with strong non-asymptotic and realizable consistency guarantees. It establishes both single-stage and two-stage approaches, proving H-consistency bounds for three multiclass surrogates (mean absolute error, $\rho$-hinge, and $\rho$-margin) and showing realizable consistency under scaling-closed hypothesis sets. A central result is that score-based abstention can fail to recover Bayes optimal decisions in some settings, whereas the predictor-rejector approach yields Bayes-optimal solutions with tractable surrogate losses. Empirically, two-stage predictor-rejector methods outperform state-of-the-art score-based baselines on SVHN, CIFAR-10, and CIFAR-100, illustrating practical gains in abstention-aware classification. Overall, the work provides both theoretical guarantees and practical algorithms for robust multi-class abstention, addressing open questions in the literature and offering guidance for deploying abstention-aware models in real-world systems.

Abstract

We study the key framework of learning with abstention in the multi-class classification setting. In this setting, the learner can choose to abstain from making a prediction with some pre-defined cost. We present a series of new theoretical and algorithmic results for this learning problem in the predictor-rejector framework. We introduce several new families of surrogate losses for which we prove strong non-asymptotic and hypothesis set-specific consistency guarantees, thereby resolving positively two existing open questions. These guarantees provide upper bounds on the estimation error of the abstention loss function in terms of that of the surrogate loss. We analyze both a single-stage setting where the predictor and rejector are learned simultaneously and a two-stage setting crucial in applications, where the predictor is learned in a first stage using a standard surrogate loss such as cross-entropy. These guarantees suggest new multi-class abstention algorithms based on minimizing these surrogate losses. We also report the results of extensive experiments comparing these algorithms to the current state-of-the-art algorithms on CIFAR-10, CIFAR-100 and SVHN datasets. Our results demonstrate empirically the benefit of our new surrogate losses and show the remarkable performance of our broadly applicable two-stage abstention algorithm.
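As a minimal sketch of the quantity these guarantees target, the snippet below evaluates a predictor-rejector abstention loss on a handful of examples. It assumes the common convention that the rejector abstains when its score is non-positive and that abstaining incurs a fixed cost `cost`; the function names and the toy numbers are illustrative and do not come from the paper.

```python
from typing import Sequence


def abstention_loss(pred_label: int, reject_score: float,
                    true_label: int, cost: float) -> float:
    """Abstention loss for one example.

    Assumed convention: reject_score <= 0 means "abstain" (pay `cost`);
    reject_score > 0 means "accept" (pay 1 if the prediction is wrong).
    """
    if reject_score <= 0:
        return cost
    return 1.0 if pred_label != true_label else 0.0


def mean_abstention_loss(pred_labels: Sequence[int],
                         reject_scores: Sequence[float],
                         true_labels: Sequence[int],
                         cost: float) -> float:
    """Average abstention loss over a batch of examples."""
    losses = [
        abstention_loss(p, r, y, cost)
        for p, r, y in zip(pred_labels, reject_scores, true_labels)
    ]
    return sum(losses) / len(losses)


# Toy data (hypothetical): four examples, abstention cost 0.2.
preds = [0, 1, 2, 1]
scores = [0.8, -0.3, 0.5, 1.2]   # second example is rejected
labels = [0, 2, 1, 1]
avg = mean_abstention_loss(preds, scores, labels, cost=0.2)
```

Here the rejected example pays the cost 0.2 even though the predictor would have been wrong, while the third example is accepted and misclassified, so it pays 1. The loss is discontinuous in the rejector score, which is the usual motivation for optimizing a surrogate loss instead.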

Paper Structure (38 sections, 19 theorems, 96 equations, 1 figure, 2 tables)

Introduction
Preliminaries
Counterexample for score-based abstention losses
Predictor-rejector surrogate losses
Single-stage predictor-rejector surrogate losses
Two-stage predictor-rejector surrogate losses
Other advantages of the predictor-rejector formulation
Experiments
Conclusion
Related work
Remarks on some key results
Significance of two-stage formulation compared with single-stage losses
Difference between predictor-rejector and score-based formulations
Experimental details
Setup.
...and 23 more sections

Key Result

Theorem 1

Assume that ${\mathscr H}$ is symmetric and complete, and that ${\mathscr R}$ is complete. If there exists $x \in {\mathscr X}$ such that $\inf_{h \in {\mathscr H}} \mathbb{E}_y\left[\ell(h, X, y) \mid X = x\right] \neq \frac{\beta \Psi\left(1 - \max_{y \in {\mathscr Y}} p(x, y)\right)}{\alpha\,\ldots}$ …

Figures (1)

Figure 1: Counterexample for score-based abstention losses.

Theorems & Definitions (27)

Theorem 1: Negative result for single-stage surrogates
Theorem 2: $({\mathscr H}, {\mathscr R})$-consistency bounds for single-stage surrogates
Corollary 2: Excess error bounds for single-stage surrogates
Theorem 3: ${\mathscr R}$-consistency bounds for second-stage surrogates
Corollary 4
Theorem 5: $({\mathscr H}, {\mathscr R})$-consistency bounds for two-stage approach
Corollary 6
Definition 7
Theorem 8
Corollary 9
...and 17 more
