When Does Confidence-Based Cascade Deferral Suffice?

Wittawat Jitkrittum; Neha Gupta; Aditya Krishna Menon; Harikrishna Narasimhan; Ankit Singh Rawat; Sanjiv Kumar

TL;DR

This work analyzes when confidence-based cascade deferral is sufficient and when it is not. It derives the Bayes-optimal deferral rule for a two-model cascade and shows that confidence-based deferral can be suboptimal when the downstream model is a specialist, under label noise, or under distribution shift. To address these limitations, the authors introduce post-hoc deferral rules that approximate the optimal deferral using only the first model’s outputs, and they provide a finite-sample excess-risk analysis. Through experiments on ImageNet and CIFAR with various perturbations, they demonstrate that post-hoc deferral can outperform confidence-based approaches, especially in low-deferral regimes and under the described failure modes. The findings offer practical guidance on when to deploy confidence-based cascades and how to design data-driven post-hoc deferral strategies to tighten accuracy-cost trade-offs.

Abstract

Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers is invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
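A minimal sketch of the confidence-based rule described in the abstract, for a two-model cascade. The function names, the `threshold` parameter, and the convention that model 1 returns logits while model 2 returns a label are illustrative assumptions, not the paper's code:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x, model_1, model_2, threshold):
    """Two-model cascade with max-softmax deferral (hypothetical helper).

    model_1(x) returns logits; model_2(x) returns a predicted label.
    Returns (predicted_label, deferred).
    """
    probs = softmax(model_1(x))
    confidence = max(probs)
    if confidence >= threshold:
        # Model 1 is confident enough: terminate here.
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, False
    # Otherwise defer to the (more expensive) second model.
    return model_2(x), True
```

Note that the rule never looks at model 2: the decision depends only on model 1's confidence, which is the property the paper examines. Raising `threshold` increases the deferral rate and therefore the inference cost.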

Paper Structure (37 sections, 8 theorems, 32 equations, 11 figures, 1 table, 1 algorithm)

Introduction
Background and Related Work
Cascades and Deferral Rules
Confidence-based Cascades
Optimal Deferral Rules for Cascades
Optimisation Objective
The Bayes-Optimal Deferral Rule
Plug-in Estimators of the Bayes-Optimal Deferral Rule
Relation to Existing Work
From Confidence-Based to Post-Hoc Deferral
When Does Confidence-Based Deferral Suffice?
Post-Hoc Estimates of the Deferral Rule
Finite-Sample Analysis for Post-Hoc Deferral Rules
Relation to Existing Work
Experimental Illustration
...and 22 more sections

Key Result

Proposition 3.1

Let $\eta_{y'}(x)\stackrel{.}{=}\mathbb{P}(y'|x)$. Then, the Bayes-optimal deferral rule for the risk in (\ref{eq:oracle_risk_k2}) is:
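The display equation that completes this statement did not survive in this listing. The following is a hedged sketch, not the authors' exact statement. It assumes the risk charges a misclassification loss for whichever model ends up predicting, plus a fixed cost $c \geq 0$ per deferral; the risk form, the symbol $c$, and the name $r^*$ are assumptions not shown in this excerpt. Comparing the two conditional risks pointwise gives a rule that thresholds the difference in the probability of correct prediction under each model, which agrees with the description in the Figure 2 caption:

```latex
% Assumed risk of a deferral rule r (r(x) = 1 means "defer to h^{(2)}"):
\begin{align*}
R(r) &= \mathbb{P}\big(y \neq h^{(1)}(x),\, r(x) = 0\big)
      + \mathbb{P}\big(y \neq h^{(2)}(x),\, r(x) = 1\big)
      + c \cdot \mathbb{P}\big(r(x) = 1\big) \\
% Conditional risk at a fixed x for each choice:
\text{not deferring:}\quad & 1 - \eta_{h^{(1)}(x)}(x) \\
\text{deferring:}\quad & 1 - \eta_{h^{(2)}(x)}(x) + c \\
% Defer exactly when deferring has the smaller conditional risk:
r^*(x) &= \mathbf{1}\big[\, \eta_{h^{(2)}(x)}(x) - \eta_{h^{(1)}(x)}(x) > c \,\big]
\end{align*}
```

Under these assumptions, deferral pays off only when model 2's gain in the probability of a correct prediction exceeds the deferral cost. A rule based on $\eta_{h^{(1)}(x)}(x)$ alone ignores the first term of the difference.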

Figures (11)

Figure 1: Test accuracy vs deferral rate of plug-in estimates (\ref{sec:plug_in_rule}) for the oracle rule. Here, $h^{(1)}$ is a MobileNet V2 trained on all ImageNet classes, and $h^{(2)}$ is a dog specialist trained on all images in the dog synset plus a fraction of non-dog training examples, which we vary. As the fraction decreases, $h^{(2)}$ specialises in classifying different types of dogs. By considering the confidence of $h^{(2)}$ (Relative Confidence), one gains accuracy by selectively deferring only dog images.
Figure 2: Test accuracy vs deferral rate of the post-hoc approaches in \ref{tbl:post_hoc_estimates} under the three settings described in \ref{sec:chow-versus-oracle}: 1) specialist (row 1), 2) label noise (row 2), and 3) distribution shift (row 3). Row 1: As the fraction of non-dog training images decreases, model 2 becomes a dog specialist. The resulting increase in the non-uniformity of its error probabilities allows post-hoc approaches to learn to defer only dog images. Row 2: As label noise increases, the difference in the probability of correct prediction under each model tends to zero on affected inputs (i.e., both probabilities tend to chance level). Thus, it is sub-optimal to defer affected inputs, since model 2's correctness is also at chance level. Being oblivious to model 2, confidence-based deferral underperforms. For full details, see \ref{sec:chow_suffices}. Row 3: As the skewness of the label distribution increases, so does the difference in the probability of correct prediction under each model (recall the optimal rule in \ref{prop:oracle_k2}), and it becomes necessary to account for model 2's probability of correct prediction when deferring. Hence, confidence-based deferral underperforms.
Figure 3: Training and test accuracy of post-hoc approaches in the ImageNet-Dog specialist setting. Model 2 (EfficientNet B0) is trained with all dog images and 8% of non-dog images. Observe that a post-hoc model (i.e., Diff-01) can severely overfit to the training set and fail to generalise.
Figure 4: Calibration plots visualising the empirical probability of the event $h_1(x) \neq y \land h_2(x) = y$, i.e., the first model predicts incorrectly and the second model predicts correctly. In settings where confidence-based deferral performs poorly, the first model's confidence is a poor predictor of whether the second model makes a correct prediction: the likelihood of this event tends to be systematically over-estimated.
Figure 5: Test accuracy vs deferral rate of the post-hoc approaches (Diff-01, Diff-Prob), confidence thresholding (Confidence), and entropy thresholding (Entropy). We considered the Mini-ImageNet dataset (with noise rate 60%) from jiang2020. The two base models are ResNet 10 models with widths 16 (small model) and 64 (large model). Consistent with our analysis, confidence-based deferral underperforms in the presence of label noise.
...and 6 more figures
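
The "Relative Confidence" rule named in the Figure 1 caption can be sketched as follows, under the assumption that it compares the two models' maximum softmax probabilities against a cost. The function names and the `cost` parameter are hypothetical:

```python
import math


def max_softmax(logits):
    """Maximum softmax probability of a list of logits (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def relative_confidence_defer(logits_1, logits_2, cost):
    """Defer when model 2's confidence exceeds model 1's by more than `cost`.

    Assumption: each model's max-softmax probability stands in for its
    probability of predicting correctly.
    """
    return max_softmax(logits_2) - max_softmax(logits_1) > cost
```

Because this check needs model 2's logits for every input, it removes the cost saving a cascade is meant to provide. It is better read as a reference point for the oracle rule than as a deployable deferral rule.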

Theorems & Definitions (15)

Proposition 3.1
Corollary 3.2
Lemma 4.1
Lemma 4.2
Proof of Proposition \ref{prop:oracle_k2}
Proof of \ref{corr:excess-risk}
Lemma A.1
Proof
Proof of \ref{lemm:when-confidence-suffice}
Lemma A.2
...and 5 more
