Hierarchical Selective Classification

Shani Goren; Ido Galil; Ran El-Yaniv

TL;DR

Hierarchical selective classification (HSC) extends selective classification to hierarchies, enabling predictions to be made at varying levels of specificity based on uncertainty. The approach formalizes hierarchical risk and coverage, introduces hierarchical risk-coverage curves, and develops hierarchical inference rules (notably Climbing) paired with an optimal-threshold algorithm that guarantees a user-specified accuracy with high probability using a calibration set. Empirical results on over 1,100 ImageNet models and iNat21 models show substantial improvements in hAURC when leveraging hierarchy-aware predictions, with CLIP-based regimes and large-scale pretraining delivering the largest gains; hierarchical calibration also improves. The work situates HSC as a practical, post-hoc method that improves risk control and interpretability in hierarchical classification tasks, with future directions exploring alternative confidence scores and selective hierarchical training.
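
As a rough illustration of the Climbing rule named above, the following Python sketch follows the description given in the Figure 1 caption: internal node scores are the sums of their descendant leaves' scores, and the prediction climbs from the top-scoring leaf toward the root until the score reaches the threshold. The toy hierarchy, the class names other than Golden Retriever, and every probability except 0.29 are invented for illustration. `PARENT`, `node_scores`, and `climb` are hypothetical names, not the authors' code.

```python
# Toy hierarchy (hypothetical): child -> parent; the root has parent None.
PARENT = {
    "animal": None,
    "dog": "animal",
    "cat": "animal",
    "golden_retriever": "dog",
    "labrador": "dog",
    "tabby": "cat",
    "siamese": "cat",
}


def node_scores(leaf_probs, parent=PARENT):
    """Score of each node = sum of the softmax scores of its descendant leaves."""
    scores = {node: 0.0 for node in parent}
    for leaf, prob in leaf_probs.items():
        node = leaf
        while node is not None:  # add the leaf's mass to itself and each ancestor
            scores[node] += prob
            node = parent[node]
    return scores


def climb(leaf_probs, theta, parent=PARENT):
    """Start at the top-scoring leaf and move up until the score reaches theta."""
    scores = node_scores(leaf_probs, parent)
    node = max(leaf_probs, key=leaf_probs.get)
    # The root is always accepted, so the loop stops there at the latest.
    while scores[node] < theta and parent[node] is not None:
        node = parent[node]
    return node, scores[node]


# Invented leaf scores; only the 0.29 echoes the Figure 1 caption.
leaf_probs = {
    "golden_retriever": 0.29,
    "labrador": 0.26,
    "tabby": 0.25,
    "siamese": 0.20,
}

for theta in (0.2, 0.5, 0.9):
    print(theta, climb(leaf_probs, theta))
```

With these toy numbers, a threshold at or below 0.29 keeps the leaf prediction, while a higher threshold trades specificity for confidence by returning an ancestor.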

Abstract

Deploying deep neural networks for risk-sensitive tasks necessitates an uncertainty estimation mechanism. This paper introduces hierarchical selective classification, extending selective classification to a hierarchical setting. Our approach leverages the inherent structure of class relationships, enabling models to reduce the specificity of their predictions when faced with uncertainty. In this paper, we first formalize hierarchical risk and coverage, and introduce hierarchical risk-coverage curves. Next, we develop algorithms for hierarchical selective classification (which we refer to as "inference rules"), and propose an efficient algorithm that guarantees a target accuracy constraint with high probability. Lastly, we conduct extensive empirical studies on over a thousand ImageNet classifiers, revealing that training regimes such as CLIP, pretraining on ImageNet21k and knowledge distillation boost hierarchical selective performance.
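
To make the target-accuracy idea concrete, here is a simplified Python sketch that picks a Climbing threshold from a calibration set using a plain empirical quantile. It is a stand-in under stated assumptions, not the paper's algorithm: it carries no high-probability guarantee and applies no finite-sample correction. Each calibration sample is represented, hypothetically, by its leaf-to-root path scores together with flags marking which nodes on that path are hierarchically correct. All function names and data are invented.

```python
import math


def min_correct_threshold(path_scores, path_correct):
    """Largest score among the wrong nodes on the leaf-to-root path.

    Climbing (stop at the first node whose score >= theta) is correct exactly
    when theta is strictly above this value; -inf means the leaf is correct.
    """
    wrong = [s for s, ok in zip(path_scores, path_correct) if not ok]
    return max(wrong) if wrong else -math.inf


def accuracy_at(calibration, theta):
    """Fraction of calibration samples that Climbing gets right at theta."""
    limits = [min_correct_threshold(s, c) for s, c in calibration]
    return sum(theta > t for t in limits) / len(limits)


def calibrate_threshold(calibration, target_accuracy):
    """Smallest theta whose calibration accuracy reaches target_accuracy."""
    limits = sorted(min_correct_threshold(s, c) for s, c in calibration)
    k = max(1, math.ceil(target_accuracy * len(limits)))  # samples needed correct
    cutoff = limits[k - 1]
    if cutoff == -math.inf:
        return 0.0  # leaf predictions alone already reach the target
    # theta must be strictly above the cutoff (math.nextafter: Python 3.9+).
    return math.nextafter(cutoff, math.inf)


# Invented calibration data: (scores from leaf to root, is each node correct?).
calibration = [
    ([0.29, 0.55, 1.0], [False, True, True]),
    ([0.80, 0.90, 1.0], [True, True, True]),
    ([0.40, 0.45, 1.0], [False, False, True]),
    ([0.60, 0.70, 1.0], [True, True, True]),
]

theta = calibrate_threshold(calibration, 0.75)
print(theta, accuracy_at(calibration, theta))
```

The sketch relies on node scores being non-decreasing along the path to the root, which holds when each internal score is the sum of its descendant leaves' scores.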

Paper Structure (20 sections, 4 equations, 7 figures, 6 tables, 5 algorithms)

Introduction
Problem Setup
Hierarchical Selective Inference Rules
Optimal Selective Threshold Algorithm
Experiments
Related Work
Concluding Remarks
Acknowledgments
Hierarchical Severity Risk
Max-Coverage Inference Rule Algorithm
Jumping Inference Rule
Hierarchy Traversal for Threshold Finding
Proof of Theorem 1
Comparison of Inference Rules Without Temperature Scaling
Threshold Algorithm Results on iNat21
...and 5 more sections

Figures (7)

Figure 1: A detailed example of HSC for the output of ViT-L/16-384 on a specific sample. The base classifier outputs leaf softmax scores, with internal node scores being the sum of their descendant leaves' scores, displayed in parentheses next to each node. The base classifier incorrectly classifies the image as a 'Golden Retriever' with low confidence. A selective classifier can either make the same incorrect leaf prediction if the confidence threshold is below 0.29, or reject the sample. A hierarchical selective classifier with the Climbing inference rule (see Section \ref{sec:inference_rules}) climbs the path from the predicted leaf to the root until the confidence threshold $\theta$ is met. Setting $\theta$ above 0.29 yields a hierarchically correct prediction, with smaller $\theta$ values increasing the coverage. An algorithm for determining the optimal threshold is introduced in Section \ref{sec:threshold_alg}.
Figure 2: (\ref{fig:rc_curve}) Hierarchical RC curve of a ViT-L/16-384 model trained on ImageNet1k, evaluated with the 0/1 loss as the risk and softmax response as its confidence function $\kappa$. The purple shaded area represents the area under the RC curve (hAURC). Full coverage occurs when the model accepts all leaf predictions, for which the risk is 0.13. Increasing the confidence threshold leads to the rejection of more samples. For example, when the threshold is 0.77, the risk is 0.04, with coverage 0.8. (\ref{fig:inf_rules_curves}) Hierarchical RC curves of different inference rules with EVA-L/14-196 \cite{eva} as the base classifier. When the coverage is 1.0, all inference rules predict leaves. Each inference rule achieves a different trade-off, resulting in distinct curves. This example represents the prevalent case, where the "hierarchically-ignorant" selective inference rule performs the worst and Climbing outperforms Max-Coverage (MC).
Figure 3: Individual model examples comparing the hierarchical selective threshold algorithm against DARTS, with each algorithm repeated 1,000 times. The mean and median results are shown in dark green. The light green area shows the $\epsilon$ interval around the target accuracy, and the remaining area is marked in red (i.e., each repetition has a $1-\delta$ probability of being in the green area and a $\delta$ probability of being in the red area). The target accuracy is 95% and $1-\delta=0.9$. In both examples, the target accuracy error of DARTS is high, and the entirety of its accuracy distribution lies outside of the confidence interval. Left: EVA-Giant/14 \cite{eva}. DARTS fails to meet the constraint, whereas our algorithm's mean accuracy is very close to the target. Right: ResNet-152 \cite{resnetsb}. While our algorithm has a near-perfect mean accuracy, DARTS rejects all samples, resulting in zero coverage.
Figure 4: Comparison of different methods by their improvement in hAURC, relative to the same model’s performance without the method. The number of models evaluated for each method: knowledge distillation: 42, pretraining: 61, CLIP: 16, semi-supervised learning: 11, adversarial training: 8.
Figure 5: Aggregated (mean and SEM) CC curves of 1,115 ImageNet models.
...and 2 more figures
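
The risk-coverage trade-off described in the Figure 2 caption can be illustrated for the flat, non-hierarchical case with the Python sketch below. It sorts samples by confidence, accepts the top k at each step, and records the 0/1 selective risk at each coverage level; the area estimate is a simple mean over those levels. This is an assumption-based simplification: hierarchical coverage as formalized in the paper is not modelled, ties in confidence are ignored, and the function names and data are invented.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, selective 0/1 risk) points, most confident first."""
    # Accepting the k most confident samples corresponds to one threshold.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))
    return points


def aurc(points):
    """Mean selective risk over all coverage levels (a simple area estimate)."""
    return sum(risk for _, risk in points) / len(points)


# Invented confidences and correctness flags for five samples.
confidences = [0.95, 0.90, 0.80, 0.60, 0.30]
correct = [True, True, False, True, False]

points = risk_coverage_curve(confidences, correct)
for coverage, risk in points:
    print(f"coverage={coverage:.1f} risk={risk:.3f}")
print("AURC estimate:", round(aurc(points), 4))
```

A lower area means the confidence function ranks errors below correct predictions more reliably, which is the quantity the hAURC comparisons in the captions summarize for the hierarchical setting.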
