Typicalness-Aware Learning for Failure Detection

Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qingdong He, Xiaoling Wang, Jingyong Su

TL;DR

A metric is devised that quantifies the typicalness of each sample, enabling dynamic adjustment of the logit magnitude during training so that the problem of overconfidence can be mitigated.

Abstract

Deep neural networks (DNNs) often suffer from overconfidence: incorrect predictions are made with high confidence scores, which hinders their application in critical systems. In this paper, we propose a novel approach called Typicalness-Aware Learning (TAL) to address this issue and improve failure detection performance. We observe that, with the cross-entropy loss, model predictions are optimized to align with the corresponding labels by increasing the logit magnitude or refining the logit direction. For atypical samples, however, the image content and the label may be mismatched. This discrepancy can lead to overfitting on atypical samples, ultimately resulting in the overconfidence issue that we aim to address. To tackle the problem, we devise a metric that quantifies the typicalness of each sample, enabling dynamic adjustment of the logit magnitude during training. By allowing atypical samples to be adequately fitted while preserving a reliable logit direction, overconfidence can be mitigated. TAL has been extensively evaluated on benchmark datasets, and the results demonstrate its superiority over existing failure detection methods. Specifically, TAL achieves a more than 5% improvement on CIFAR100 in terms of the Area Under the Risk-Coverage Curve (AURC) compared to the state-of-the-art. Code is available at https://github.com/liuyijungoon/TAL.
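To make the evaluation metric named in the abstract concrete, the sketch below computes an empirical AURC from per-sample confidence scores and correctness flags. It follows the common empirical definition (selective risk averaged over all coverage levels) and is not the evaluation code from the TAL repository; the function name `aurc` and the toy inputs are assumptions made for illustration.

```python
def aurc(confidences, correct):
    """Empirical area under the risk-coverage curve (lower is better).

    Assumed definition: rank samples by confidence, then average the
    error rate of the top-k most confident samples over k = 1..n.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    # Rank samples from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    errors = 0
    risk_sum = 0.0
    for kept, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        # Selective risk when only the `kept` most confident samples are accepted.
        risk_sum += errors / kept
    return risk_sum / len(order)


# Toy data (hypothetical): the same predictions scored by two confidence rankings.
labels_ok = [True, True, True, False]
good = aurc([0.9, 0.8, 0.7, 0.2], labels_ok)  # the failure gets the lowest confidence
bad = aurc([0.2, 0.8, 0.7, 0.9], labels_ok)   # the failure gets the highest confidence
print(good, bad)
```

A confidence score that ranks failures last yields a lower AURC, which is why the metric suits failure detection: it rewards separating correct from incorrect predictions rather than raw accuracy alone.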

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of the motivation. We observe that directly aligning the predictions of atypical samples with the target label is not appropriate and causes overconfidence (horse with 95% confidence). Instead, the confidence should be aligned with human perception. During training, the cross-entropy loss increases the magnitude $\|\boldsymbol{f}\|$ of the logits and adjusts their direction towards the target (represented by the angle $\alpha$). For example, given an image of a human body with a horse head, the loss may optimize towards $\boldsymbol{f}_2$ in the blue box, which is not the ideal direction. It would be better to optimize towards $\boldsymbol{f}_1$, which is not biased towards either class, ensuring a more balanced representation and allowing a more accurate estimation of confidence.
  • Figure 2: The differences between closely related tasks. The blue curve represents the decision boundary, and the shaded area in the figure indicates incorrect predictions. (a) illustrates the objective of OoD-D tasks to reject predictions with semantic shifts and accept in-distribution predictions, without concern for predictions with covariate shifts. (b) shows the old setting of FD tasks, accepting correct in-distribution predictions and rejecting incorrect out-of-distribution predictions. (c) displays the new setting of FD tasks, accepting correct in-distribution predictions and correct predictions with covariate shifts, while rejecting incorrect in-distribution predictions, incorrect predictions with covariate shifts, and predictions with semantic shifts. (d) illustrates examples of OoD-D, Old FD, and New FD tasks. A classifier trained on CIFAR10 [Krizhevsky09] is evaluated on 6 images under a whole range of relevant distribution shifts. For instance, the 3rd and the 4th images in grayscale depict an airplane and a horse which encounter covariate shifts from that in the original CIFAR10. The 5th and the 6th images depict samples belonging to unseen categories with semantic shifts.
  • Figure 3: The framework of TAL. During training, statistical information (mean $\mu_j$ and variance $\sigma_j$) of features from correct predictions updates the Historical Features Queue (HFQ) at time-step t. The typicalness measure $\tau$ is calculated by comparing these statistics between the current batch and the HFQ. This $\tau$ influences the overall loss calculation, guiding the model to differentiate between atypical and typical samples. In the inference phase, TAL operates similarly to a model trained with conventional cross-entropy. Confidence is derived from the cosine similarity of the predicted logit direction, emphasizing our approach of using direction as a more reliable confidence metric. The framework distinguishes between typical (high $\tau$) and atypical (low $\tau$) samples, influencing the optimization process accordingly.
  • Figure 4: (a) and (b) are the ablation studies of $T_{\text{min}}$ and $T_{\text{max}}$; (c) is the ablation study on the length of the Historical Features Queue.
  • Figure 5: (a) Comparison of the Mean of Features between ID and OOD; (b) Comparison of different methods for measuring typicality; (c) The Risk-Coverage curves on old and new setting FD tasks; (d) Examples of typical and atypical samples.
  • ...and 1 more figures
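
The captions of Figures 1 and 3 separate logit magnitude from logit direction and describe confidence as derived from the direction. The sketch below illustrates why that distinction matters: scaling a logit vector raises its maximum softmax probability while leaving a direction-based score unchanged. It is a minimal illustration, not the TAL implementation. In particular, reading "cosine similarity of the predicted logit direction" as the cosine between the logit vector and the one-hot vector of the predicted class is an assumption, and the function names and toy logits are hypothetical.

```python
import math


def softmax_confidence(logits):
    """Maximum softmax probability; depends on both magnitude and direction."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift by the max for stability
    return max(exps) / sum(exps)


def cosine_confidence(logits):
    """Cosine between the logits and the one-hot vector of the predicted class.

    Assumed reading of the direction-based confidence: max(f) / ||f||.
    """
    norm = math.sqrt(sum(x * x for x in logits))
    if norm == 0.0:
        raise ValueError("zero logit vector has no direction")
    return max(logits) / norm


logits = [2.0, 1.0, 0.5]            # toy logit vector f
scaled = [5.0 * x for x in logits]  # same direction, 5x the magnitude ||f||

print(softmax_confidence(logits), softmax_confidence(scaled))
print(cosine_confidence(logits), cosine_confidence(scaled))
```

Because the softmax score can be inflated by magnitude alone, a loss that fits atypical samples by growing $\|\boldsymbol{f}\|$ can produce high confidence without a better direction; a direction-only score is insensitive to that effect.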