Interpretable Failure Detection with Human-Level Concepts

Kien X. Nguyen, Tang Li, Xi Peng

TL;DR

This work tackles the problem of overconfident misclassifications in safety-critical settings by replacing category-level confidence signals with human-level concepts, enabling both reliable failure detection and interpretable explanations. The proposed ORCA framework, with two variants ORCA-B and ORCA-R, leverages a concept collection per category and ordinal ranking of concept activations to produce more faithful confidence estimates than Maximum Softmax Prediction. Empirical results across natural and satellite image benchmarks show substantial reductions in false positives (FPR@95TPR) and competitive AUROC and ACC, with ORCA-R often delivering the strongest failure-detection performance, especially in remote sensing. The approach additionally provides interpretable failure reasons via concept-level signals, facilitating debugging and improving safety in real-world deployments.

Abstract

Reliable failure detection is of paramount importance in safety-critical applications. Yet neural networks are known to produce overconfident predictions for misclassified samples. This remains an open problem because existing confidence score functions rely on category-level signals, the logits, to detect failures. This research introduces a strategy that leverages human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model's confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activations for the input image. Without bells and whistles, our method significantly reduces the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9% on EuroSAT.
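
To make the baseline and the reported metric concrete, the following is a minimal sketch of a logit-based confidence score (the maximum softmax value) and of a false positive rate at 95% true positive rate. It assumes the common failure-detection convention in which correct predictions are the positive class and misclassifications are the negative class; the function names `msp_confidence` and `fpr_at_95_tpr` are illustrative and do not come from the paper.

```python
import numpy as np

def msp_confidence(logits):
    """Maximum softmax value per row of a (n_samples, n_classes) logit array."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def fpr_at_95_tpr(confidence, is_correct):
    """Fraction of misclassified samples accepted when >= 95% of correct ones are.

    Assumption: a sample is "accepted" when its confidence is at or above
    the threshold; correct predictions are positives, failures are negatives.
    """
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    pos = np.sort(confidence[is_correct])   # ascending confidences of correct predictions
    neg = confidence[~is_correct]           # confidences of misclassified samples
    k = (len(pos) * 5) // 100               # at most 5% of positives may fall below the threshold
    threshold = pos[k]
    return float(np.mean(neg >= threshold))
```

A lower value means fewer misclassified samples slip through at a threshold that still accepts at least 95% of the correct predictions.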

Paper Structure

This paper contains 12 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Comparison between standard (MSP) and our approaches. MSP relies solely on class logits to predict failures, which is problematic in detecting overconfident but incorrect predictions. To tackle this problem, we propose to deconstruct each category into its associated human-level concepts for a finer-grained estimate of confidence.
  • Figure 2: Overview of the ORCA framework. We first prompt GPT-3.5 to construct the concept collection $\mathcal{A}$. We then pass the image and all the concepts into CLIP to retrieve the concept similarity scores, represented by the number above each bar, and sort them in descending order. Based on the top-$K$ responses, we analyze the interaction among concept activations through ordinal ranking to predict the model's failures, and interpret why it fails. "Detect failures" is triggered when the confidence falls below a predefined threshold. Best viewed in color.
  • Figure 3: Failure detection accuracy (AUROC) and false positive rate (FPR@95TPR) across different numbers of concepts on CIFAR-100. Overall, increasing the number of concepts improves performance on both metrics.
  • Figure 4: Failure detection capabilities of each weighting function on EuroSAT, where Logarithmic consistently outperforms others.
  • Figure 5: Failure interpretation with human-level concepts. We show the confidence scores of the top $3$ categories (left histograms) and similarity scores of the top $10$ concepts (right histograms) from CIFAR-10. Standard methods might output overconfident misclassifications due to: (a) spurious correlation and (b) cross-category resemblance. Concept-level signals not only achieve better failure detection capability in such scenarios but also enable further interpretation of why the model fails. "auto" is short for "automobile."
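
The pipeline described in the Figure 2 caption (score concepts, sort in descending order, inspect the top-$K$, compare a confidence value to a threshold) can be illustrated with a rough sketch. This is not the paper's ORCA-B or ORCA-R formula, which is not given in this summary. It is a hypothetical stand-in that assumes precomputed image-to-concept similarity scores (for example from CLIP), takes the category owning the top-ranked concept as the prediction, and weights rank $r$ by $1/\log_2(r+1)$. The name `rank_weighted_confidence`, the weighting, and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def rank_weighted_confidence(similarities, concept_to_class, k=10):
    """Hypothetical rank-based confidence over the top-k concept activations.

    similarities:     1-D array of image-to-concept similarity scores.
    concept_to_class: 1-D array giving the category index that owns each concept.
    Returns (predicted_class, confidence, top_indices).
    """
    sims = np.asarray(similarities, dtype=float)
    owners = np.asarray(concept_to_class)
    k = min(k, len(sims))
    # Indices of the k most similar concepts, in descending order of similarity.
    top = np.argsort(-sims, kind="stable")[:k]
    # Assumption: the category that owns the top-ranked concept is the prediction.
    predicted = owners[top[0]]
    # Assumption: logarithmic rank weights, so higher ranks count more.
    ranks = np.arange(1, k + 1)
    weights = 1.0 / np.log2(ranks + 1)
    # Confidence: weighted share of top-k concepts that belong to the predicted category.
    agree = owners[top] == predicted
    confidence = float((weights * agree).sum() / weights.sum())
    return predicted, confidence, top

# Example: four concepts, two per category; the ranking is mixed across categories.
pred, conf, top = rank_weighted_confidence([0.9, 0.8, 0.7, 0.6], [0, 1, 1, 0], k=4)
threshold = 0.7                      # hypothetical threshold, chosen only for this example
is_failure = conf < threshold        # flag a likely failure when confidence is low
disagreeing = [int(i) for i in top if [0, 1, 1, 0][i] != pred]  # concepts hinting at why
```

When the flagged concepts in `disagreeing` belong to a competing category, they give a concept-level hint about why the prediction may be wrong, in the spirit of the interpretation shown in Figure 5.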