Interpretable Failure Detection with Human-Level Concepts
Kien X. Nguyen, Tang Li, Xi Peng
TL;DR
This work tackles the problem of overconfident misclassifications in safety-critical settings by replacing category-level confidence signals with human-level concepts, enabling both reliable failure detection and interpretable explanations. The proposed ORCA framework, with two variants ORCA-B and ORCA-R, leverages a collection of concepts per category and the ordinal ranking of concept activations to produce more faithful confidence estimates than Maximum Softmax Probability (MSP). Empirical results across natural and satellite image benchmarks show substantial reductions in false positives (FPR@95TPR) and competitive AUROC and ACC, with ORCA-R often delivering the strongest failure-detection performance, especially in remote sensing. The approach additionally provides interpretable failure reasons via concept-level signals, facilitating debugging and improving safety in real-world deployments.
Abstract
Reliable failure detection is of paramount importance in safety-critical applications. Yet neural networks are known to produce overconfident predictions for misclassified samples. Failure detection therefore remains difficult, because existing confidence score functions rely on category-level signals, the logits. This research introduces a strategy that leverages human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model's confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activations for the input image. Without bells and whistles, our method significantly reduces the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9% on EuroSAT.
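To make the category-level baseline and the reported metric concrete, the following is a minimal sketch and not the authors' code. It computes a maximum-softmax confidence from logits and a false positive rate at a fixed true positive rate. It assumes the common failure-detection convention that correctly classified samples are the positives and misclassified samples are the negatives; the function names `msp_confidence` and `fpr_at_tpr` are hypothetical, and the concept-based scoring itself is not reproduced here.

```python
import numpy as np

def msp_confidence(logits):
    """Maximum softmax probability for each row of an (n_samples, n_classes) array."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def fpr_at_tpr(confidence, is_correct, tpr=0.95):
    """Fraction of misclassified samples still accepted when `tpr` of the
    correctly classified samples are accepted (assumed convention)."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    pos = np.sort(confidence[is_correct])[::-1]  # correct predictions, highest confidence first
    neg = confidence[~is_correct]                # misclassified predictions
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one correct and one misclassified sample")
    k = int(np.ceil(tpr * pos.size))             # number of correct samples that must be accepted
    threshold = pos[k - 1]                       # lowest confidence among those k samples
    return float(np.mean(neg >= threshold))

# Toy example: four correct predictions and two misclassified ones.
conf = [0.9, 0.8, 0.7, 0.6, 0.95, 0.3]
correct = [True, True, True, True, False, False]
print(fpr_at_tpr(conf, correct))  # 0.5: the overconfident error (0.95) passes the threshold
```

In this toy case the overconfident misclassification counts as a false positive, which is the failure mode a finer-grained, concept-level confidence score is meant to reduce.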
