Unveiling AI's Blind Spots: An Oracle for In-Domain, Out-of-Domain, and Adversarial Errors

Shuangpeng Han; Mengmi Zhang

TL;DR

This work tackles the reliability challenge of AI image classification by introducing a dedicated mentor model that predicts whether a mentee will err on a given image across in-domain, out-of-domain, and adversarial data. The mentor uses a two-stream architecture with logit distillation and a binary error predictor, optimized by a loss that combines the distillation term $L_d$ and the error-prediction term $L_r$ under a weight parameter that evolves during training. Key findings show that adversarial-error training yields the strongest predictive signal, transformer-based mentors generalize well across mentees, and the proposed SuperMentor further improves error prediction across diverse error types and architectures, including real-world medical imaging tasks. The framework aims to enhance trust and safety in AI systems by enabling proactive correction and monitoring of model behavior in high-stakes applications.

Abstract

AI models make mistakes when recognizing images, whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a "mentor" model, a deep neural network designed to predict another "mentee" model's errors. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an "oracle" mentor model, dubbed SuperMentor, that can outperform baseline mentors in predicting errors across different error types from the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems.
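
To make the mentor's supervision signal concrete, the following minimal sketch derives binary "correct"/"wrong" targets from a mentee's predictions and scores a mentor against them. The function names, the 1 = correct / 0 = wrong encoding, and the use of plain accuracy are illustrative assumptions, not details taken from the paper.

```python
def mentor_labels(mentee_predictions, true_labels):
    """Build mentor targets: 1 if the mentee was correct, 0 if it erred.

    The 1/0 encoding is an assumed convention for illustration.
    """
    if len(mentee_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    return [int(p == t) for p, t in zip(mentee_predictions, true_labels)]


def mentor_accuracy(mentor_predictions, targets):
    """Fraction of images where the mentor's correct/wrong call matches the target."""
    if not targets:
        raise ValueError("targets must be non-empty")
    hits = sum(int(m == t) for m, t in zip(mentor_predictions, targets))
    return hits / len(targets)


# Hypothetical example: the mentee mislabels the second image.
targets = mentor_labels(mentee_predictions=[3, 1, 2], true_labels=[3, 0, 2])
score = mentor_accuracy(mentor_predictions=[1, 0, 0], targets=targets)
```

The same target construction applies regardless of whether the images are in-domain, out-of-domain, or adversarial; only the image source changes.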

Paper Structure (27 sections, 1 equation, 12 figures, 4 tables)

This paper contains 27 sections, 1 equation, 12 figures, 4 tables.

Introduction
Related work
Error monitoring systems for AI models.
Out-of-domain detection.
Adversarial attack and defense.
Experimental setups
Mentors
Mentees and their datasets
Datasets for training and testing mentors
Baselines and evaluation metric
Results
Training on specific errors of mentees impacts the performance of mentors
Mentor architectures matter in error predictions
Training on images with smaller perturbations helps error predictions
Mentors generalize across mentees
...and 12 more sections

Figures (12)

Figure 1: AI models make mistakes and an "oracle" mentor model predicts when they will happen. A "mentee" neural network (black) was trained for multi-class image recognition, but it can still misclassify in-domain, out-of-domain, and adversarial images. For instance, it might mislabel an in-domain dog image as a cat. The mentor model (blue), which takes the same images as the mentee as input, predicts whether the mentee will make a mistake. For example, if the mentee incorrectly labels an adversarial dog image, the mentor's ground truth label is "wrong"; conversely, if the mentee correctly labels an out-of-domain dog image, the mentor's label is "correct". The mentee's parameters are frozen (snowflake), while the mentor's are trainable (fire). During inference (orange), the mentor predicts whether the mentee will make an error on test images that neither the mentee nor the mentor has seen before.
Figure 2: Overview of a mentor model. Given a fixed mentee model (snowflake), the mentor model takes an input image and uses a backbone pre-trained on ImageNet-1K (Deng et al., 2009) to extract features. The feature maps are then processed in two streams by multi-layer perceptrons (MLPs). The output logits $z_R$ from one stream are compared with the mentee's output logits $z_E$ using a distillation loss $L_{d}$. The other stream makes a binary prediction of whether the mentee makes a mistake, supervised by a logistic regression loss $L_{r}$. The MLP parameters are not shared between the two streams.
Figure 3: Mentors trained on adversarial images of a mentee outperform mentors trained on OOD and ID images of the same mentee. Average accuracy of a mentor trained on one type of error of a mentee is shown for the (a) C10, (b) C100, and (c) IN datasets. The three types of mentee errors are in-domain (ID, blue), out-of-domain (OOD, orange), and errors on images generated by adversarial attacks (AA, green). In each subplot, the labels on the x-axis read as [mentee]-[mentor], where "V" and "R" denote the ViT and ResNet50 architectures, respectively. Error bars indicate the standard deviation. The dotted black line indicates the chance level. See the sections "Datasets for training and testing mentors" and "Baselines and evaluation metric" for the error types and the evaluation metric. The four sets of bars in each subfigure correspond to the confusion matrices shown in subfigures (a), (b), (c), and (d) of the Appendix heatmap figures (fig:heatmap_C10 through fig:heatmap_IN).
Figure 4: A mentor's accuracy is heavily influenced by the level of image distortion introduced by out-of-domain perturbations and adversarial attacks. The accuracy of a ViT mentor is plotted as a function of the distortion level applied by PIFGSM (Gao et al., 2020) and Speckle Noise (SpN; Hendrycks and Dietterich, 2019) to the C10, C100, and IN images of a ResNet50-based mentee. The black dashed line indicates the chance level.
Figure 5: Mentors can generalize their error predictions across different mentee architectures. Mentors trained on mentee A's predictions (x-axis) are evaluated against the predictions from mentee B (y-axis). Each marker is a generalization experiment for a mentor trained on a given error type (marker shape) in a given image dataset (colour) of a mentee. The black dashed line indicates the diagonal.
...and 7 more figures
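
The two-stream objective described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the KL-divergence form of $L_d$, the binary cross-entropy form of $L_r$, the combination $L_r + \alpha L_d$, and every function name below are assumptions made for the sketch, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def distillation_loss(z_r, z_e):
    """Assumed form of L_d: KL(softmax(z_E) || softmax(z_R)).

    z_r: mentor logits from the distillation stream.
    z_e: logits of the frozen mentee.
    """
    log_p = log_softmax(z_e)
    log_q = log_softmax(z_r)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def error_prediction_loss(score, label):
    """Assumed form of L_r: binary cross-entropy on sigmoid(score).

    label is 1 if the mentee was correct and 0 if it erred (assumed encoding).
    """
    # Stable rewrite of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(score).
    return max(score, 0.0) - score * label + math.log1p(math.exp(-abs(score)))


def mentor_loss(z_r, z_e, score, label, alpha):
    """Assumed combination: L = L_r + alpha * L_d, with alpha set per training step."""
    return error_prediction_loss(score, label) + alpha * distillation_loss(z_r, z_e)


# Hypothetical values for one image.
loss = mentor_loss(z_r=[2.0, 0.5, -1.0], z_e=[1.5, 0.7, -0.8],
                   score=0.3, label=1, alpha=0.5)
```

In a real training loop the two streams would share the backbone features but use separate MLP heads, as the caption states, and `alpha` would follow whatever schedule the paper defines.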
