Unveiling AI's Blind Spots: An Oracle for In-Domain, Out-of-Domain, and Adversarial Errors
Shuangpeng Han, Mengmi Zhang
TL;DR
This work tackles the reliability challenge of AI image classification by introducing a dedicated mentor model that predicts whether a mentee will err on a given image across in-domain, out-of-domain, and adversarial data. The mentor uses a two-stream architecture with logit distillation and a binary error predictor, trained with a loss $L$ guided by an evolving weight parameter. Key findings show that adversarial-error training yields the strongest predictive signal, transformer-based mentors generalize well across mentees, and the proposed SuperMentor further improves error prediction across diverse error types and architectures, including real-world medical imaging tasks. The framework aims to enhance trust and safety in AI systems by enabling proactive correction and monitoring of model behavior in high-stakes applications.
Abstract
AI models make mistakes when recognizing images, whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a "mentor" model, a deep neural network designed to predict the errors of another "mentee" model. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Building on these observations, we develop an "oracle" mentor model, dubbed SuperMentor, that outperforms baseline mentors in predicting errors across different error types on the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems.
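To make the mentor/mentee setup concrete, the following is a minimal sketch of how binary error targets for a mentor could be derived from a mentee's predictions, and how a mentor's error predictions could then be scored. The function names, the toy data, and the use of plain accuracy as the score are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def mentee_error_labels(mentee_preds: Sequence[int],
                        true_labels: Sequence[int]) -> List[int]:
    """Return 1 where the mentee misclassifies an image, 0 where it is correct.

    Hypothetical helper: these binary targets are what a mentor would be
    trained to predict.
    """
    if len(mentee_preds) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    return [int(p != t) for p, t in zip(mentee_preds, true_labels)]


def mentor_accuracy(mentor_preds: Sequence[int],
                    error_labels: Sequence[int]) -> float:
    """Fraction of images where the mentor's error/no-error guess is right.

    Plain accuracy is an assumption made for this sketch.
    """
    if len(mentor_preds) != len(error_labels) or not error_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(int(m == e) for m, e in zip(mentor_preds, error_labels))
    return hits / len(error_labels)


# Toy data (made up): class indices predicted by the mentee vs. ground truth.
mentee_preds = [3, 1, 4, 1, 5]
true_labels = [3, 1, 4, 2, 6]
error_labels = mentee_error_labels(mentee_preds, true_labels)  # [0, 0, 0, 1, 1]

# Hypothetical mentor outputs: 1 means "the mentee will err on this image".
mentor_preds = [0, 0, 1, 1, 1]
print(mentor_accuracy(mentor_preds, error_labels))  # 0.8
```

In this framing the same labeling step applies whether the images are in-domain, out-of-domain, or adversarially perturbed; only the image source changes.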
