A Comprehensive Assessment Benchmark for Rigorously Evaluating Deep Learning Image Classifiers
Michael W. Spratling
TL;DR
The paper addresses the inadequacy of current evaluation protocols for image classifiers by proposing a Comprehensive Assessment Benchmark that tests performance on five types of data: clean, corrupt, adversarial, novel, and unrecognisable. It adopts a single metric, DAR, derived from an extended rejection-evaluation framework, and shows that existing models, including state-of-the-art robust models, exhibit weaknesses when evaluated across all of these data types. By comparing maximum softmax probability (MSP) with other post-hoc rejection methods and applying the benchmark to multiple datasets, the work reveals systematic trade-offs between accuracy on known data and the ability to reject unknown or corrupted inputs, challenging prevailing assumptions about robustness. The proposed framework promotes reproducible, cross-type evaluation and suggests that improving robustness may require ensemble or architectural strategies, with practical implications for deploying reliable image classifiers in diverse real-world environments.
Abstract
Reliable and robust evaluation methods are a necessary first step towards developing machine learning models that are themselves robust and reliable. Unfortunately, the evaluation protocols typically used to assess classifiers fail to comprehensively evaluate performance, as they tend to rely on limited types of test data and ignore others. For example, using the standard test data fails to evaluate the predictions made by the classifier for samples from classes it was not trained on. On the other hand, testing with data containing samples from unknown classes fails to evaluate how well the classifier can predict the labels for known classes. This article advocates benchmarking performance using a wide range of different types of data and using a single metric that can be applied to all such data types to produce a consistent evaluation of performance. Using the proposed benchmark, it is found that current deep neural networks, including those trained with methods that are believed to produce state-of-the-art robustness, are vulnerable to making mistakes on certain types of data. This means that such models will be unreliable in real-world scenarios where they may encounter data from many different domains, and that they are insecure as they can be easily fooled into making the wrong decisions. It is hoped that these results will motivate the wider adoption of more comprehensive testing methods that will, in turn, lead to the development of more robust machine learning methods in the future. Code is available at: https://codeberg.org/mwspratling/RobustnessEvaluation
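To make the idea of a single metric applied to every data type more concrete, the following is a minimal sketch of an accept/reject evaluation that uses the maximum softmax probability as the rejection score. It is not the paper's implementation, and the exact definition of DAR is not given in this section. The function names, the threshold value, the toy logits, and the scoring rule (a known-class sample is correct only if accepted and correctly labelled; an unknown-class sample is correct only if rejected) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_decision(logits, threshold):
    """Return the predicted class index, or None if the sample is rejected.

    Assumed rejection rule: reject when the maximum softmax probability
    is below `threshold`.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


def accept_reject_accuracy(samples, threshold):
    """Fraction of samples handled correctly under one accept/reject rule.

    Each sample is a (logits, label) pair. `label` is a class index for a
    known class, or None for an input that should be rejected (for example
    a novel-class or unrecognisable image). This scoring rule is an
    illustrative assumption, not the paper's definition of DAR.
    """
    correct = 0
    for logits, label in samples:
        # A decision of None (rejected) matches a label of None (should reject).
        if msp_decision(logits, threshold) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy logits for a 3-class classifier.
samples = [
    ([4.0, 0.0, 0.0], 0),     # known class, accepted and correctly labelled
    ([0.0, 4.0, 0.0], 2),     # known class, accepted but mislabelled
    ([0.2, 0.1, 0.0], None),  # unknown input, low confidence, rejected
    ([5.0, 0.0, 0.0], None),  # unknown input, wrongly accepted
]

score = accept_reject_accuracy(samples, threshold=0.9)
print(score)  # 0.5
```

Because the same function scores known and unknown inputs alike, a classifier cannot improve its score by rejecting everything or by accepting everything, which is the kind of trade-off the benchmark is designed to expose.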
