Evaluating Deep Neural Networks in Deployment (A Comparative and Replicability Study)
Eduard Pinconschi, Divya Gopinath, Rui Abreu, Corina S. Pasareanu
TL;DR
This work addresses the challenge of evaluating DNN reliability in deployment, particularly in safety-critical contexts where ground truth is unavailable at inference time. It analyzes two white-box post-hoc methods (SelfChecker and DeepInfer) and extends the comparison to Prophecy, introducing TrustBench and TrustDNN to standardize benchmarks, data preparation, models, and metrics across domains. The replication study finds substantial difficulties in reproducing results on the original artifacts, as well as ambiguity in the evaluation metrics, but it demonstrates that cross-domain comparisons are feasible with a unified framework and common metrics, notably the Matthews correlation coefficient (MCC). Overall, DeepInfer and SelfChecker show complementary strengths, while Prophecy offers data-type agnosticism with room for stabilization; the TrustBench/TrustDNN framework provides practical, open-source tooling to advance reproducible, comparable reliability research for safety-critical AI systems.
Abstract
As deep neural networks (DNNs) are increasingly used in safety-critical applications, there is growing concern about their reliability. Even extensively trained, high-performing networks are not 100% accurate, and it is very difficult to predict their behavior during deployment without ground truth. In this paper, we provide a comparative and replicability study of recent approaches that have been proposed to evaluate the reliability of DNNs in deployment. We find that it is hard to run these approaches and reproduce their results using their own replication packages, and even more difficult to run them on artifacts other than their own. Further, it is difficult to compare the effectiveness of the approaches due to the lack of clearly defined evaluation metrics. Our results indicate that more effort is needed in our research community to obtain sound techniques for evaluating the reliability of neural networks in safety-critical domains. To this end, we contribute an evaluation framework that incorporates the considered approaches and enables evaluation on common benchmarks, using common metrics.
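For readers unfamiliar with MCC (the Matthews correlation coefficient), the common metric named in the TL;DR, the sketch below shows how it can be computed from binary confusion-matrix counts. It is an illustration only, not code from TrustDNN. The framing is an assumption: the "positive" class is taken to mean that a reliability checker flags a model prediction as incorrect, and the function names `mcc` and `confusion_counts` are hypothetical. Returning 0.0 when the denominator is zero is a common convention, not something the paper specifies.

```python
import math


def confusion_counts(flagged, actually_wrong):
    """Count TP/TN/FP/FN for a checker, given offline ground truth.

    flagged[i]        -- True if the checker flagged prediction i as incorrect
    actually_wrong[i] -- True if prediction i really was incorrect
    (Assumed framing: 'positive' means 'flagged as incorrect'.)
    """
    tp = sum(f and w for f, w in zip(flagged, actually_wrong))
    tn = sum((not f) and (not w) for f, w in zip(flagged, actually_wrong))
    fp = sum(f and (not w) for f, w in zip(flagged, actually_wrong))
    fn = sum((not f) and w for f, w in zip(flagged, actually_wrong))
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention (assumed here): undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: four predictions, the checker is right on all of them.
flagged = [True, False, False, True]
actually_wrong = [True, False, False, True]
score = mcc(*confusion_counts(flagged, actually_wrong))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so a checker that never flags anything gets no credit on a benchmark where most model predictions are correct.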
