Rethinking Diversity in Deep Neural Network Testing
Zi Wang, Jihye Choi, Ke Wang, Somesh Jha
TL;DR
This work reframes deep neural network (DNN) testing as a directed task: finding inputs that elicit misclassifications. It argues that diversity-based metrics often fail to expose buggy behaviors, under both small distortions and natural input transformations. It introduces six directed metrics (forward margin score, forward loss score, backward margin score, backward loss score, mixed margin score, and mixed loss score), paired with gradient-based surrogates and an analytical linear-approximation framework, to identify high-risk inputs efficiently. By distinguishing natural transformations from small distortions and applying a scale-aware mix of forward and backward reasoning, the approach outperforms state-of-the-art diversity baselines at exposing bugs across four datasets and multiple architectures. The results underscore the importance of matching the testing objective to the underlying transformation type, and they yield a practical, efficient testing framework in which the scope of each metric is stated explicitly. The work has implications both for improving DNN reliability and for informing metamorphic testing strategies in real-world settings where perturbations occur frequently.
Abstract
Motivated by the success of traditional software testing, numerous diversity measures have been proposed for testing deep neural networks (DNNs). In this study, we propose a shift in perspective, advocating that DNN testing be treated as a directed testing problem rather than a diversity-based testing task. We note that the objective of testing DNNs is specific and well-defined: identifying inputs that lead to misclassifications. Consequently, a more precise testing approach is to prioritize inputs with a higher potential to induce misclassifications, as opposed to emphasizing inputs that enhance "diversity." We derive six directed metrics for DNN testing. Furthermore, we conduct a careful analysis of the appropriate scope for each metric, as applying a metric beyond its intended scope could significantly diminish its effectiveness. Our evaluation demonstrates that (1) diversity metrics are particularly weak indicators for identifying buggy inputs resulting from small input perturbations, and (2) our directed metrics consistently outperform diversity metrics in revealing erroneous behaviors of DNNs across all scenarios.
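The idea of prioritizing inputs by their potential to induce misclassification can be illustrated with a small sketch. This section does not define the six directed metrics, so the code below is not the authors' method: it uses a generic logit margin and a generic cross-entropy loss as stand-ins, and the function names (`margin_score`, `loss_score`, `prioritize`) and the example logits are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin_score(logits, label):
    """Assumed stand-in: true-class logit minus the largest other logit.

    A small or negative margin means the input is close to, or already
    across, the decision boundary.
    """
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return logits[label] - best_other


def loss_score(logits, label):
    """Assumed stand-in: cross-entropy loss of the true label."""
    return -math.log(softmax(logits)[label])


def prioritize(batch_logits, labels, score, risky_when_low):
    """Return input indices ordered from most to least risky."""
    scores = [score(z, y) for z, y in zip(batch_logits, labels)]
    return sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=not risky_when_low)


# Hypothetical logits for three inputs whose true label is class 0.
batch_logits = [[4.0, 1.0, 0.5], [2.0, 1.9, 0.1], [0.2, 3.0, 0.1]]
labels = [0, 0, 0]

by_margin = prioritize(batch_logits, labels, margin_score, risky_when_low=True)
by_loss = prioritize(batch_logits, labels, loss_score, risky_when_low=False)
```

Under these assumptions, both rankings place the already misclassified third input first and the confidently correct first input last. A directed tester would spend its budget on the top of such a ranking instead of selecting inputs that maximize a coverage or diversity measure.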
