Rethinking Diversity in Deep Neural Network Testing

Zi Wang, Jihye Choi, Ke Wang, Somesh Jha

TL;DR

This work reframes deep neural network testing as a directed task focused on uncovering inputs that elicit misclassifications, arguing that diversity-based metrics often fail to expose buggy behaviors under both small distortions and natural input transformations. It introduces six directed metrics—forward margin score, forward loss score, backward margin score, backward loss score, mixed margin score, and mixed loss score—paired with gradient-based surrogates and an analytical linear-approximation framework to efficiently identify high-risk inputs. By distinguishing natural transformations from small distortions and applying a scale-aware mix of forward and backward reasoning, the approach achieves superior bug-exposure performance across four datasets and multiple architectures, outperforming state-of-the-art diversity baselines. The results underscore the importance of aligning testing objectives with the underlying transformation type and provide a practical framework for robust DNN testing with measurable scope limitations and efficiency advantages. The work has implications for both improving DNN reliability and informing metamorphic testing strategies in real-world settings where perturbations occur frequently.

Abstract

Motivated by the success of traditional software testing, numerous diversity measures have been proposed for testing deep neural networks (DNNs). In this study, we propose a shift in perspective, advocating for the consideration of DNN testing as directed testing problems rather than diversity-based testing tasks. We note that the objective of testing DNNs is specific and well-defined: identifying inputs that lead to misclassifications. Consequently, a more precise testing approach is to prioritize inputs with a higher potential to induce misclassifications, as opposed to emphasizing inputs that enhance "diversity." We derive six directed metrics for DNN testing. Furthermore, we conduct a careful analysis of the appropriate scope for each metric, as applying metrics beyond their intended scope could significantly diminish their effectiveness. Our evaluation demonstrates that (1) diversity metrics are particularly weak indicators for identifying buggy inputs resulting from small input perturbations, and (2) our directed metrics consistently outperform diversity metrics in revealing erroneous behaviors of DNNs across all scenarios.

Paper Structure

This paper contains 61 sections, 1 theorem, 16 equations, 3 figures, and 28 tables.

Key Result

Proposition 3.1

Eq. (eq:l2) is the optimal solution to the linear approximation in the $\ell_2$-ball, and Eq. (eq:linf) is the optimal solution to the linear approximation in the $\ell_\infty$-ball.
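
The two referenced equations are not reproduced on this page, so the following is only a sketch under an assumption: that they take the standard closed forms for maximizing a linear function $g^\top \delta$ over a norm ball of radius $\epsilon$, namely $\delta = \epsilon\, g / \|g\|_2$ for the $\ell_2$-ball and $\delta = \epsilon\, \mathrm{sign}(g)$ for the $\ell_\infty$-ball. The gradient `grad`, the radius `eps`, and all function names are hypothetical. The sketch checks numerically that no randomly sampled feasible perturbation beats either closed form.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_step(grad, eps):
    """Maximizer of <grad, delta> subject to ||delta||_2 <= eps (assumed form)."""
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        return [0.0 for _ in grad]  # every feasible delta is optimal here
    return [eps * g / norm for g in grad]

def linf_step(grad, eps):
    """Maximizer of <grad, delta> subject to ||delta||_inf <= eps (assumed form)."""
    return [eps * math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]

grad = [3.0, -4.0]  # hypothetical gradient of the objective at the input
eps = 0.1           # hypothetical perturbation radius

best_l2 = dot(grad, l2_step(grad, eps))      # equals eps * ||grad||_2
best_linf = dot(grad, linf_step(grad, eps))  # equals eps * ||grad||_1

# Sanity check: random feasible perturbations never exceed the closed forms.
rng = random.Random(0)
for _ in range(1000):
    d = [rng.uniform(-eps, eps) for _ in grad]  # feasible for the l_inf ball
    assert dot(grad, d) <= best_linf + 1e-12
    n = math.sqrt(dot(d, d))
    if n > eps:                                 # rescale into the l_2 ball
        d = [eps * x / n for x in d]
    assert dot(grad, d) <= best_l2 + 1e-12
```

With these values the $\ell_2$ optimum is approximately $\epsilon \|g\|_2 = 0.5$ and the $\ell_\infty$ optimum is approximately $\epsilon \|g\|_1 = 0.7$. If the objective were to be minimized instead (for example, a margin), the sign of each step would flip.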

Figures (3)

  • Figure 1: Accuracy change of CIFAR100 VGG in adaptive standard testing using various metrics, with different input transformations: only natural transformations, only small distortions, and both natural transformations and small distortions.
  • Figure 2: Pseudo-accuracy change of MNIST LeNet-5 in adaptive metamorphic testing using various metrics, with different input transformations: only natural transformations, only small distortions, and both natural transformations and small distortions.
  • Figure 3: An illustration of the linear approximation of $f(x) = 0.5x^2$ (red curve) at $a = 0.5$. The gradient of $f$ at $a$ gives the linear approximation (blue line), which is close to $f$ at a point $x_1$ near $a$ but far from $f$ at a point $x_2$ far from $a$.
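
The behavior described in the Figure 3 caption can be reproduced numerically. The sketch below uses the caption's function $f(x) = 0.5x^2$ and expansion point $a = 0.5$; the two evaluation points `x1 = 0.6` and `x2 = 2.0` are hypothetical choices, since the caption does not give their values. For this $f$, the approximation error is exactly $0.5(x - a)^2$, so it grows quadratically with the distance from $a$.

```python
def f(x):
    """The function from the Figure 3 caption."""
    return 0.5 * x ** 2

def f_grad(x):
    """Derivative of f, computed by hand: d/dx (0.5 * x^2) = x."""
    return x

def linear_approx(x, a):
    """First-order Taylor approximation of f around a."""
    return f(a) + f_grad(a) * (x - a)

a = 0.5             # expansion point from the caption
x1, x2 = 0.6, 2.0   # hypothetical near and far evaluation points

err_near = abs(f(x1) - linear_approx(x1, a))  # small: x1 is close to a
err_far = abs(f(x2) - linear_approx(x2, a))   # large: x2 is far from a

print(f"error at x1={x1}: {err_near:.4f}")
print(f"error at x2={x2}: {err_far:.4f}")
```

This is the intuition behind restricting gradient-based reasoning to small perturbations: the first-order surrogate is only trustworthy near the point where the gradient was taken.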

Theorems & Definitions (5)

  • Remark 2.1
  • Remark 2.2
  • Remark 2.3
  • Proposition 3.1
  • Remark 3.2