Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs?

Lev Sorokin, Damir Safin, Shiva Nejati

TL;DR

This paper presents a theoretical argument for why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain, and supports the argument with empirical results from two widely used Pareto-based optimization techniques, NSGA-II and OMOPSO.

Abstract

Search-based software testing (SBST) is a widely adopted technique for testing complex systems with large input spaces, such as Deep Learning-enabled (DL-enabled) systems. Many SBST techniques focus on Pareto-based optimization, where multiple objectives are optimized in parallel to reveal failures. However, it is important to ensure that identified failures are spread throughout the entire failure-inducing area of a search domain and not clustered in a sub-region. This ensures that identified failures are semantically diverse and reveal a wide range of underlying causes. In this paper, we present a theoretical argument explaining why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain. We support our argument with empirical results obtained by applying two widely used types of Pareto-based optimization techniques, namely NSGA-II (an evolutionary algorithm) and OMOPSO (a swarm-based Pareto-optimization algorithm), to two DL-enabled systems: an industrial Automated Valet Parking (AVP) system and a system for classifying handwritten digits. We measure the coverage of failure-revealing test inputs in the input space using a metric that we refer to as the Coverage Inverted Distance quality indicator. Our results show that NSGA-II-based search and OMOPSO are not more effective than a naïve random search baseline in covering test inputs that reveal failures. The replication package for this study is available in a GitHub repository.
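To make the notion of Pareto-based optimization in the abstract concrete, the following is a minimal sketch of Pareto dominance and of filtering a set of objective vectors down to its non-dominated subset. It assumes that all objectives are minimized; the names `dominates`, `pareto_front`, and the sample `fitness` values are illustrative and are not taken from the paper or its replication package.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (all objectives minimized).

    `a` must be no worse than `b` in every objective and strictly
    better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors for five test inputs (two objectives).
fitness = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6), (0.9, 0.9)]
front = pareto_front(fitness)
print(front)
```

In this toy data, `(0.6, 0.6)` is discarded because `(0.5, 0.5)` dominates it, even though the two points are close in objective space. A search that favors non-dominated points concentrates on the front rather than on the region around it, which is the intuition behind the coverage concern raised in the abstract.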

Paper Structure

This paper contains 20 sections, 21 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustrating the limitation of Pareto-based optimization algorithms in covering failures: (a) shows how a Pareto front (PF) and a codomain of interest (COI) are located in the objective space; (b) shows how a Pareto set (PS) and a domain of interest (DOI) are located in the search space. The dimensionality of PF is lower than that of COI, and the dimensionality of PS is lower than that of DOI. Hence, PF cannot effectively cover COI, and PS cannot effectively cover DOI.
  • Figure 2: Illustrative examples showing the computation of CID: DOI is represented as two separate regions, highlighted in pink in the three examples. The test set, $A$, is presented as solid red points in all three examples. The reference set, $Z$, is represented by the non-filled pink circles. The example in (a): $A$ barely covers DOI; the example in (b): $A$ covers one region of DOI well; and the example in (c): $A$ covers both regions of DOI well.
  • Figure 3: The test scenario for the AVP Case Study: A pedestrian is crossing the trajectory of a vehicle equipped with the automated valet parking system.
  • Figure 4: Example test inputs for the MNIST case study. Left column: original digits, correctly classified as 5. Right column: corresponding label-preserving digits generated by NSGA-II and labeled as 8 by the classifier under test. Classification certainty differences between the expected label and the highest prediction by the classifier for the mutated digits from the top right to the bottom right are: -0.51, -0.79, -0.69.
  • Figure 5: MNIST Case Study. The averages and standard deviations of CID values obtained from 10 runs of RS, NSGA-II, NSGA-II-D and OMOPSO for the three test oracle functions $O_{Large}$, $O_{Medium}$, and $O_{Small}$ of MNIST. The CID values are plotted after every 100 evaluations. The reference set is computed from 15,625 test inputs generated uniformly using Grid Sampling.
  • ...and 3 more figures
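
The Figure 2 caption describes CID in terms of a test set $A$ and a reference set $Z$ sampled from the DOI. The sketch below assumes that CID is computed like an inverted generational distance applied in the input space, that is, as the average Euclidean distance from each reference point in $Z$ to its nearest test input in $A$. That formula is an assumption made for illustration; the function name `cid` and the sample points are hypothetical and do not come from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def cid(test_set: List[Point], reference_set: List[Point]) -> float:
    """Average distance from each reference point to its nearest test input.

    Assumed formula (IGD-style): lower values mean the test set covers
    the reference set, and hence the sampled DOI, more completely.
    """
    if not test_set or not reference_set:
        raise ValueError("both sets must be non-empty")
    total = sum(
        min(math.dist(z, a) for a in test_set)  # nearest test input to z
        for z in reference_set
    )
    return total / len(reference_set)


# Hypothetical reference set Z spanning two separate failure regions.
Z = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]

# A test set that covers only the first region, and one that covers both.
A_one_region = [(0.0, 0.0), (1.0, 0.0)]
A_both_regions = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 4.0)]

print(cid(A_one_region, Z))
print(cid(A_both_regions, Z))
```

Under this assumed definition, the test set that reaches both regions scores lower than the one clustered in a single region, mirroring the contrast between examples (b) and (c) in Figure 2.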

Theorems & Definitions (2)

  • Definition 1
  • Definition 2: Pareto-based Optimization