Test Input Validation for Vision-based DL Systems: An Active Learning Approach

Delaram Ghobari, Mohammad Hossein Amini, Dai Quoc Tran, Seunghee Park, Shiva Nejati, Mehrdad Sabetzadeh

TL;DR

This work tackles the challenge of validating test inputs for vision-based DL systems, where synthetic transformations risk producing invalid inputs that distort evaluation. It introduces HiL-TV, a human-in-the-loop validation method that leverages 13 image-comparison metrics and an active-learning loop to automatically classify valid versus invalid transformed image pairs, with manual labeling triggered only when classifier confidence is low. Across industrial and public datasets, HiL-TV demonstrates substantial gains in accuracy and precision over single-metric baselines, achieving an average accuracy of about 97% and an overall improvement of at least 12.9% versus baselines. The approach provides practical trade-off control via parameters $\alpha$, $\beta$, and $|D_v|$, and offers strong evidence that combining multiple metrics with active learning yields robust, scalable test input validation for real-world vision DL systems.

Abstract

Testing deep learning (DL) systems requires extensive and diverse, yet valid, test inputs. While synthetic test input generation methods, such as metamorphic testing, are widely used for DL testing, they risk introducing invalid inputs that do not accurately reflect real-world scenarios. Invalid test inputs can lead to misleading results. Hence, there is a need for automated validation of test inputs to ensure effective assessment of DL systems. In this paper, we propose a test input validation approach for vision-based DL systems. Our approach uses active learning to balance the trade-off between accuracy and the manual effort required for test input validation. Further, by employing multiple image-comparison metrics, it achieves better results in classifying valid and invalid test inputs compared to methods that rely on single metrics. We evaluate our approach using an industrial and a public-domain dataset. Our evaluation shows that our multi-metric, active learning-based approach produces several optimal accuracy-effort trade-offs, including those deemed practical and desirable by our industry partner. Furthermore, provided with the same level of manual effort, our approach is significantly more accurate than two state-of-the-art test input validation methods, achieving an average accuracy of 97%. Specifically, the use of multiple metrics, rather than a single metric, results in an average improvement of at least 5.4% in overall accuracy compared to the state-of-the-art baselines. Incorporating an active learning loop for test input validation yields an additional 7.5% improvement in average accuracy, bringing the overall average improvement of our approach to at least 12.9% compared to the baselines.
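The confidence-gated loop described above can be illustrated with a small sketch. It is not the authors' implementation: this summary does not specify the classifier, the 13 metrics, or the exact meaning of $\alpha$ and $\beta$, so the code substitutes a toy nearest-centroid classifier over metric vectors. The names `validate`, `ask_human`, and `confidence_threshold` are hypothetical. The only idea taken from the text is that a pair is labeled automatically when the classifier is confident and is sent to a human otherwise, with each human label added to the training set.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]  # one score per image-comparison metric (toy: 2 metrics)


def centroid(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(labeled: List[Tuple[Vector, bool]], x: Vector) -> Tuple[bool, float]:
    """Nearest-centroid stand-in classifier (an assumption, not the paper's model).

    Returns (predicted_valid, confidence), with confidence in [0.5, 1.0].
    """
    valid = [v for v, is_valid in labeled if is_valid]
    invalid = [v for v, is_valid in labeled if not is_valid]
    if not valid or not invalid:
        raise ValueError("need at least one valid and one invalid labeled example")
    d_valid = dist(x, centroid(valid))
    d_invalid = dist(x, centroid(invalid))
    total = d_valid + d_invalid
    if total == 0.0:
        return True, 0.5  # x sits on both centroids: no information
    p_valid = d_invalid / total  # closer to the valid centroid gives a higher score
    return p_valid >= 0.5, max(p_valid, 1.0 - p_valid)


def validate(
    pre_validated: List[Tuple[Vector, bool]],
    unlabeled: List[Vector],
    ask_human: Callable[[Vector], bool],
    confidence_threshold: float = 0.8,  # hypothetical knob, not the paper's alpha/beta
) -> Tuple[List[bool], int]:
    """Label each vector, querying the human only for low-confidence cases."""
    labeled = list(pre_validated)  # plays the role of the pre-validated subset
    labels: List[bool] = []
    manual_queries = 0
    for x in unlabeled:
        label, confidence = predict(labeled, x)
        if confidence < confidence_threshold:
            label = ask_human(x)        # manual validation
            labeled.append((x, label))  # the new label informs later predictions
            manual_queries += 1
        labels.append(label)
    return labels, manual_queries


if __name__ == "__main__":
    seed = [((0.9, 0.9), True), ((0.1, 0.1), False)]
    pairs = [(0.95, 0.9), (0.5, 0.5), (0.05, 0.1)]
    # Toy oracle standing in for the human validator.
    print(validate(seed, pairs, ask_human=lambda v: sum(v) >= 1.0))
```

In this toy setting, raising `confidence_threshold` sends more pairs to the human and lowering it automates more decisions, which mirrors the accuracy-versus-effort trade-off the abstract describes.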

Paper Structure

This paper contains 16 sections, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An original power-grid image (a) and its transformations to replicate snowy conditions: (b) is a valid transformation of (a), while (c) and (d) are invalid transformations of (a).
  • Figure 2: Accuracy vs. human effort trade-offs for various configurations of HiL-TV, based on different classifiers, parameter values of $\alpha$ and $\beta$, and sizes of the pre-validated subset $D_v$.
  • Figure 3: Decision trees identifying classification techniques and parameter values that yield acceptable trade-offs between validation accuracy and human effort for our industry partner. At each node, N = [x, y] indicates the count of unacceptable (x) and acceptable (y) configurations reaching that point.
  • Figure 4: Accuracy, precision, and recall distributions across 20 runs for HiL-TV, HiL-TV without active learning, and the two baselines, B-VIF and B-VAE, at varying levels of human effort for both the industry (left) and public (right) datasets.