Benchmarking Generative AI Models for Deep Learning Test Input Generation

Maryam, Matteo Biagiola, Andrea Stocco, Vincenzo Riccio

TL;DR

This work studies how Generative AI (GenAI) test input generators (TIGs) can be used to probe deep learning (DL) image classifiers beyond their training data. Using a standardized, cross-architecture benchmarking framework that manipulates latent representations, it compares variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models across MNIST, SVHN, CIFAR-10, and ImageNet, with 364 human assessments. The results show that simpler models suffice for easy tasks, while diffusion models excel on complex datasets, achieving higher validity and label-preservation rates, albeit at greater computational cost. The findings offer practical guidance on selecting GenAI TIGs based on task complexity and highlight the trade-offs between efficiency, validity, and label preservation in automated test generation.

Abstract

Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
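The core idea of a latent-space TIG can be illustrated with a toy sketch: start from a latent vector whose decoded image is classified correctly, perturb it in small steps, and stop once the classifier's prediction changes. This is not the paper's implementation. The `decode` and `classify` functions below are hypothetical stand-ins for a trained generative model and a classifier under test, and the Gaussian random walk is an assumed perturbation strategy chosen only for simplicity.

```python
import random


def decode(z):
    # Hypothetical stand-in for a generative decoder (VAE, GAN, or diffusion model):
    # maps a 2-D latent vector to four "pixel" values.
    return [z[0], z[1], z[0] + z[1], z[0] - z[1]]


def classify(x):
    # Hypothetical stand-in for the classifier under test:
    # predicts label 1 if the pixel sum is positive, otherwise label 0.
    return 1 if sum(x) > 0 else 0


def perturb_until_misclassified(z, expected_label, step=0.1, max_iters=1000, seed=0):
    """Random-walk the latent vector until the predicted label differs from
    expected_label. Returns (latent, image, iterations), or None if the
    iteration budget runs out first."""
    rng = random.Random(seed)
    current = list(z)
    for i in range(1, max_iters + 1):
        # Assumed perturbation: add small Gaussian noise to every latent dimension.
        current = [v + rng.gauss(0.0, step) for v in current]
        image = decode(current)
        if classify(image) != expected_label:
            return current, image, i
    return None


seed_latent = [0.2, 0.1]
seed_label = classify(decode(seed_latent))
result = perturb_until_misclassified(seed_latent, seed_label)
if result is None:
    print("no misclassification found within the budget")
else:
    latent, image, iterations = result
    print(f"prediction changed after {iterations} perturbation steps")
```

A prediction change alone does not make the generated image a useful test input. As the abstract notes, the image must also be valid for the domain and preserve its original label, which the study assesses with human evaluations and which this sketch does not check.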

Paper Structure

This paper contains 27 sections, 2 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Misclassification-inducing test images for handwritten digit classifiers: (a) valid and label-preserving, (b) valid but not label-preserving, (c) invalid.
  • Figure 2: Summary of the GenAI models considered in this paper and the process by which TIGs perturb their latent vectors.
  • Figure 3: Misclassification-inducing images generated by GenAI TIGs.