Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models

Adrien LeCoz, Houssem Ouertatani, Stéphane Herbin, Faouzi Adjed

TL;DR

The paper addresses reliable evaluation of image classifiers under distribution shift by using text-conditioned diffusion models to synthesize failure-inducing images. It introduces an iterative pipeline that alternates image generation, classifier evaluation, and attribute-based subdomain selection, with selection guided by a genetic algorithm (GA) or Bayesian optimization (BO). Empirical results show that GA and BO outperform random selection and combinatorial testing, identifying high-risk subdomains efficiently despite the substantial generation cost. The approach makes failures more interpretable by tying them to textual attributes, and it points to future gains from multi-fidelity evaluations and embedding-based representations for scaling up benchmarking.

Abstract

Image classifiers should be used with caution in the real world. Performance evaluated on a validation set may not reflect performance in the real world. In particular, classifiers may perform well for conditions that are frequently encountered during training, but poorly for other infrequent conditions. In this study, we hypothesize that recent advances in text-to-image generative models make them valuable for benchmarking computer vision models such as image classifiers: they can generate images conditioned on textual prompts that cause classifier failures, allowing failure conditions to be described with textual attributes. However, their generation cost becomes an issue when a large number of synthetic images need to be generated, which is the case when many different attribute combinations need to be tested. We propose an image classifier benchmarking method as an iterative process that alternates image generation, classifier evaluation, and attribute selection. This method efficiently explores the attribute combinations and ultimately detects those that lead to poor classifier behavior.
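
The alternation of generation, evaluation, and selection described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the attribute space `ATTRIBUTES`, the prompt template, and the function names are hypothetical. `evaluate_subdomain` replaces the text-to-image model and the classifier with a toy simulation, and `select_mutate_worst` is a simple local-search heuristic, not the paper's GA or Bayesian optimization.

```python
import itertools
import random

# Hypothetical attribute space; the names and values are illustrative only.
ATTRIBUTES = {
    "weather": ["sunny", "raining", "snowing"],
    "location": ["beach", "forest", "street"],
    "time": ["day", "night"],
}


def key_of(subdomain):
    """Hashable identifier for a subdomain (one value per attribute)."""
    return tuple(sorted(subdomain.items()))


def make_prompt(subdomain):
    """Build the textual prompt that describes a subdomain (assumed template)."""
    return "a photo of a dog, {weather} weather, {location}, {time}".format(**subdomain)


def evaluate_subdomain(subdomain, n_images, rng):
    """Stand-in for: generate n_images from the prompt, classify them, return accuracy.

    A real pipeline would call a text-to-image model and the classifier under
    test here. The toy failure pattern below is invented for the demo.
    """
    p_correct = 0.95
    if subdomain["time"] == "night":
        p_correct -= 0.3
    if subdomain["weather"] == "snowing":
        p_correct -= 0.3
    correct = sum(rng.random() < p_correct for _ in range(n_images))
    return correct / n_images


def select_random(history, candidates, rng):
    """Baseline: pick any subdomain that has not been evaluated yet."""
    return rng.choice(candidates)


def select_mutate_worst(history, candidates, rng):
    """Heuristic: change one attribute of the lowest-accuracy subdomain so far."""
    if not history:
        return rng.choice(candidates)
    worst = dict(min(history, key=history.get))
    neighbors = [c for c in candidates
                 if sum(c[k] != worst[k] for k in worst) == 1]
    return rng.choice(neighbors) if neighbors else rng.choice(candidates)


def explore(select, budget, n_images=20, seed=0):
    """Alternate selection and (simulated) generation + evaluation."""
    rng = random.Random(seed)
    all_subdomains = [dict(zip(ATTRIBUTES, values))
                      for values in itertools.product(*ATTRIBUTES.values())]
    history = {}  # subdomain key -> measured accuracy
    for _ in range(budget):
        candidates = [s for s in all_subdomains if key_of(s) not in history]
        if not candidates:
            break  # the whole evaluation domain has been explored
        subdomain = select(history, candidates, rng)
        history[key_of(subdomain)] = evaluate_subdomain(subdomain, n_images, rng)
    return history


if __name__ == "__main__":
    for select in (select_random, select_mutate_worst):
        history = explore(select, budget=6)
        worst = min(history, key=history.get)
        print(select.__name__, make_prompt(dict(worst)), history[worst])
```

The selection function receives the accuracies of all previously evaluated subdomains, so any strategy that uses this feedback can be swapped in without changing the loop. The budget caps the number of evaluated subdomains, which is the quantity that drives generation cost.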

Paper Structure (31 sections, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of our method that alternates generation, evaluation, and selection. The selection function selects the next subdomain to evaluate, based on the feedback of the previous subdomains evaluated. With the right choice of selection function, an efficient exploration of the evaluation domain is achieved.
  • Figure 2: Samples of generated images with their associated prompt. Images on the top row are classified as dogs, while those at the bottom are not. Note that some biases of the generative model appear: sunglasses at the beach and an umbrella when raining.
  • Figure 3: 3-wise testing selects 61 subdomains to evaluate, most of which are high-accuracy. We compare this with the other methods when they are allowed to explore 61 subdomains. GA and Bayesian optimization identify many more low-accuracy subdomains.
  • Figure 4: Different metrics to compare the quality of the subdomain selection when iterating over the generation, evaluation, and selection loop. In general, combinatorial testing is not much better than random selection, and it only gives a few options for the number of subdomains selected. GA and BO are much more efficient and can explore any given number of subdomains according to the computation time available. Note that the x-axis of the accuracy, average-accuracy, and coverage plots could be replaced by GPU-hours going from 0 to $\approx$ 200, as mentioned in the subsection on evaluating all subdomains. All plots are averages over 10 seeds, and the standard deviations are shown.
  • Figure 5: Average accuracies for each value of each attribute. The 95% confidence interval is also shown.
  • ...and 2 more figures