Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

Peiyan Zhang, Haoyang Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang

TL;DR

This paper tackles the misalignment between fixed benchmarks and real-world robustness by proposing a dynamic evaluation framework that treats a zoo of foundation models as surrogate oracles. It introduces a counterfactual-generation method, guided by a foundation-model ensemble, that perturbs images while preserving the underlying image–label structure, and it defines Foundation Model-oriented Robustness (FMR) to quantify robustness relative to the oracle. The authors run experiments across standard and robust vision models on MNIST, CIFAR-10, and ImageNet, showing that transformer-based architectures and certain perturbation strategies yield higher FMR, while some existing robustness methods falter under dynamic evaluation. They also analyze biases, zero-shot limitations, and the transferability of the generated perturbations, arguing that dynamic, foundation-model–driven evaluation gives a more credible and actionable picture of model robustness and offers guidance for future improvements.

Abstract

Machine learning has demonstrated remarkable performance over finite datasets, yet whether scores on fixed benchmarks sufficiently indicate a model's performance in the real world is still under discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., human users), so a good evaluation protocol should probably compare the model's behavior with the oracle's. In this paper, we introduce a new robustness measurement that directly measures an image classification model's performance against a surrogate oracle (i.e., a foundation model). We also design a simple method that carries the evaluation beyond the scope of the benchmarks. Our method extends image datasets with new samples that are perturbed enough to be distinct from those in the original sets, yet remain bounded within the same image-label structure that the original test image represents, as constrained by a foundation model pretrained on a large number of samples. As a result, our method offers a new way to evaluate models' robustness, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also use our generated data to understand the behaviors of the models and of our new evaluation strategies.
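
The abstract describes the evaluation loop only in words, so the following is a minimal Python sketch of one possible reading, not the authors' implementation. It uses toy one-dimensional "images" and threshold classifiers. Every name (`perturb_within_oracle`, `oracle_relative_accuracy`, `oracle`, `brittle_model`) is hypothetical, and the score computed here is an illustrative stand-in, since the abstract does not state the FMR formula.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[float], int]


def perturb_within_oracle(x: float, label: int, oracle: Classifier,
                          model: Classifier, step: float = 0.25,
                          max_steps: int = 50) -> float:
    """Push x away from its start while the oracle still assigns `label`.

    Assumption: a fixed-direction additive step stands in for the paper's
    image perturbation. The search stops as soon as the evaluated model is
    fooled, or when the next step would leave the oracle's label region.
    """
    current = x
    for _ in range(max_steps):
        candidate = current + step
        if oracle(candidate) != label:
            break  # candidate leaves the image-label structure; reject it
        current = candidate
        if model(current) != label:
            break  # found an oracle-approved sample the model gets wrong
    return current


def oracle_relative_accuracy(samples: Sequence[Tuple[float, int]],
                             oracle: Classifier, model: Classifier,
                             step: float = 0.25) -> float:
    """Accuracy of `model` on perturbed samples the oracle still accepts.

    Assumption: samples whose label the oracle itself rejects are skipped,
    because the oracle cannot bound their perturbations.
    """
    kept = correct = 0
    for x, y in samples:
        if oracle(x) != y:
            continue
        x_perturbed = perturb_within_oracle(x, y, oracle, model, step=step)
        kept += 1
        correct += int(model(x_perturbed) == y)
    return correct / kept if kept else float("nan")


def oracle(x: float) -> int:
    """Toy surrogate oracle: class 1 iff x >= 0."""
    return 1 if x >= 0 else 0


def brittle_model(x: float) -> int:
    """Toy evaluated model with a misplaced decision boundary at -1."""
    return 1 if x >= -1 else 0


# The third sample is mislabeled from the oracle's point of view and is skipped.
samples = [(-2.0, 0), (0.5, 1), (-0.5, 1)]
score = oracle_relative_accuracy(samples, oracle, brittle_model)
```

In this toy setup the first sample can be moved to a point that the oracle still assigns to class 0 but `brittle_model` assigns to class 1, so the model is correct on one of the two counted samples. A model that matches the oracle everywhere would score 1.0, which mirrors the abstract's caveat that the evaluation is scoped by the power of the oracle.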

Paper Structure

This paper contains 48 sections, 1 theorem, 40 equations, 5 figures, 15 tables, 1 algorithm.

Key Result

Proposition A.1

Under Assumptions I and II, we have estimators where $\leq$ holds element-wise.

Figures (5)

  • Figure 1: The main structure of our system for generating test images with foundation models, with examples of the generated images and their effectiveness in evaluating a model's robustness.
  • Figure 2: Visualization of the images generated by our system when evaluating the common-corruption-robust models, with the original image shown (left image of each row). The caption for each image is either the original label or the label predicted by the corresponding model. The evaluated models are SIN, ANT, ANT+SIN, AugMix, DeepAug, DeepAug+AM and DAT from left to right.
  • Figure 3: Visualization of the perturbed images generated by our system when evaluating the vanilla model (middle image of each group) and the grayscale model (third image of each group), with the original image shown. The caption for each image is either the original label or the label predicted by the corresponding model.
  • Figure 4: Visualization of the images generated by our system when evaluating the common-corruption-robust models, with the original image shown (left image of each row). The caption for each image is either the original label or the label predicted by the corresponding model. The evaluated models are SIN, ANT, ANT+SIN, AugMix, DeepAug and DeepAug+AM from left to right.
  • Figure : Perturbed Image Generation with Foundation Models

Theorems & Definitions (2)

  • Proposition A.1
  • proof