Understanding Robustness of Transformers for Image Classification

Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit

TL;DR

This work presents a comprehensive empirical study comparing the robustness of Vision Transformers (ViT) and ResNets for image classification. Evaluating both model families across natural corruptions, distribution shifts, adversarial perturbations, spatial transformations, and texture bias, the authors find that, with large-scale pretraining, ViTs are at least as robust as ResNets, and that they exhibit notable redundancy that allows significant pruning. The study also finds that patch size, attention locality, and CLS-token dynamics shape robustness: larger patches increase vulnerability to spatial attacks, while more pretraining data improves shape-biased predictions. Overall, the robustness of ViTs scales with data and model size, which offers practical guidance for deploying ViTs in real-world scenarios and for making architectural choices. The findings suggest that ViTs can be highly robust in data-rich regimes and point to efficiency gains from pruning and localized attention.

Abstract

Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.

Paper Structure

This paper contains 32 sections, 34 figures, 5 tables.

Figures (34)

  • Figure 1: Transformers vs. ResNets. While they achieve similar performance for image classification, Transformer and ResNet architectures process their inputs very differently. Shown here are adversarial perturbations computed for a Transformer and a ResNet model, which are qualitatively quite different.
  • Figure 2: Robustness Benchmarks. Accuracy of ViT and ResNet models on ILSVRC-2012 (clean), ImageNet-C, ImageNet-R and ImageNet-A. For ImageNet-C the accuracy is averaged across all corruption types and severity levels. We observe that (i) relative accuracy on ILSVRC-2012 is generally predictive of relative accuracy on the perturbed datasets, and that when trained on sufficient data, the accuracy of ViT models (ii) outperforms ResNets, and (iii) scales better with model size. Marker size related to model size. Detailed results for ImageNet-C can be found in the appendix.
  • Figure 3: Adversarial Perturbations. Accuracy on a subset of 1000 images in ILSVRC-2012 validation of ViT and ResNet models, on clean images (left) vs. those subject to model-specific adversarial attacks: FGSM and PGD-based perturbations (middle), and spatial (rotation and translation) transformations (right). (We omit ViT-H/14 here, since it expects a different input image resolution than the other models.) ResNet models are more robust to the simpler FGSM attack than their ViT counterparts, but this advantage disappears for the more successful PGD attacks. For spatial attacks, the $16\times16$ ViT models exhibit equivalent robustness to ResNets of comparable size, but ViT models with the larger patch-size of $32\times32$ fare worse.
  • Figure 4: Scaling. Performance of ViT and ResNet models as a function of the number of model parameters. All models are pre-trained on JFT-300M and fine-tuned on ILSVRC-2012. We see consistent trends across different input perturbations: scaling up ViTs provides better robustness gains than scaling up ResNets.
  • Figure 5: Example Perturbations. For example images from the ILSVRC 2012 validation set, we illustrate the perturbations computed using PGD for two ViT models and two ResNet models (we use models pre-trained on JFT-300M). The perturbations are visualized as images by linearly transforming their intensity from the original range of $[-1, 1]$ to $[0, 255]$.
  • ...and 29 more figures
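
The Figure 5 caption describes visualizing perturbations by linearly mapping their intensity from $[-1, 1]$ to $[0, 255]$. A minimal sketch of such a mapping is given below, assuming the perturbation is available as a flat list of floats. The function name `perturbation_to_uint8`, the clipping step, and the rounding to the nearest integer are illustrative assumptions, not details taken from the paper.

```python
def perturbation_to_uint8(values, lo=-1.0, hi=1.0):
    """Linearly map values from [lo, hi] to integer intensities in [0, 255].

    Hypothetical helper for illustration only; the paper does not specify
    how clipping or rounding is handled.
    """
    scale = 255.0 / (hi - lo)
    out = []
    for v in values:
        # Assumption: clip out-of-range values so the result stays in [0, 255].
        clipped = min(max(v, lo), hi)
        out.append(int(round((clipped - lo) * scale)))
    return out


# The endpoints of the input range map to the endpoints of the output range.
pixels = perturbation_to_uint8([-1.0, 0.5, 1.0])
print(pixels)
```

Under this mapping a zero perturbation lands at mid-gray, so negative and positive components of a perturbation appear as darker and lighter pixels respectively.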