Understanding Robustness of Transformers for Image Classification
Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit
TL;DR
This work presents a broad empirical comparison of Vision Transformers (ViT) and ResNets to understand robustness in image classification. The authors evaluate both model families on natural corruptions, distribution shifts, adversarial perturbations, spatial transformations, and texture bias. They find that, with large-scale pretraining, ViTs are at least as robust as ResNets, and that ViTs contain notable redundancy that permits significant pruning. The study also reports that patch size, attention locality, and CLS-token dynamics shape robustness: larger patches increase vulnerability to spatial attacks, while more pretraining data improves shape-biased predictions. Overall, ViT robustness scales with data and model size, which offers practical guidance for deploying ViTs in real-world scenarios and for making architectural choices. The findings suggest that ViTs can be highly robust in data-rich regimes and point to efficiency gains from pruning and localized attention.
Abstract
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
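The two model-perturbation measurements mentioned in the abstract, removing a single layer and comparing activations across layers, can be illustrated with a minimal sketch. It is not the authors' code or experimental setup. It assumes a toy residual stack with random weights in NumPy, and every name in it (`make_blocks`, `forward`, `ablation_agreement`, `layer_correlation`) is hypothetical. Because each block adds an update to a shared residual stream, skipping one block leaves the rest of the computation well defined.

```python
import numpy as np


def make_blocks(depth, width, seed=0):
    """Create random weights for a toy residual stack (an assumption, not a trained ViT)."""
    rng = np.random.default_rng(seed)
    # Small weights so that each block is a modest update to the residual stream.
    scale = 0.1 / np.sqrt(width)
    return [rng.normal(0.0, scale, size=(width, width)) for _ in range(depth)]


def forward(x, blocks, skip=None):
    """Run the stack, optionally skipping one block; return the output and per-layer activations."""
    acts = []
    h = x
    for i, w in enumerate(blocks):
        if i != skip:
            h = h + np.tanh(h @ w)  # residual update
        acts.append(h)
    return h, acts


def ablation_agreement(x, blocks, head):
    """For each block, the fraction of inputs whose argmax prediction survives its removal."""
    base, _ = forward(x, blocks)
    base_pred = (base @ head).argmax(axis=1)
    scores = []
    for i in range(len(blocks)):
        out, _ = forward(x, blocks, skip=i)
        pred = (out @ head).argmax(axis=1)
        scores.append(float((pred == base_pred).mean()))
    return scores


def layer_correlation(acts):
    """Pearson correlation between the flattened activations of every pair of layers."""
    flat = np.stack([a.ravel() for a in acts])
    return np.corrcoef(flat)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    blocks = make_blocks(depth=6, width=32)
    x = rng.normal(size=(64, 32))     # 64 toy inputs
    head = rng.normal(size=(32, 10))  # hypothetical 10-class linear head
    print(ablation_agreement(x, blocks, head))
    _, acts = forward(x, blocks)
    print(np.round(layer_correlation(acts), 3))
```

On a real model, the same two measurements would use trained weights and held-out accuracy rather than agreement with the unablated prediction. The toy version only shows the mechanics of the ablation loop and the layer-by-layer correlation matrix.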
