When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
TL;DR
This work analyzes Vision Transformers (ViTs) and MLP-Mixers trained from scratch on ImageNet and finds that, without large-scale pre-training or strong data augmentations, they converge to extremely sharp local minima and suffer from optimization difficulties. It applies Sharpness-Aware Minimization (SAM), a recently proposed optimizer, to smooth the loss landscape, which improves generalization and robustness across supervised, adversarial, contrastive, and transfer learning. Empirically, SAM enables ViTs to outperform ResNets of similar size and throughput without pre-training, and it is accompanied by intrinsic changes such as sparser active neurons in the first few layers and higher weight norms. The findings suggest a data-efficient training path for convolution-free architectures and relate loss smoothness to architecture and training dynamics.
Abstract
Vision Transformers (ViTs) and MLPs signal further efforts in replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models with massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Loss-landscape visualization and Hessian analysis reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness is attributable to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. Model checkpoints are available at https://github.com/google-research/vision_transformer.
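To make the sharpness-aware update concrete, the following is a minimal sketch of one SAM step in plain Python. It is not the paper's training code: the two-dimensional quadratic loss, the function names `loss`, `grad`, and `sam_step`, and the values of `lr` and `rho` are all assumptions chosen for illustration. The sketch follows the usual two-pass form of SAM: first move the weights a distance `rho` along the normalized gradient to approximate the worst-case nearby point, then apply the gradient computed at that perturbed point to the original weights.

```python
import math

def loss(w):
    # Hypothetical toy loss with one sharp direction (curvature 100)
    # and one flat direction (curvature 1); not the paper's objective.
    return 0.5 * (100.0 * w[0] ** 2 + 1.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [100.0 * w[0], 1.0 * w[1]]

def sam_step(w, grad_fn, lr=0.005, rho=0.05):
    # Pass 1: gradient at the current weights.
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascent step of length rho along the normalized gradient, a first-order
    # approximation of the worst-case point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Pass 2: gradient at the perturbed weights.
    g_adv = grad_fn(w_adv)
    # Descent step applied to the ORIGINAL weights using the perturbed gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Run a few hundred SAM steps from an arbitrary starting point.
w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, grad)
```

Each step costs two gradient evaluations instead of one, which is the main overhead of this family of optimizers relative to plain gradient descent.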
