Structured Initialization for Vision Transformers
Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey
TL;DR
This work tackles the underperformance of Vision Transformers (ViTs) on small datasets by introducing a structured initialization that injects CNN-like locality into the attention mechanism without changing the ViT architecture. The method uses random impulse convolution filters as targets for the initial attention maps: the query and key weights (Q and K) are initialized so that the initial spatial mixing resembles a convolution, with a rank-based justification and a practical SVD-based solution. Empirically, the impulse initialization improves performance on small and medium datasets, remains competitive on large-scale data, and transfers to related architectures such as Swin Transformer and MLP-Mixer, while producing interpretable attention-map patterns. The approach also offers faster convergence and robustness to hyperparameter variations, may be useful for domain-specific tasks where data is limited, and preserves the flexibility of ViTs at scale.
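As a rough illustration of the idea above, the NumPy sketch below builds the token-mixing matrix that corresponds to a single impulse convolution filter, then factors a scaled copy of it into Q and K with an SVD so that softmax(Q K^T) approximately recovers that matrix. The function names (impulse_attention_map, factor_logits_svd), the logit scale, and the full-rank factorization are assumptions made for this example; the paper's actual procedure, including its rank argument, is not reproduced here.

```python
import numpy as np

def impulse_attention_map(h, w, dy, dx):
    """N x N mixing matrix (N = h*w) equivalent to a zero-padded conv filter
    whose only nonzero tap sits at offset (dy, dx) from the kernel center."""
    n = h * w
    attn = np.zeros((n, n))
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                # Output token (y, x) reads input token (y + dy, x + dx).
                attn[y * w + x, sy * w + sx] = 1.0
    return attn

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def factor_logits_svd(target, scale=10.0, rank=None):
    """Factor scale * target as Q @ K.T via SVD (illustrative assumption).
    With rank=None all singular values are kept, so the product is exact."""
    u, s, vt = np.linalg.svd(scale * target)
    r = len(s) if rank is None else rank
    root = np.sqrt(s[:r])
    q = u[:, :r] * root      # shape (N, r)
    k = vt[:r].T * root      # shape (N, r)
    return q, k

# Draw a random impulse position inside a 3 x 3 kernel for a 4 x 4 token grid.
rng = np.random.default_rng(0)
h, w, ksize = 4, 4, 3
dy, dx = (int(v) for v in rng.integers(-(ksize // 2), ksize // 2 + 1, size=2))
attn = impulse_attention_map(h, w, dy, dx)
q, k = factor_logits_svd(attn)
approx = softmax(q @ k.T)    # rows with an in-bounds target peak at that target
```

Rows whose shifted position falls outside the grid are all zero in the target, so softmax makes them uniform in this sketch. A head dimension smaller than N would require a truncated factorization (the rank argument of factor_logits_svd), which is no longer exact for a shift matrix.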
Abstract
Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into Vision Transformers (ViTs), not through an architectural intervention but solely through initialization. The motivation is a ViT that enjoys strong CNN-like performance when data is scarce, yet still scales to ViT-like performance as the data grows. Our approach is motivated by our empirical finding that random impulse filters can achieve performance commensurate with that of learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structure. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparable performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
