Convolutional Initialization for Data-Efficient Vision Transformers
Jianqiao Zheng, Xueqian Li, Simon Lucey
TL;DR
This work tackles the data-efficiency challenge of Vision Transformers (ViTs) on small datasets by reinterpreting the convolutional inductive bias as an initialization strategy. It introduces an impulse-filter-based initialization for self-attention that links random impulse convolution to SoftMax attention and preserves ViT flexibility without architectural changes. Through Model I–III variants and targeted pretraining of $\mathbf{Q}$ and $\mathbf{K}$, the approach yields data-efficient ViTs that outperform random and mimetic initialization baselines on CIFAR-10/100, SVHN, and Tiny-ImageNet, and that converge faster. The findings establish a theoretical bridge between CNN inductive bias and transformer initialization, offering a practical pathway to more accessible, data-efficient ViTs.
Abstract
Training vision transformer networks on small datasets poses challenges. In contrast, convolutional neural networks (CNNs) can achieve state-of-the-art performance by leveraging their architectural inductive bias. In this paper, we investigate whether this inductive bias can be reinterpreted as an initialization bias within a vision transformer network. Our approach is motivated by the finding that random impulse filters can achieve performance nearly comparable to that of learned filters in CNNs. We introduce a novel initialization strategy for transformer networks that can achieve performance comparable to CNNs on small datasets while preserving their architectural flexibility.
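To make the link between impulse filters and attention concrete, the following is a minimal NumPy sketch, not the initialization procedure of the paper. It assumes a single-channel image, a single attention head, and circular boundary handling (an assumption chosen here so that every token has exactly one source token). Under these assumptions, convolving with an impulse filter (a kernel with a single 1) is the same as multiplying the flattened tokens by a one-hot attention map, and a SoftMax over sufficiently scaled logits reproduces that map closely. The names `impulse_attention_map`, `softmax`, and `beta` are illustrative only.

```python
import numpy as np


def impulse_attention_map(height, width, dy, dx):
    """Return the N x N one-hot map A such that A @ x_flat shifts the image by (dy, dx).

    Circular boundary handling is an assumption of this sketch.
    """
    n = height * width
    attn = np.zeros((n, n))
    for i in range(height):
        for j in range(width):
            src = ((i + dy) % height) * width + (j + dx) % width
            attn[i * width + j, src] = 1.0
    return attn


def softmax(logits, axis=-1):
    """Numerically stable SoftMax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
H, W, k = 6, 6, 3

# Random impulse filter: a single 1 at a random position of a k x k kernel,
# expressed as an offset (dy, dx) relative to the kernel center.
pos = rng.integers(0, k, size=2)
dy, dx = int(pos[0]) - k // 2, int(pos[1]) - k // 2

image = rng.standard_normal((H, W))
A = impulse_attention_map(H, W, dy, dx)

# Filtering with the impulse (circular boundary) is just a spatial shift.
conv_out = np.roll(image, shift=(-dy, -dx), axis=(0, 1))

# The same result obtained by applying the attention map to the flattened tokens.
attn_out = (A @ image.reshape(-1)).reshape(H, W)

# SoftMax over scaled logits approaches the one-hot map as beta grows.
beta = 20.0
soft_A = softmax(beta * A)

print(np.allclose(conv_out, attn_out))
print(np.abs(soft_A - A).max())
```

In this sketch the attention map is written down directly. An initialization in the spirit of the abstract would instead have to choose query and key weights whose SoftMax attention approximates such a map, which the sketch does not attempt.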
