Convolutional Initialization for Data-Efficient Vision Transformers

Jianqiao Zheng; Xueqian Li; Simon Lucey

TL;DR

This work tackles the data-efficiency challenge of Vision Transformers on small datasets by reinterpreting convolutional inductive bias as an initialization strategy. It introduces impulse-filter based initialization for self-attention, linking random impulse convolution to SoftMax attention and preserving ViT flexibility without architectural changes. Through Model I–III variants and targeted pretraining of $\mathbf{Q}$ and $\mathbf{K}$, the approach yields data-efficient ViTs that outperform random and mimetic baselines across CIFAR-10/100, SVHN, and Tiny-ImageNet and converge faster. The findings illuminate a theoretical bridge between CNN inductive bias and transformer initialization, offering a practical pathway to more accessible, data-efficient ViTs.

Abstract

Training vision transformer networks on small datasets poses challenges. In contrast, convolutional neural networks (CNNs) can achieve state-of-the-art performance by leveraging their architectural inductive bias. In this paper, we investigate whether this inductive bias can be reinterpreted as an initialization bias within a vision transformer network. Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs. We introduce a novel initialization strategy for transformer networks that can achieve comparable performance to CNNs on small datasets while preserving its architectural flexibility.
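To make the notion of a random impulse filter concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds the spatial mixing (convolution) matrix of a filter that has a single one at a random position and zeros elsewhere. The function name `impulse_conv_matrix`, the default kernel size of 3, and the use of zero padding at the borders are all assumptions made for illustration.

```python
import numpy as np

def impulse_conv_matrix(height, width, kernel_size=3, rng=None):
    """Convolution matrix of a random impulse filter on a height x width grid.

    Hypothetical helper: kernel size and zero padding are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    half = kernel_size // 2
    # Position of the single non-zero tap, as an offset from the filter centre.
    dy, dx = (int(v) for v in rng.integers(-half, half + 1, size=2))
    n = height * width
    mat = np.zeros((n, n))
    for y in range(height):
        for x in range(width):
            sy, sx = y + dy, x + dx
            # Output location (y, x) copies input location (sy, sx);
            # locations that fall outside the grid stay zero (zero padding).
            if 0 <= sy < height and 0 <= sx < width:
                mat[y * width + x, sy * width + sx] = 1.0
    return mat, (dy, dx)

# Usage: applying the matrix to a flattened grid shifts it by (dy, dx).
grid = np.arange(20.0).reshape(4, 5)
mix, offset = impulse_conv_matrix(4, 5, rng=np.random.default_rng(0))
shifted = (mix @ grid.ravel()).reshape(4, 5)
```

Each row of the resulting matrix contains at most one non-zero entry, so under these assumptions an impulse filter acts as a pure spatial shift of its input.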

Paper Structure (14 sections, 11 equations, 9 figures, 8 tables)

Introduction
Related Work
Why Random Filters?
Method
Impulse Filter Initialized ViT
Experiments
Spatial Mixing of ConvMixer
Modified Simple ViT
Impulse Initialized Self-Attention
Limitations
Conclusion
Additional Context on ConvMixer
Additional Results of ViT
Visualization of different initialization strategies

Figures (9)

Figure 1: The architectures of ConvMixer trockman2022patches and Simple ViT vit_baseline are quite similar. Both consist of an input patch embedding, several layers of spatial mixing and channel mixing blocks, and then pooling before a fully connected classifier. Skip connections, BatchNorm or LayerNorm, and ReLU or GeLU are omitted from this figure for simplicity. The only difference is the structure of the spatial mixing matrix. In ConvMixer (upper) it takes convolutional form and each channel has a different filter, while in Simple ViT (lower) the channels are divided into heads, and within each head the spatial mixing matrix is shared and is computed as the SoftMax of the product of two low-rank matrices $\mathbf{Q}$ and $\mathbf{K}$.
Figure 2: Illustration of why random spatial convolution filters are effective.
Figure 3: Our proposed strategy to initialize the weights of $\mathbf{Q}$ and $\mathbf{K}$ in the self-attention of transformers. The pink cells indicate ones, while the gray cells are zeros. Typically, the weights $\mathbf{Q}$ and $\mathbf{K}$ are randomly initialized, and after SoftMax, the attention map becomes a random permutation matrix, as indicated by the yellow arrow. We propose to build convolution matrices of random impulse filters first and then initialize $\mathbf{Q}$ and $\mathbf{K}$ such that the initial attention map is a random impulse convolution filter, as shown by the purple arrow.
Figure 4: Performance comparison of models that use learned, random, impulse, box, and rand* (random permutation) filters. Note that only the "learned" model has the spatial mixing weights that are trained, while all the other models have fixed weights.
Figure 5: Illustration of how the scale $\sigma$ affects the attention map. In this $16 \,{\times}\, 16$ attention map example, $\mathbf{M} \,{=}\, \mathrm{SoftMax}(\sigma\mathbf{Q}\mathbf{K}^{T})$, where $\mathbf{Q}$ and $\mathbf{K}$ are randomly initialized. (a) $\sigma \,{=}\, 1.0$, (b) $\sigma \,{=}\, 10^{2}$ and (c) $\sigma \,{=}\, 10^{4}$. Larger values of $\sigma$ tend to produce binary attention maps, which brings the attention map close to a random permutation matrix. This helps train the initial attention map to resemble the impulse convolution filter. However, excessively large $\sigma$ values make the subsequent training much more difficult.
...and 4 more figures
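The effect of the scale $\sigma$ described in the Figure 5 caption can be reproduced with a short NumPy sketch. This is an illustration under stated assumptions rather than the paper's code: the head dimension of 8, the Gaussian initialization with standard deviation 0.1, and the helper names `softmax_rows` and `attention_map` are all hypothetical choices. Only the formula $\mathbf{M} = \mathrm{SoftMax}(\sigma\mathbf{Q}\mathbf{K}^{T})$, the $16 \times 16$ map size, and the three $\sigma$ values come from the caption.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise SoftMax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_map(q, k, sigma):
    """M = SoftMax(sigma * Q K^T), as in the Figure 5 caption."""
    return softmax_rows(sigma * (q @ k.T))

rng = np.random.default_rng(0)
# Assumed setup: 16 tokens, head dimension 8, small Gaussian weights.
q = 0.1 * rng.standard_normal((16, 8))
k = 0.1 * rng.standard_normal((16, 8))

# Average of the largest entry in each row: a value near 1/16 means the rows
# are almost uniform, a value near 1 means the rows are almost one-hot.
peak = {s: float(attention_map(q, k, s).max(axis=1).mean())
        for s in (1.0, 1e2, 1e4)}
for s, p in peak.items():
    print(f"sigma={s:g}: mean row maximum = {p:.3f}")
```

The mean row maximum grows monotonically with $\sigma$, since scaling the logits can only sharpen each row of the SoftMax. This is the sharpening behaviour the caption refers to when it says larger scales yield near-binary maps.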
