Structured Initialization for Vision Transformers

Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey

TL;DR

This work tackles the underperformance of Vision Transformers on small datasets by introducing a structured initialization that injects CNN-like locality into the attention mechanism without changing the ViT architecture. The method initializes the query and key weights (Q and K) so that each head's initial attention map matches a random impulse convolution filter, making the initial spatial mixing resemble a convolution; a rank-based argument justifies the choice, and an SVD-based procedure computes the weights in practice. Empirically, the impulse initialization improves performance on small and medium datasets, maintains competitive results on large-scale data, and transfers to related architectures such as Swin Transformer and MLP-Mixer, while providing interpretable attention-map patterns. The approach offers faster convergence and robustness to hyperparameter variations, with potential applicability to domain-specific tasks where data is limited, and it preserves the flexibility of ViTs at scale.
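To make the idea concrete, the following NumPy sketch builds an impulse-shaped target attention map and factors it into query and key weights with an SVD. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`impulse_attention_map`, `factor_qk`, `softmax_rows`), the random stand-in embeddings `x`, the logit scale of 10, the omission of the usual $1/\sqrt{d}$ scaling, and the pseudo-inverse step are all hypothetical choices made here for demonstration.

```python
import numpy as np

def impulse_attention_map(grid, dy, dx):
    """Return an N x N 0/1 map where token (r, c) attends only to (r + dy, c + dx)."""
    n = grid * grid
    attn = np.zeros((n, n))
    for r in range(grid):
        for c in range(grid):
            rr, cc = r + dy, c + dx
            if 0 <= rr < grid and 0 <= cc < grid:  # no wrap-around at the borders
                attn[r * grid + c, rr * grid + cc] = 1.0
    return attn

def factor_qk(x, target_logits, rank):
    """Find W_q, W_k (each D x rank) with (x W_q)(x W_k)^T close to target_logits."""
    x_pinv = np.linalg.pinv(x)                 # D x N pseudo-inverse of the embeddings
    b = x_pinv @ target_logits @ x_pinv.T      # D x D bilinear form playing the role of W_q W_k^T
    u, s, vt = np.linalg.svd(b)
    root = np.sqrt(s[:rank])                   # split singular values evenly between Q and K
    return u[:, :rank] * root, vt[:rank].T * root

def softmax_rows(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
grid, dim = 4, 32                              # assumed toy sizes: 16 tokens, 32 channels
x = rng.standard_normal((grid * grid, dim))    # stand-in for embeddings (assumption)
target = impulse_attention_map(grid, dy=0, dx=1)
w_q, w_k = factor_qk(x, 10.0 * target, rank=16)
attn = softmax_rows((x @ w_q) @ (x @ w_k).T)   # attention map implied by the factored weights
```

In this toy setting the embeddings have full row rank and `rank` is at least the rank of the target, so the factorization reproduces the target logits up to floating-point error. Tokens whose shifted neighbor falls outside the grid have an all-zero target row and therefore receive uniform attention after the softmax. With a smaller `rank`, the truncated SVD gives only a low-rank approximation.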

Abstract

Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data is scarce, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve performance commensurate with learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structure. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparable performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.

Paper Structure

This paper contains 32 sections, 4 theorems, 17 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

A ConvMixer block $\mathbf{T}$ consists of a spatial mixing layer $\mathbf{T}_{S}(\:\cdot\:;\mathbf{H})$ with convolution filters $\mathbf{H}$ and a channel mixing layer $\mathbf{T}_{C}$. $\mathbf{T}'$ is another ConvMixer block composed of $\mathbf{T}'_{S}(\:\cdot\:;\mathbf{H}')$ and $\mathbf{T}'_{

Figures (7)

  • Figure 1: Illustration of conventional generative initialization and structured initialization (ours) strategies for the weights $\mathbf{Q}$ and $\mathbf{K}$ of the attention map in transformers. Conventional generative initialization involves sampling $\mathbf{Q}$ and $\mathbf{K}$ from certain distributions, such as Gaussian or Uniform, resulting in unstructured attention maps. In contrast, our structured initialization imposes constraints on the structure of the initial attention maps, specifically requiring them to be random impulse filters. The initialization of $\mathbf{Q}$ and $\mathbf{K}$ is computed based on this requirement. Note that in both attention maps and random impulse filters, the pink cells indicate ones, while the gray cells represent zeros.
  • Figure 2: Illustration of why random spatial convolution filters are effective. Patch embeddings $\mathbf{X}\,{\in}\,\mathbb{R}^{N{\times}D}$ are typically rank-deficient and can be approximately decomposed into $k$ basis vectors. Meanwhile, a linear combination of $f^{2}$ linearly independent filters $\mathbf{h}$ can express any arbitrary filter in the filter space $\mathbb{R}^{f {\times} f}$. Based on these two observations, we derive the inequality $D\,{\geq}\,k f^2$ from Proposition 1.
  • Figure 3: Training curves of ViT-Tiny on CIFAR-10 and ViT-Base on ImageNet-1K using default, mimetic, and impulse initialization. The zoomed-in box shows the training curve in the final training stage from epoch 200 to epoch 300.
  • Figure 4: Visualization of attention maps in ViT-T using our, mimetic [trockman2023mimetic], and default [xu2024initializing] initializations. Red boxes highlight zoomed-in details of the $16\,{\times}\,16$ upper left corner in attention maps. White boxes indicate the $8\,{\times}\,8$ sub-blocks of the zoomed-in attention maps. Our structured initialization method offers distinct attention peaks aligned with the impulse structures across different heads. Head 1 offers a peak at $+1$ offset from the main diagonal. Head 2 offers a peak at $-8$ offset (equivalent to $-\text{image\_size}$) from the main diagonal. Both mimetic and random initialization methods initialize all the attention heads identically. Specifically, mimetic initialization primarily strengthens the main diagonal of the attention map for each head, while random initialization shows minimal structural patterns with flatter peak values.
  • Figure 5: Training curves of ViT-Base using default, mimetic, and impulse initialization under three different training configurations. The zoomed-in box shows the training curve in the final training stage from epoch 200 to epoch 300.
  • ...and 2 more figures
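
The diagonal offsets described for Figure 4 and the filter-basis observation in Figure 2 can be checked with a small NumPy sketch. It assumes an $8\,{\times}\,8$ token grid flattened in row-major order with no class token (as the $-8$ offset suggests) and a $3\,{\times}\,3$ filter size; the helper names `shift_matrix` and `diagonal_offset` are hypothetical and introduced only for illustration.

```python
import numpy as np

def shift_matrix(grid, dy, dx):
    """N x N 0/1 attention map for an impulse filter that shifts by (dy, dx) on a grid."""
    n = grid * grid
    a = np.zeros((n, n))
    for r in range(grid):
        for c in range(grid):
            rr, cc = r + dy, c + dx
            if 0 <= rr < grid and 0 <= cc < grid:  # border tokens have no target
                a[r * grid + c, rr * grid + cc] = 1.0
    return a

def diagonal_offset(grid, dy, dx):
    """Offset from the main diagonal at which the impulse appears after flattening."""
    return dy * grid + dx

grid = 8                                   # assumed token grid width
head1 = shift_matrix(grid, dy=0, dx=1)     # one step right -> offset +1
head2 = shift_matrix(grid, dy=-1, dx=0)    # one step up    -> offset -8

# Figure 2's second observation: the f*f impulse filters form a basis of R^{f x f}.
f = 3
basis = np.eye(f * f)                      # row i is the i-th flattened impulse filter
target = np.arange(f * f, dtype=float).reshape(f, f)   # arbitrary example filter
coeffs = target.ravel()                    # coefficients are just the filter entries
recon = (coeffs @ basis).reshape(f, f)     # linear combination of impulse filters
```

All nonzero entries of `head1` lie on the diagonal with offset `diagonal_offset(8, 0, 1)`, and those of `head2` on the diagonal with offset `diagonal_offset(8, -1, 0)`, which matches the $+1$ and $-8$ peaks described in the caption. The reconstruction step shows that any filter is recovered exactly as a weighted sum of impulse filters.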

Theorems & Definitions (6)

  • Remark 1
  • Definition 1
  • Proposition 1
  • Corollary 1
  • Corollary 2
  • Corollary 3