Structured Initialization for Attention in Vision Transformers

Jianqiao Zheng, Xueqian Li, Simon Lucey

TL;DR

This work tackles ViTs’ data-inefficiency on small datasets by recasting CNN inductive bias as a structured initialization for ViT attention. It introduces convolution-inspired impulse filters initialized through an optimization over a pseudo input, yielding an attention map initialization $M_{\text{init}}$ that approximates a convolutional impulse matrix $\mathbf{H}_{\text{impulse}}$ via $M_{\text{init}} = \mathrm{softmax}(\tilde{\mathbf{X}}\mathbf{Q}_{\text{init}}\mathbf{K}_{\text{init}}^{T}\tilde{\mathbf{X}}^{T}) \approx \mathbf{H}_{\text{impulse}}$. The optimization for $\mathbf{Q}_{\text{init}}$ and $\mathbf{K}_{\text{init}}$ uses gradient descent with a fixed pseudo input and MSE loss, serving as a fast surrogate for SVD and avoiding offline pretraining. Empirically, the impulse-initialized ViTs achieve state-of-the-art data-efficient performance on CIFAR-10/100 and SVHN, while maintaining strong performance on ImageNet-1K, and provide interpretable attention maps that resemble convolutional structure. This initialization preserves the flexibility of transformers for large-scale data and offers a principled alternative to mimetic or architectural approaches for injecting CNN-like inductive bias.
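
The optimization described above can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the helper names `impulse_matrix` and `fit_qk`, the random Gaussian pseudo input, the edge clamping at image borders, the single attention head, and all sizes and step sizes are assumptions made for this example. It fits $\mathbf{Q}_{\text{init}}$ and $\mathbf{K}_{\text{init}}$ by plain gradient descent on the MSE between $\mathrm{softmax}(\tilde{\mathbf{X}}\mathbf{Q}\mathbf{K}^{T}\tilde{\mathbf{X}}^{T})$ and an impulse matrix.

```python
import numpy as np


def impulse_matrix(grid, dy, dx):
    """N x N matrix (N = grid*grid) whose row i has a single 1 at the token
    shifted by (dy, dx) from token i. Shifts that leave the grid are clamped
    to the border (an assumption of this sketch) so every row sums to 1."""
    n = grid * grid
    H = np.zeros((n, n))
    for r in range(grid):
        for c in range(grid):
            rr = min(max(r + dy, 0), grid - 1)
            cc = min(max(c + dx, 0), grid - 1)
            H[r * grid + c, rr * grid + cc] = 1.0
    return H


def softmax_rows(S):
    """Numerically stable row-wise softmax."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)


def fit_qk(X, H, head_dim=16, steps=3000, lr=5.0, seed=0):
    """Gradient descent on MSE(softmax(X Q K^T X^T), H) with X held fixed.
    The step size looks large because the MSE averages over N*N entries."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    Q = 0.5 * rng.standard_normal((D, head_dim))
    K = 0.5 * rng.standard_normal((D, head_dim))
    losses = []
    for _ in range(steps):
        XQ, XK = X @ Q, X @ K
        A = softmax_rows(XQ @ XK.T)
        losses.append(float(np.mean((A - H) ** 2)))
        dA = 2.0 * (A - H) / A.size                            # dLoss/dA
        dS = A * (dA - (dA * A).sum(axis=1, keepdims=True))    # softmax backward
        grad_Q = X.T @ (dS @ XK)
        grad_K = X.T @ (dS.T @ XQ)
        Q -= lr * grad_Q
        K -= lr * grad_K
    A = softmax_rows((X @ Q) @ (X @ K).T)
    losses.append(float(np.mean((A - H) ** 2)))
    return Q, K, A, losses


# Hypothetical sizes: a 4x4 token grid and a 64-dimensional embedding.
grid, D = 4, 64
rng = np.random.default_rng(1)
X_pseudo = rng.standard_normal((grid * grid, D)) / np.sqrt(D)  # fixed pseudo input
H_impulse = impulse_matrix(grid, dy=0, dx=1)                   # "look one token right"
Q_init, K_init, M_init, losses = fit_qk(X_pseudo, H_impulse)
print("MSE before:", losses[0], "after:", losses[-1])
```

In a multi-head setting, one would presumably repeat this per head with a different impulse offset for each; how offsets are assigned to heads is not specified in this summary.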

Abstract

The training of vision transformer (ViT) networks on small-scale datasets poses a significant challenge. By contrast, convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on such problems. In this paper, we argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT. This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications. Our inspiration for this "structured" initialization stems from our empirical observation that random impulse filters can achieve comparable performance to learned filters within CNNs. Our approach achieves state-of-the-art performance for data-efficient ViT learning across numerous benchmarks including CIFAR-10, CIFAR-100, and SVHN.

Paper Structure (20 sections, 1 theorem, 12 equations, 5 figures, 8 tables, 1 algorithm)

Key Result

Proposition 1

In a ConvMixer block composed of a spatial mixing layer and a channel mixing layer, suppose $D$ is the number of channels, $k$ is the rank of the input $\mathbf{X}$, and $f{\times}f$ is the number of filters in the convolution filter basis. Then any possible output for $f{\times}f$ filters can be achieved by only l

Figures (5)

  • Figure 1: Illustration comparing conventional generative initialization with our structured initialization strategy for the weights $\mathbf{Q}$ and $\mathbf{K}$ of the self-attention in transformers. Conventional generative initialization samples the parameters $\mathbf{Q}$ and $\mathbf{K}$ from a fixed distribution, such as Gaussian or uniform, resulting in unstructured initial attention maps. In contrast, our structured initialization strategy constrains the initial structure of the attention maps, specifically requiring them to be random impulse filters. The initial values of $\mathbf{Q}$ and $\mathbf{K}$ are then computed from this requirement on the attention maps. Note that in both attention maps and random impulse filters, the pink cells indicate ones, while the gray cells represent zeros.
  • Figure 2: Illustration of why random spatial convolution filters are effective. Patch embeddings $\mathbf{X}\,{\in}\,\mathbb{R}^{N{\times}D}$ are typically rank-deficient and can be approximately decomposed into $k$ basis vectors. Meanwhile, a linear combination of $f^{2}$ linearly independent filters $\mathbf{h}$ can express any arbitrary filter in the filter space $\mathbb{R}^{f {\times} f}$. Based on these two observations, we derive the inequality $D\,{\geq}\,k f^2$ in Proposition 1.
  • Figure 3: Visualization of attention maps in ViT-T using our impulse initialization method, mimetic [trockman2023mimetic], and random [liu2022convnet] initializations. Red boxes highlight zoomed-in details of the $48{\times}48$ upper left corner in attention maps. White boxes indicate the main diagonal blocks of the zoomed-in attention maps. Our structured initialization method offers off-diagonal attention peaks aligned with the impulse structures, whereas mimetic initialization primarily strengthens the main diagonal of the attention map. Random initialization shows little to no pattern.
  • Figure 4: Visualization of attention maps in ViT-T using our impulse initialization method, mimetic [trockman2023mimetic], and random [liu2022convnet] initializations. Red boxes highlight zoomed-in details of the $48{\times}48$ upper left corner in attention maps. White boxes indicate the main diagonal blocks of the zoomed-in attention maps. Our structured initialization method offers off-diagonal attention peaks aligned with the impulse structures, whereas mimetic initialization primarily strengthens the main diagonal of the attention map. Random initialization shows little to no pattern.
  • Figure 5: Visualization of attention maps in ViT-T using our impulse initialization method, mimetic [trockman2023mimetic], and random [liu2022convnet] initializations. Red boxes highlight zoomed-in details of the $48{\times}48$ upper left corner in attention maps. White boxes indicate the main diagonal blocks of the zoomed-in attention maps. Our structured initialization method offers off-diagonal attention peaks aligned with the impulse structures, whereas mimetic initialization primarily strengthens the main diagonal of the attention map. Random initialization shows little to no pattern.
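
The second observation in the Figure 2 caption, that $f^{2}$ linearly independent filters span the filter space $\mathbb{R}^{f {\times} f}$, can be checked numerically. The sketch below is only an illustration under stated assumptions: the helpers `correlate2d_same` and `impulse_basis`, the zero padding, and the $8{\times}8$ random image are hypothetical choices for this example and are not taken from the paper. It expresses an arbitrary $3{\times}3$ filter in the impulse basis and confirms that filtering with it equals the same weighted sum of impulse-filtered outputs.

```python
import numpy as np


def correlate2d_same(img, filt):
    """Sliding-window cross-correlation with zero padding, same output size."""
    f = filt.shape[0]
    pad = f // 2
    padded = np.pad(img, pad)  # zero padding (assumption of this sketch)
    out = np.empty(img.shape, dtype=float)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.sum(padded[r:r + f, c:c + f] * filt)
    return out


def impulse_basis(f):
    """The f*f impulse filters: each has a single 1 and zeros elsewhere."""
    basis = []
    for i in range(f):
        for j in range(f):
            h = np.zeros((f, f))
            h[i, j] = 1.0
            basis.append(h)
    return basis


rng = np.random.default_rng(0)
f = 3
img = rng.standard_normal((8, 8))        # stand-in for one feature channel
target = rng.standard_normal((f, f))     # an arbitrary filter to reproduce

basis = impulse_basis(f)
B = np.stack([h.ravel() for h in basis], axis=1)   # columns are basis filters
coeffs = np.linalg.solve(B, target.ravel())        # coordinates in the basis

direct = correlate2d_same(img, target)
mixed = sum(c * correlate2d_same(img, h) for c, h in zip(coeffs, basis))

print("basis rank:", np.linalg.matrix_rank(B))
print("outputs match:", np.allclose(direct, mixed))
```

Because the filtering operation is linear in the filter, the mixing coefficients could equally be applied after the spatial filtering, which is the role a channel mixing layer would play in this picture.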

Theorems & Definitions (2)

  • Remark
  • Proposition 1