LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate

Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green

TL;DR

This work proposes LookHere, a drop-in replacement for the position encoding of plain ViTs. LookHere uses 2D attention masks to restrict attention heads to fixed fields of view pointed in different directions, which provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating.

Abstract

High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning: ViTs extrapolate poorly to more patches at test time, even though transformers offer sequence length flexibility. We attribute this shortcoming to the current patch position encoding methods, which create a distribution shift when extrapolating. We propose a drop-in replacement for the position encoding of plain ViTs that restricts attention heads to fixed fields of view, pointed in different directions, using 2D attention masks. Our novel method, called LookHere, provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating. We demonstrate that, on ImageNet without extrapolation, LookHere improves performance on classification (avg. 1.6%) and against adversarial attack (avg. 5.4%), and decreases calibration error (avg. 1.5%). With extrapolation, LookHere outperforms the current SoTA position encoding method, 2D-RoPE, by 21.7% on ImageNet when trained at $224^2$ px and tested at $1024^2$ px. Additionally, we release a high-resolution test set, called ImageNet-HR, to improve the evaluation of high-resolution image classifiers.
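
To make "more patches at test time" concrete, the following minimal sketch counts the patch tokens a ViT sees at the training and test resolutions quoted above. It assumes square images, non-overlapping patches, and a patch size of 16 px (the ViT-B/16 setting of Figure 1); the function name `num_patches` is illustrative and does not come from the paper.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size  # patches along one image side
    return side * side


train_tokens = num_patches(224)   # 14 x 14 patch grid
test_tokens = num_patches(1024)   # 64 x 64 patch grid
print(train_tokens, test_tokens, test_tokens / train_tokens)
```

Under these assumptions the sequence grows from 196 to 4096 patch tokens, roughly 21 times longer than anything seen in training, which is the extrapolation gap the position encoding has to bridge.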

Paper Structure (23 sections, 1 equation, 35 figures, 13 tables)

This paper contains 23 sections, 1 equation, 35 figures, 13 tables.

Figures (35)

  • Figure 1: ViT-B/$16$ models trained for $150$ epochs on ImageNet at $224^2$ px and tested up to $1024^2$ px. Model architectures are consistent between runs other than position encoding methods. We perform an $8$-run hyperparameter sweep per method to ensure fair comparisons. Our three LookHere variants improve extrapolation ability, with narrower fields of view performing best at $1024^2$ px.
  • Figure 2: LookHere masks and biases (center) the learned attention matrix (left, where colors are random). Masked cells are black, encoding directions ($\rightarrow$ with a $90^\circ$ FOV); biased cells are shaded bluish-green, encoding relative patch distances. (Right) An example of the FOV of the center query patch. The final attention matrix is computed as $\mathcal{A}^l = \texttt{softmax}(\mathcal{A}_{\text{LRN}}^l - \mathcal{A}_{\text{FIX}}^l)$, at each layer $l$.
  • Figure 3: Images of three classes from ImageNet-HR. (Bottom left is Anthony's niece Addison.)
  • Figure 4: LookHere learns more diverse attention heads and prevents attention collapse. Legend follows the main figure (fig:main_figure) and the object-size figure (fig:object_size).
  • Figure 5: We apply frozen MLP classifying heads (learned on the CLS token) on frozen patch representations. We visualize ImageNet class predictions: assault rifle (red), bulletproof vest (green), crash helmet (blue), and holster (white). In parentheses, we show mIoU results (@224px) on ImageNet-S (Gao et al., 2022), where we apply this technique to segment images without training.
  • ...and 30 more figures
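
The Figure 2 caption describes a fixed matrix $\mathcal{A}_{\text{FIX}}$ that masks keys outside a head's field of view and biases the remaining keys by patch distance, then subtracts it from the learned logits before the softmax. The NumPy sketch below illustrates that idea for a single head. It is not the authors' implementation: the Euclidean distance, the `slope` value, the angle convention (0 degrees points right), the small boundary tolerance, and the choice to always let a query see itself are assumptions made for illustration, and `fixed_matrix` and `directed_attention` are hypothetical names.

```python
import numpy as np


def fixed_matrix(grid: int, direction_deg: float, fov_deg: float, slope: float) -> np.ndarray:
    """A_FIX for one head: +inf outside the field of view, slope * distance inside."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)  # (row, column) of each patch
    dy = ys[None, :] - ys[:, None]                    # key row minus query row
    dx = xs[None, :] - xs[:, None]                    # key column minus query column
    dist = np.hypot(dx, dy)                           # assumed Euclidean patch distance
    # Angle of each key as seen from the query; image rows grow downward, so flip dy.
    angle = np.degrees(np.arctan2(-dy, dx))
    diff = (angle - direction_deg + 180.0) % 360.0 - 180.0  # wrapped to [-180, 180)
    # Assumption: a query always sees itself, so every row keeps one finite logit.
    visible = (np.abs(diff) <= fov_deg / 2.0 + 1e-6) | (dist == 0)
    return np.where(visible, slope * dist, np.inf)


def directed_attention(a_lrn: np.ndarray, a_fix: np.ndarray) -> np.ndarray:
    """softmax(A_LRN - A_FIX) over keys, as in the Figure 2 caption."""
    logits = a_lrn - a_fix                            # -inf where masked
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)                          # exp(-inf) = 0 for masked keys
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
grid = 5
a_lrn = rng.normal(size=(grid * grid, grid * grid))   # stand-in for learned logits
a_fix = fixed_matrix(grid, direction_deg=0.0, fov_deg=90.0, slope=0.5)
attn = directed_attention(a_lrn, a_fix)

center = (grid // 2) * grid + grid // 2
print(attn[center].reshape(grid, grid).round(2))      # attention of the center query
```

Because `a_fix` depends only on the offset between query and key patches, the same pattern applies to every query regardless of its absolute position, which is one way to read the translation-equivariance claim. The same function can also be called with a larger `grid` at test time without any new parameters.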