On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

TL;DR

The features and representations learned during pre-training are not essential: using only the attention patterns from pre-training is sufficient for a model to learn high-quality features from scratch and reach comparable downstream performance.

Abstract

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
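
The "copying" variant described above can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the function names, the toy dimensions, and the random weights are all assumptions made for illustration. The point is only that the student computes its own values while the token-mixing weights come from the teacher.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    """Row-stochastic attention map of shape (tokens, tokens)."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def student_layer_with_copied_attention(x_student, teacher_attn, w_v):
    """Hypothetical student layer: mixes the student's own values
    using the teacher's attention map instead of its own Q/K product."""
    return teacher_attn @ (x_student @ w_v)

rng = np.random.default_rng(0)
n_tokens, dim = 5, 8  # toy sizes, chosen arbitrarily

# Layer inputs for the same image inside the teacher and the student.
x_teacher = rng.normal(size=(n_tokens, dim))
x_student = rng.normal(size=(n_tokens, dim))

# Teacher Q/K weights stand in for pre-trained weights (assumption).
t_wq = rng.normal(size=(dim, dim))
t_wk = rng.normal(size=(dim, dim))
# The student's value projection is randomly initialized and would be trained.
s_wv = rng.normal(size=(dim, dim))

teacher_attn = attention_map(x_teacher, t_wq, t_wk)
out = student_layer_with_copied_attention(x_student, teacher_attn, s_wv)
```

In this sketch the student has no query or key projections at all, which is one way to read "fully decoupling" inter-token flow (teacher) from feature learning (student). It also shows why the teacher must still run a forward pass to supply the maps.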

Paper Structure

This paper contains 41 sections, 3 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: Using only attention is sufficient for full performance. By copying the attention maps (top) from an MAE (He et al., 2022) pre-trained ViT-L (Dosovitskiy et al., 2020), a ViT-L can reach a top-1 accuracy of 85.1 on ImageNet-1K (Deng et al., 2009), recovering 77.8% of the gap (2.1 of 2.7 points) between no transfer (training from scratch, 83.0) and full transfer (fine-tuning all the weights, 85.7). Distilling attention maps (bottom) can even fully match MAE weight tuning while only transferring the inter-token flow.
  • Figure 2: Two types of attention transfer for Vision Transformers. Attention Copy (left): We simply "copy-and-paste" the attention maps from a pre-trained teacher model to a randomly initialized student. The other weights of the student are then trained via supervised learning. This fully decouples inter-token learning (from the teacher) from intra-token learning (in the student), but is less practical. Attention Distillation (right): The student computes its own attention maps, with an additional cross-entropy loss to distill patterns from the teacher during training. The teacher is no longer used during inference. $H$: number of heads; $L$: number of Transformer layers.
  • Figure 3: Copy a subset of layers. By default, all 24 ViT-L layers are transferred. Here we transfer only a subset and find that transferring more layers always helps, and that attention maps from top layers are more beneficial than those from bottom layers.
  • Figure 4: Copy a subset of heads. The pre-trained ViT-L has 16 heads in each MSA block. By default, all of them are transferred. Here we transfer only a subset and find that transferring more heads helps in general, but performance saturates at 12 heads.
  • Figure 5: CKA representation similarity to the fine-tuned model. We use CKA (Kornblith et al., 2019) to measure the layer-wise similarity between the representations learned by different models and those of the fine-tuned MAE model. Higher means more similar. We find that attention transfer methods are quite dissimilar to the fine-tuned model, with roughly the same CKA as an independent scratch model.
  • ...and 7 more figures
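
The attention distillation objective in the Figure 2 caption (a cross-entropy loss between teacher and student attention maps) can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name, the plain mean over layers, heads, and query tokens, and the toy sizes are assumptions, since the captions here do not specify the reduction or any loss weighting.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_distillation_loss(student_logits, teacher_attn, eps=1e-12):
    """Cross-entropy between teacher and student attention rows.

    student_logits: (L, H, N, N) pre-softmax attention scores of the student.
    teacher_attn:   (L, H, N, N) teacher attention maps (rows sum to 1).
    Averaging over layers, heads, and query tokens is an assumption.
    """
    student_attn = softmax(student_logits, axis=-1)
    # One cross-entropy value per (layer, head, query token).
    ce = -(teacher_attn * np.log(student_attn + eps)).sum(axis=-1)
    return ce.mean()

rng = np.random.default_rng(0)
n_layers, n_heads, n_tokens = 2, 4, 6  # toy sizes, chosen arbitrarily

teacher_logits = rng.normal(size=(n_layers, n_heads, n_tokens, n_tokens))
student_logits = rng.normal(size=(n_layers, n_heads, n_tokens, n_tokens))
teacher_attn = softmax(teacher_logits)

# A randomly initialized student versus one that reproduces the teacher.
loss_random = attention_distillation_loss(student_logits, teacher_attn)
loss_matched = attention_distillation_loss(teacher_logits, teacher_attn)
```

The loss is smallest when the student's maps equal the teacher's, where it reduces to the entropy of the teacher's attention rows. During training this term would be combined with the supervised loss. Because the student produces its own maps, the teacher can be dropped at inference, as the caption notes.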