Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

TL;DR

This work tackles ViT data-efficiency by directly refining self-attention maps through attention expansion and distributed local attention, combining global and local context within a simple, drop-in module. The Refiner yields consistent improvements on ImageNet and GLUE, achieving near-state-of-the-art results with under 100M parameters and even enabling competitive NLP gains (e.g., BERT-small on GLUE). It demonstrates that diversifying and localizing attention can accelerate feature evolution and enhance discriminability, offering a practical route to more data-efficient transformers. Additionally, receptive-field calibration presents a generic, lightweight boost for both CNNs and ViTs, highlighting the importance of alignment between input regions and model receptive fields.

Abstract

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
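
The equivalence claimed in the abstract can be made concrete with a short derivation. The notation here is an assumption for illustration and is not taken from the paper: a single head with attention map $A \in \mathbb{R}^{N \times N}$, value features $V \in \mathbb{R}^{N \times d}$ with rows $V_j$, a $k \times k$ kernel $w$ with offsets $a, b \in \{-p, \dots, p\}$ and $p = (k-1)/2$, and zero padding (entries of $A$ and rows of $V$ outside the range $1, \dots, N$ are treated as zero).

$$
\tilde{A}_{i,j} = \sum_{a=-p}^{p} \sum_{b=-p}^{p} w_{a,b}\, A_{i+a,\, j+b}
$$

$$
O_i = \sum_{j=1}^{N} \tilde{A}_{i,j}\, V_j
    = \sum_{a=-p}^{p} \sum_{j'=1}^{N} A_{i+a,\, j'} \underbrace{\sum_{b=-p}^{p} w_{a,b}\, V_{j'-b}}_{\hat{V}^{(a)}_{j'}}
$$

The second equality substitutes $j' = j + b$. Under these assumptions, each $\hat{V}^{(a)}_{j'}$ is a local aggregation of neighbouring value features with learnable kernel weights, and the outer sums aggregate those locally mixed features globally using rows of the original attention map, which is the "local, then global" reading given in the abstract.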

Paper Structure

This paper contains 45 sections, 4 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Illustration of our motivation. (a) The input image is regularly partitioned into patches for patch embedding. (b) The token-wise attention maps from the vanilla self-attention of ViTs tend to be uniform, so they aggregate all the patch embeddings densely and generate overly similar tokens. (c) In contrast, our proposed refiner augments the attention maps into diverse ones with enhanced local patterns, such that they aggregate the token features more selectively and the resulting tokens are distinguishable from each other.
  • Figure 2: The features of ViT evolve more slowly than those of ResNet (he2016deep) and DeiT (touvron2020training) across the model blocks.
  • Figure 3: (a) Architecture design of refiner. Different from the vanilla self-attention block, the refiner applies linear attention expansion to the attention maps output by the softmax operation to increase their number. Then head-wise spatial convolution is applied to augment these expanded attention maps. Finally, another linear projection reduces the number of attention maps back to the original number. Note that $r = H'/H$ is the expansion ratio. (b) Modified transformer block with refiner as a drop-in component.
  • Figure 4: Refiner accelerates feature evolution compared with CNNs, the vanilla ViT, and DeiT trained with a more complex scheme.
  • Figure 5: Compared with the attention matrices $A$ from the vanilla SA (top), refiner (bottom) strengthens the local patterns of the attention maps in deeper blocks, making them less uniform and better at modeling local context.
  • ...and 1 more figure
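
The three steps described in the Figure 3 caption (expand, convolve head-wise, reduce) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mix_heads`, `conv2d_same`, `refine`), the zero padding, the random weights, and the toy sizes are all hypothetical, and the convolution is assumed to slide over each $N \times N$ token-by-token map.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_heads(maps, weight):
    """Linearly mix attention maps across the head axis.

    maps: list of H_in maps, each N x N; weight: H_out x H_in matrix.
    Returns a list of H_out maps, each N x N.
    """
    n = len(maps[0])
    return [
        [[sum(w * m[i][j] for w, m in zip(w_row, maps)) for j in range(n)]
         for i in range(n)]
        for w_row in weight
    ]


def conv2d_same(amap, kernel):
    """Zero-padded 2-D cross-correlation of one N x N map with a k x k kernel (k odd)."""
    n, k = len(amap), len(kernel)
    pad = k // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < n and 0 <= jj < n:
                        acc += kernel[di][dj] * amap[ii][jj]
            out[i][j] = acc
    return out


def refine(attn, w_expand, kernels, w_reduce):
    """Expand H maps to H' = r * H, convolve each map separately, reduce back to H."""
    expanded = mix_heads(attn, w_expand)                                 # H -> H'
    augmented = [conv2d_same(m, k) for m, k in zip(expanded, kernels)]   # head-wise conv
    return mix_heads(augmented, w_reduce)                                # H' -> H


# Toy example (hypothetical sizes): H = 2 heads, N = 4 tokens, r = 3, 3 x 3 kernels.
random.seed(0)
H, N, r, k = 2, 4, 3, 3
attn = [[softmax([random.gauss(0, 1) for _ in range(N)]) for _ in range(N)]
        for _ in range(H)]
w_expand = [[random.gauss(0, 1) for _ in range(H)] for _ in range(r * H)]
kernels = [[[random.gauss(0, 1) for _ in range(k)] for _ in range(k)]
           for _ in range(r * H)]
w_reduce = [[random.gauss(0, 1) for _ in range(r * H)] for _ in range(H)]

refined = refine(attn, w_expand, kernels, w_reduce)
print(len(refined), len(refined[0]), len(refined[0][0]))  # same H x N x N layout as the input
```

Because the refined maps keep the input's $H \times N \times N$ layout, they could in principle be multiplied with the value features exactly where the original maps were, which is consistent with the caption's description of refiner as a drop-in component. A practical version would use a tensor library and learned parameters rather than nested lists and random weights.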