Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

Tianxiao Zhang, Bo Luo, Guanghui Wang

TL;DR

This work addresses the limited exchange of information between heads in Vision Transformers by introducing Multi-Overlapped-Head Self-Attention (MOHSA), which overlaps the $Q$, $K$, and $V$ of each head with those of its neighboring heads during attention and zero-pads the first and last heads. The method constructs $Q_i'$, $K_i'$, and $V_i'$ by concatenating each head with parts of its adjacent heads, applies attention per head, and uses a projection $W'$ to restore the original token dimension; several overlap-ratio schemes are explored to balance accuracy against overhead. Empirically, MOHSA yields consistent improvements across ViT, CaiT, and Swin-Tiny models on CIFAR-10/100, Tiny-ImageNet, and ImageNet, with notable gains on CaiT variants (e.g., up to +5% on CIFAR-100) and relatively modest overhead. The findings suggest that MOHSA is a versatile plug-in enhancement for Vision Transformers, particularly where exchanging information between heads is beneficial, and they motivate further study of head-interaction architectures in deep vision models.

Abstract

Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key factor in this success is the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where the queries, keys, and values of each head are overlapped with those of its two adjacent heads, while zero-padding is employed for the first and last heads, which have only one neighboring head. Various paradigms for overlapping ratios are proposed to investigate which setting performs best. The proposed approach is evaluated using five Transformer models on four benchmark datasets and yields a significant performance boost. The source code will be made publicly available upon publication.
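
The overlapping step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`overlap_split`, `mohsa`), the weight arguments, the single-image (unbatched) layout, and the choice to scale the attention logits by the overlapped head width are all assumptions made for illustration. Each head keeps its own channels and borrows `overlap` channels from the neighbor on each side, with zeros standing in for the missing neighbor of the first and last heads.

```python
import numpy as np


def overlap_split(x, num_heads, overlap):
    """Split (tokens, dim) features into overlapped heads.

    Returns an array of shape (num_heads, tokens, head_dim + 2 * overlap).
    Hypothetical helper; names and layout are illustrative assumptions.
    """
    tokens, dim = x.shape
    head_dim = dim // num_heads
    # Zero-pad the channel axis so the first and last heads see zeros
    # on the side where no neighboring head exists.
    padded = np.pad(x, ((0, 0), (overlap, overlap)))
    heads = [
        padded[:, i * head_dim : (i + 1) * head_dim + 2 * overlap]
        for i in range(num_heads)
    ]
    return np.stack(heads)


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mohsa(x, w_q, w_k, w_v, w_proj, num_heads, overlap):
    """Overlapped-head self-attention for one sequence of tokens.

    w_q, w_k, w_v: (dim, dim); w_proj: (num_heads * (head_dim + 2 * overlap), dim).
    """
    q = overlap_split(x @ w_q, num_heads, overlap)
    k = overlap_split(x @ w_k, num_heads, overlap)
    v = overlap_split(x @ w_v, num_heads, overlap)
    # Assumption: scale by the overlapped head width, as in standard attention.
    scale = q.shape[-1] ** -0.5
    attn = softmax(q @ k.transpose(0, 2, 1) * scale)  # (heads, tokens, tokens)
    out = attn @ v                                    # (heads, tokens, width)
    # Concatenate the heads, then project back to the original token dimension.
    out = out.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return out @ w_proj
```

With `overlap = 0` the split reduces to the usual hard division of heads, so the sketch then behaves like ordinary MHSA, and the projection goes back to its usual `(dim, dim)` shape.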

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The proposed multi-overlapped-head method (blue) vs the original multi-head method (green). Instead of hard division of the heads, our approach softly splits the heads by overlapping each head with its neighboring heads.
  • Figure 2: The Transformer encoder of a typical Transformer model. The Transformer encoder is used in the Vision Transformer for image classification. $N$ indicates the number of layers in the Transformer encoder.
  • Figure 3: Our proposed Multi-Overlapped-Head Self-Attention. MHSA represents the original implementation of Multi-Head Self-Attention with hard division of heads, and MOHSA indicates our proposed Multi-Overlapped-Head Self-Attention with soft division of heads. In the original Vision Transformer (left), $Q$, $K$, and $V$ are split across heads and the attention is computed for each head independently. To exchange information between heads when the attention is calculated, we propose to overlap the $Q$, $K$, and $V$ of each head with those of its adjacent heads (right). Since overlapping the heads slightly increases the number of dimensions, the projection matrix maps the concatenated heads back to the original token dimension.
  • Figure 4: Illustration of the overlap dimensions. The blue parts show the original non-overlapping heads and the red parts indicate the overlapped parts from adjacent heads. The number of overlap dimensions refers to the dimensions taken from the adjacent head on one side.
  • Figure 5: Accuracy comparison during training. The accuracy of the original method and of our proposed approach for CaiT-xxs24 on the validation or test set of CIFAR-100, Tiny-ImageNet, and ImageNet is shown from left to right. The blue curve shows our approach with the best performance and the red curve indicates the original method.
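
The dimension bookkeeping implied by Figures 3 and 4 can be worked through with a small Python sketch. The helper `mohsa_dims` and the example values are hypothetical, and the sketch assumes every head is extended by the same number of overlap dimensions on each side (with zero padding for the end heads), so all heads end up with the same width.

```python
def mohsa_dims(dim, num_heads, overlap):
    """Return head widths and projection sizes for a given one-side overlap.

    Hypothetical helper for illustration; `overlap` is the number of
    dimensions borrowed from the adjacent head on one side.
    """
    head_dim = dim // num_heads                 # width of a non-overlapping head
    overlapped_dim = head_dim + 2 * overlap     # own channels plus both sides
    concat_dim = num_heads * overlapped_dim     # width after concatenating heads
    return {
        "head_dim": head_dim,
        "overlapped_dim": overlapped_dim,
        "concat_dim": concat_dim,
        "proj_weights_mhsa": dim * dim,          # original projection (no bias)
        "proj_weights_mohsa": concat_dim * dim,  # enlarged projection (no bias)
    }


# Illustrative values only: a 192-dimensional token split into 4 heads,
# with 1 overlap dimension per side.
example = mohsa_dims(dim=192, num_heads=4, overlap=1)
```

Under these assumptions the only layer whose size changes is the output projection, which grows by `2 * overlap * num_heads * dim` weights; this matches the captions' remark that the overlap adds only a small number of dimensions.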