Towards Better Multi-head Attention via Channel-wise Sample Permutation

Shen Yuan, Hongteng Xu

TL;DR

A simple and novel channel-wise sample permutation (CSP) operator is proposed, achieving a new structured MHA with fewer parameters and lower complexity than the classic Transformer and its state-of-the-art variants.

Abstract

The Transformer plays a central role in many fundamental deep learning models, e.g., ViT in computer vision and BERT and GPT in natural language processing, and its effectiveness is mainly attributed to its multi-head attention (MHA) mechanism. In this study, we propose a simple and novel channel-wise sample permutation (CSP) operator that yields a new structured MHA with fewer parameters and lower complexity. Given an input matrix, CSP circularly shifts the samples of different channels by different step sizes and then sorts grouped samples within each channel. This operator is equivalent to implicitly implementing cross-channel attention maps as permutation matrices, which achieves linear complexity and suppresses the risk of rank collapse when representing data. We replace the MHA of some representative models with CSP and test the CSP-based models on several discriminative tasks, including image classification and long sequence analysis. Experiments show that the CSP-based models achieve comparable or better performance with fewer parameters and lower computational costs than the classic Transformer and its state-of-the-art variants. The code is available at https://github.com/DaShenZi721/CSP.
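The two steps named in the abstract (a per-channel circular shift followed by sorting within groups of samples) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `csp_sketch`, the `shifts` and `group_size` parameters, the contiguous grouping, and the ascending sort order are all assumptions made for illustration; the abstract does not specify how shift steps or groups are chosen.

```python
import numpy as np

def csp_sketch(x, shifts, group_size):
    """Illustrative channel-wise sample permutation (hypothetical helper).

    x          : array of shape (N, C), N samples and C channels.
    shifts     : length-C sequence of integer shift steps (assumed given).
    group_size : number of consecutive samples sorted together (assumed
                 to divide N).
    """
    n, c = x.shape
    if len(shifts) != c:
        raise ValueError("need one shift step per channel")
    if n % group_size != 0:
        raise ValueError("group_size must divide the number of samples")

    # Step 1: circularly shift the samples of each channel by its own step.
    shifted = np.empty_like(x)
    for ch in range(c):
        shifted[:, ch] = np.roll(x[:, ch], shifts[ch])

    # Step 2: sort the samples inside each contiguous group, per channel.
    grouped = shifted.reshape(n // group_size, group_size, c)
    return np.sort(grouped, axis=1).reshape(n, c)

# Toy input: 6 samples, 2 channels.
x = np.arange(12, dtype=float).reshape(6, 2)
y = csp_sketch(x, shifts=[0, 2], group_size=3)

# Each output channel is a reordering of the same input channel, i.e. the
# operator acts on every channel as a permutation of its samples.
same_values = np.array_equal(np.sort(y, axis=0), np.sort(x, axis=0))
```

Under these assumptions every channel is only reordered, never mixed or averaged, which is the sense in which the operator behaves like a permutation matrix applied per channel. With a fixed group size, the cost of the shift and the grouped sort grows linearly with the number of samples $N$.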

Paper Structure

This paper contains 17 sections, 2 theorems, 20 equations, 4 figures, 7 tables.

Key Result

Theorem 1

Suppose that we construct an $L$-layer network as $(f\circ\text{CSP})^L=(f_{\lambda_L}\circ\text{CSP}_{\bm{W}^{(L)}})\circ\cdots\circ(f_{\lambda_1}\circ\text{CSP}_{\bm{W}^{(1)}})$. For $\ell=1,\ldots,L$, $\text{CSP}_{\bm{W}^{(\ell)}}$ is a $C$-channel CSP operator, and $f_{\lambda_{\ell}}:\mathbb{R}^{C}

Figures (4)

  • Figure 1: An illustration of the proposed channel-wise sample permutation operator and the equivalent implicit cross-channel attention maps.
  • Figure 2: The shifting strategies when $N\approx C$ and $N\gg C$.
  • Figure 3: The singular spectra of the output matrices achieved on ImageNet-1k.
  • Figure 4: The performance and efficiency of various models on the LRA benchmark. The disk area indicates the memory cost of each method.

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2