SimA: Simple Softmax-free Attention for Vision Transformers

Soroush Abbasi Koohpayegani, Hamed Pirsiavash

TL;DR

SimA is a simple yet effective Softmax-free attention block that normalizes the query and key matrices with a simple ℓ1-norm instead of a Softmax layer. It achieves accuracy on par with the SOTA models without any need for a Softmax layer.

Abstract

Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive, partly due to the Softmax layer in the attention block. We introduce a simple but effective Softmax-free attention block, SimA, which normalizes the query and key matrices with a simple $\ell_1$-norm instead of using a Softmax layer. The attention block in SimA is then a simple multiplication of three matrices, so SimA can dynamically change the order of computation at test time to achieve computation that is linear in either the number of tokens or the number of channels. We empirically show that SimA applied to three SOTA variations of transformers, DeiT, XCiT, and CvT, results in on-par accuracy compared to the SOTA models, without any need for a Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on the accuracy, which simplifies the attention block further. The code is available here: https://github.com/UCDvision/sima
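The computation described in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation (see the linked repository for that). The function names, the `eps` stabilizer, and the rule of switching orderings when the token count exceeds the channel count are assumptions made for this sketch.

```python
import numpy as np


def l1_normalize_channels(x, eps=1e-12):
    """Scale each channel (column) of an (N, D) matrix to unit l1-norm across tokens.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    return x / (np.abs(x).sum(axis=0, keepdims=True) + eps)


def sima_attention(q, k, v):
    """Softmax-free attention for one head; q, k, v all have shape (N, D)."""
    q_hat = l1_normalize_channels(q)
    k_hat = l1_normalize_channels(k)
    n_tokens, n_channels = q.shape
    if n_tokens > n_channels:
        # Form the (D, D) matrix K^T V first: cost grows as N * D^2.
        return q_hat @ (k_hat.T @ v)
    # Form the (N, N) matrix Q K^T first: cost grows as N^2 * D.
    return (q_hat @ k_hat.T) @ v


# Usage: 196 tokens with 64 channels, so the (D, D) ordering is chosen.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((196, 64)) for _ in range(3))
out = sima_attention(q, k, v)
```

Because matrix multiplication is associative, both orderings return the same result up to floating-point error; only the cost differs. This is why the ordering can be chosen per input at test time.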

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison on edge devices: We evaluate the performance of a single attention block for each model on three devices: Raspberry Pi 4 (quad-core Cortex-A72 @ 1.5 GHz), NVIDIA Jetson Nano (quad-core ARM A57 @ 1.43 GHz), and Apple M1. To measure the effect of $\exp(\cdot)$ alone, we fix the order of the ($QK^TV$) product so that all models have the same dot-product complexity. We set $N>D$ for the left plots and $N<D$ for the right plots. We report the execution time averaged over $1000$ runs. We observe that SimA is faster than the other methods, which we believe is due to the higher cost of the $\exp(\cdot)$ operation compared to $\ell_1$ normalization on edge devices.
  • Figure 2: Our Simple Attention (SimA): First, we normalize each channel in $Q$ and $K$ with the $\ell_1$-norm across the tokens to get $\hat{Q}$ and $\hat{K}$. Next, we can choose either $(\hat{Q}\hat{K}^T)V$ or $\hat{Q}(\hat{K}^TV)$ depending on the number of input tokens $N$. Compared to XCA and MSA, our method has the following benefits: (1) It is free of Softmax, hence it is more efficient. (2) At test time, we can dynamically switch between $(\hat{Q}\hat{K}^T)V$ and $\hat{Q}(\hat{K}^TV)$ based on the number of input tokens (e.g., for different image resolutions).
  • Figure 3: Our method (SimA): Standard attention passes $QK^T$ through Softmax before multiplying with $V$. In contrast, we multiply $\hat{Q}\hat{K}^T$ directly with $V$. Hence, in our case, the magnitude of $\hat{Q}\hat{K}^T$ should identify which tokens are more important (their information flows to the next layers). We show that this magnitude is correlated with the importance of tokens. We extract $\hat{Q}$ and $\hat{K}$ from layer $12$ of the transformer. We compute the $\ell_2$-norm of each token for $\hat{Q}$ and $\hat{K}$, normalize it to the range [0,1], and overlay it as a heatmap on the image. We show the same visualization for DeiT in the supplementary for completeness and provide more examples in the appendix.
  • Figure A1: Effect of Softmax on inference time (GPU): We evaluate the performance of each model on a single RTX 8000 GPU with a batch size of $8$. When comparing the baseline to our method (SimA), we fix the order of ($QK^TV$) to have the same dot-product complexity as the baseline. For example, when comparing with DeiT, if $N>D$, it is more efficient to compute $\hat{Q}(\hat{K}^TV)$ for our method, but we compute $(\hat{Q}\hat{K}^T)V$ to have the same complexity as DeiT ($O(N^2D)$). We do this to evaluate solely the effect of Softmax on the computation time. Left: We fix the token dimension to $384$ and increase the image resolution. At $1536\times 1536$ resolution, DeiT is $58\%$ slower than our method due to the overhead of the $\exp(\cdot)$ function in Softmax. Right: We fix the resolution and increase the capacity of the model (dimensions of $Q$ and $K$). With $8192$ dimensions, XCiT is $22\%$ slower due to the Softmax overhead.
  • Figure A2: Our method (SimA): We extract $\hat{Q}$ and $\hat{K}$ from layer $12$ of the transformer. We compute the $\ell_2$-norm of each token for $\hat{Q}$ and $\hat{K}$, normalize it to the range [0,1], and overlay it as a heatmap on the image. Interestingly, the magnitude of each token represents its significance in our method. Note that all images are randomly selected from the MS-COCO test set without any visual inspection or cherry-picking.
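
The token-importance maps described in the captions of Figure 3 and Figure A2 (per-token $\ell_2$-norm, rescaled to [0,1]) can be approximated as follows. This is a rough sketch under stated assumptions, not the authors' visualization code. The function name `token_importance`, the `grid_shape` argument, the min-max rescaling, and the assumption that the input holds only patch tokens in row-major order (no class token) are all hypothetical.

```python
import numpy as np


def token_importance(x_hat, grid_shape, eps=1e-12):
    """Map an (N, D) matrix of normalized queries or keys to a [0, 1] heatmap.

    `grid_shape` is the assumed (rows, cols) layout of the N patch tokens.
    """
    norms = np.linalg.norm(x_hat, axis=1)       # l2-norm of each token, shape (N,)
    lo, hi = norms.min(), norms.max()
    scaled = (norms - lo) / (hi - lo + eps)     # min-max rescale to [0, 1]
    return scaled.reshape(grid_shape)


# Usage: 196 patch tokens arranged on an assumed 14 x 14 grid.
rng = np.random.default_rng(0)
k_hat = rng.standard_normal((196, 64))
heatmap = token_importance(k_hat, (14, 14))
```

The resulting grid would still need to be upsampled to the image resolution before being overlaid on the image; that step is left out here.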