Replacing softmax with ReLU in Vision Transformers

Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith

TL;DR

This work investigates replacing the softmax in vision transformer attention with point-wise activations, focusing on ReLU divided by the sequence length $L$. The proposed ReLU-attention, with $\phi = L^{-1}\mathsf{relu}$, can be parallelized along the sequence dimension while preserving favorable scaling behavior with compute. In large-scale experiments on ImageNet-21k and ImageNet-1k, ReLU-attention approaches or matches the scaling of traditional softmax attention. Scaling the activation by $L^{-\alpha}$ with $\alpha \approx 1$ yields strong accuracy across ViT sizes, and adding a gate does not eliminate the need for this length-based scaling. The findings open avenues for faster attention implementations in vision transformers and guide future exploration of activation functions and normalization in attention mechanisms.

Abstract

Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when dividing by sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
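
The substitution described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names, the $1/\sqrt{d}$ score scaling, and the toy shapes are assumptions, and qk-layernorm, multiple heads, and projections are omitted. The chunked variant illustrates why a point-wise activation is easy to parallelize over the sequence: no row-wise normalizer has to be gathered across key blocks.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard single-head attention; q, k, v have shape (L, d)."""
    L, d = k.shape
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # needs a full row
    return weights @ v

def relu_attention(q, k, v):
    """Attention with softmax replaced by relu / L (assumed form)."""
    L, d = k.shape
    scores = q @ k.T / np.sqrt(d)
    weights = np.maximum(scores, 0.0) / L  # point-wise, no row normalizer
    return weights @ v

def relu_attention_chunked(q, k, v, chunk=4):
    """Same result, accumulated over independent blocks of keys/values."""
    L, d = k.shape
    out = np.zeros((q.shape[0], v.shape[1]))
    for start in range(0, L, chunk):
        k_blk, v_blk = k[start:start + chunk], v[start:start + chunk]
        scores = q @ k_blk.T / np.sqrt(d)
        out += (np.maximum(scores, 0.0) / L) @ v_blk
    return out

rng = np.random.default_rng(0)
L, d = 16, 8
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

out_soft = softmax_attention(q, k, v)
out_relu = relu_attention(q, k, v)
out_chunked = relu_attention_chunked(q, k, v, chunk=4)
print(out_soft.shape, out_relu.shape, np.allclose(out_relu, out_chunked))
```

Each key block contributes an independent partial sum, so the blocks can be processed in any order or on different devices and simply added.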

Paper Structure

This paper contains 5 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Replacing $\mathsf{softmax}$ with $\mathsf{relu}/\mathsf{seqlen}$ approaches or matches the scaling performance of traditional attention for vision transformers [dosovitskiy2021an] with qk-layernorm [dehghani2023scaling]. This figure displays results for small to large vision transformers trained on ImageNet-21k [deng2009imagenet] for 30 epochs. We report ImageNet-1k accuracy for ImageNet-21k models by taking the top class among those that are in ImageNet-1k, without fine-tuning. Attention with ReLU can be parallelized over the sequence length dimension with fewer gather operations than softmax attention.
  • Figure 2: Replacing softmax with $L^{-\alpha} h$ where $h \in \{\mathsf{relu}, \mathsf{relu}^2, \mathsf{gelu}, \mathsf{softplus}, \mathsf{identity}, \mathsf{relu6}, \mathsf{sigmoid}\}$ and $L$ is sequence length. We typically observe the best results when $\alpha$ is close to 1. There is no clear best non-linearity at $\alpha \approx 1$, so we use ReLU in our main experiment for its speed.
  • Figure 3: The effect of removing qk-layernorm [dehghani2023scaling] on attention with ReLU and squared ReLU scaled by $L^{-\alpha}$ where $L$ is sequence length. Results are shown for the S/32, S/16, and S/8 vision transformer models [dosovitskiy2021an, vit_baseline] trained on ImageNet-21k.
  • Figure 4: The effect of using a gated attention unit [hua2022transformer] on attention with ReLU and squared ReLU scaled by $L^{-\alpha}$ where $L$ is sequence length. Results are shown for the S/32, S/16, and S/8 vision transformer models [dosovitskiy2021an, vit_baseline] trained on ImageNet-21k.
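
The family $L^{-\alpha} h$ swept in Figures 2 to 4 can be sketched as follows. This is an illustrative toy rather than the paper's experimental code: the helper names, the random Gaussian queries and keys, the head dimension of 64, and the tanh approximation of GELU are all assumptions. It shows one plausible reading of why $\alpha \approx 1$ is a natural choice: with $\alpha = 1$ the sum of attention weights in a row stays roughly constant as $L$ grows, whereas with $\alpha = 0$ it grows in proportion to $L$.

```python
import numpy as np

def gelu_tanh(x):
    # tanh approximation of GELU (an assumption; the exact form may differ)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# The seven point-wise functions h named in the Figure 2 caption.
ACTIVATIONS = {
    "relu": lambda x: np.maximum(x, 0.0),
    "relu2": lambda x: np.maximum(x, 0.0) ** 2,
    "gelu": gelu_tanh,
    "softplus": lambda x: np.logaddexp(x, 0.0),
    "identity": lambda x: x,
    "relu6": lambda x: np.clip(x, 0.0, 6.0),
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}

def scaled_attention_weights(q, k, h, alpha):
    """Attention weights L^{-alpha} * h(q k^T / sqrt(d)); L is the key count."""
    L, d = k.shape
    return float(L) ** (-alpha) * h(q @ k.T / np.sqrt(d))

def mean_row_sum(L, alpha, h, d=64, seed=0):
    """Average total attention weight per query for random Gaussian q and k."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((L, d))
    k = rng.standard_normal((L, d))
    return scaled_attention_weights(q, k, h, alpha).sum(axis=-1).mean()

for name, h in ACTIVATIONS.items():
    sums = [mean_row_sum(L, 1.0, h) for L in (64, 256, 1024)]
    print(f"{name:9s} alpha=1 row sums for L=64,256,1024:", np.round(sums, 3))

relu = ACTIVATIONS["relu"]
ratio_alpha1 = mean_row_sum(256, 1.0, relu) / mean_row_sum(64, 1.0, relu)
ratio_alpha0 = mean_row_sum(256, 0.0, relu) / mean_row_sum(64, 0.0, relu)
print("relu row-sum ratio (L=256 vs 64):", ratio_alpha1, ratio_alpha0)
```

The toy only probes weight magnitudes at random initialization; it says nothing about trained accuracy, which is what the figures report.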