Replacing softmax with ReLU in Vision Transformers
Mitchell Wortsman, Jaehoon Lee, Justin Gilmer, Simon Kornblith
TL;DR
This work investigates replacing the softmax in vision transformer attention with point-wise activations, in particular ReLU divided by the sequence length $L$. The resulting ReLU-attention, with $\phi = L^{-1}\mathsf{relu}$, is intended to enable parallelization along the sequence dimension while preserving favorable scaling behavior with compute. In large-scale experiments on ImageNet-21k and ImageNet-1k, ReLU-attention approaches or matches the scaling of traditional softmax attention. Scaling the activation by $L^{-\alpha}$ with $\alpha\approx1$ yields strong accuracy across ViT sizes, and adding a gating mechanism does not remove the need for this length-based scaling. The findings open avenues for faster attention implementations in Vision Transformers and motivate further exploration of activation functions and normalization in attention mechanisms.
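For reference, the substitution can be written out explicitly. The notation here is assumed for illustration and is not taken from the summary: $q_i, k_j, v_j \in \mathbb{R}^d$ are the query, key, and value vectors of a single attention head, $L$ is the sequence length, and $a_{ij}$ are the attention weights.

$$
o_i = \sum_{j=1}^{L} a_{ij}\, v_j,
\qquad
a_{ij} = \phi\!\left(\tfrac{1}{\sqrt{d}}\left[q_i^\top k_1, \dots, q_i^\top k_L\right]\right)_j
$$

$$
\text{softmax-attention: } a_{ij} = \frac{\exp\!\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{m=1}^{L} \exp\!\left(q_i^\top k_m / \sqrt{d}\right)},
\qquad
\text{ReLU-attention: } a_{ij} = \frac{1}{L}\,\mathsf{relu}\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right)
$$

Unlike softmax weights, the ReLU-attention weights need not sum to one over $j$. Each weight depends only on its own score, which is what makes the activation point-wise.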
Abstract
Previous research observed accuracy degradation when replacing the attention softmax with a point-wise activation such as ReLU. In the context of vision transformers, we find that this degradation is mitigated when the activation output is divided by the sequence length. Our experiments training small to large vision transformers on ImageNet-21k indicate that ReLU-attention can approach or match the performance of softmax-attention in terms of scaling behavior as a function of compute.
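To make "dividing by sequence length" concrete, the following is a minimal single-head sketch in NumPy, not the authors' implementation. The function names `relu_attention` and `softmax_attention`, the `(L, d)` input shapes, and the $1/\sqrt{d}$ score scaling are assumptions chosen for illustration; multi-head projections, batching, and gating are omitted.

```python
import numpy as np

def relu_attention(q, k, v):
    """Single-head attention with softmax replaced by relu / L (illustrative sketch)."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # (L, L) scaled dot products
    weights = np.maximum(scores, 0.0) / L    # point-wise ReLU, then divide by sequence length
    return weights @ v                       # (L, d) weighted sum of values

def softmax_attention(q, k, v):
    """Standard softmax attention, included only for comparison."""
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to one
    return weights @ v

# Hypothetical toy inputs: sequence length 8, head dimension 4.
rng = np.random.default_rng(0)
L, d = 8, 4
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

out_relu = relu_attention(q, k, v)
out_soft = softmax_attention(q, k, v)
print(out_relu.shape, out_soft.shape)
```

The ReLU variant needs no reduction across a row to compute its weights, whereas softmax requires a row-wise maximum and sum. This is the property the summary associates with easier parallelization along the sequence dimension.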
