Softmax-free Linear Transformers

Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, Li Zhang

TL;DR

Vision Transformers are limited by the quadratic complexity of self-attention, which hinders high-resolution visual tasks. The authors introduce Softmax-Free Transformers (SOFT), which replace softmax-based similarity with a Gaussian kernel and combine a Nyström low-rank decomposition with a Newton-Raphson estimate of the Moore-Penrose inverse to achieve linear time and space complexity. A symmetric normalization (SOFT++) extends the method to dense prediction. The resulting attention mechanism is theoretically grounded and scalable: experiments on ImageNet, COCO, and ADE20K show strong accuracy/efficiency trade-offs, and results on Long Range Arena validate it on NLP tasks. Together, these contributions enable longer token sequences and dense vision tasks with competitive performance, offering a practical path toward scalable transformers across vision and language.

Abstract

Vision transformers (ViTs) have pushed the state of the art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has quadratic complexity in both computation and memory usage. This motivates approximating self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximation, that is, normalizing the scaled dot-product between token feature vectors using the softmax function. Preserving the softmax operation challenges any subsequent linearization effort. Based on this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling a full self-attention matrix to be approximated under low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in a superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.
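The low-rank idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the bandwidth `sqrt(d)`, the evenly strided choice of `m` landmark tokens, the use of one token matrix `X` for both queries and keys, and the use of `np.linalg.pinv` in place of the iterative inverse are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    """Pairwise Gaussian similarity exp(-||x - y||^2 / (2 * bandwidth))."""
    sq_dist = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth))

def lowrank_gaussian_attention(X, V, m):
    """Nystrom-style approximation of kernel attention S @ V using m landmarks.

    X: (n, d) token features used as both queries and keys (assumption).
    V: (n, d_v) values. Cost is O(n * m * d), linear in the sequence length n.
    """
    n, d = X.shape
    bandwidth = np.sqrt(d)                        # assumed scaling
    idx = np.linspace(0, n - 1, m).astype(int)    # assumed landmark selection
    landmarks = X[idx]
    P = gaussian_kernel(X, landmarks, bandwidth)          # (n, m)
    A = gaussian_kernel(landmarks, landmarks, bandwidth)  # (m, m)
    # S is approximated by P @ pinv(A) @ P.T; multiply right to left so the
    # full (n, n) matrix is never formed.
    return P @ (np.linalg.pinv(A) @ (P.T @ V))
```

When `m` equals `n` every token is a landmark and the sketch reproduces full kernel attention; smaller `m` trades accuracy for memory and time.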

Paper Structure

This paper contains 21 sections, 8 theorems, 52 equations, 7 figures, 12 tables, and 2 algorithms.

Key Result

Proposition

When $\alpha$ is sufficiently small, the iteration $A_{k+1}=2A_k-A_k A A_k$ produces a sequence $A_k$ that converges to $A^{\dagger}$.
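A minimal NumPy sketch of this iteration follows. The excerpt does not say how $\alpha$ enters, so the sketch assumes it scales the starting point, $A_0=\alpha A^{\top}$, and picks $\alpha = 1/(\|A\|_1\|A\|_\infty)$, a standard safe choice because $\|A\|_1\|A\|_\infty$ bounds the squared largest singular value. The function name and the default iteration count are hypothetical.

```python
import numpy as np

def newton_pinv(A, num_iters=20, alpha=None):
    """Approximate the Moore-Penrose inverse via A_{k+1} = 2 A_k - A_k A A_k."""
    if alpha is None:
        # Assumed step size: ||A||_1 * ||A||_inf >= sigma_max(A)^2, so
        # this alpha keeps the starting residual I - A_0 A contractive.
        alpha = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    A_k = alpha * A.T  # assumed initialisation A_0 = alpha * A^T
    for _ in range(num_iters):
        A_k = 2.0 * A_k - A_k @ A @ A_k
    return A_k

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
approx = newton_pinv(A)
```

Each step uses only matrix products, which is why the iteration suits GPU execution better than an explicit SVD-based pseudo-inverse.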

Figures (7)

  • Figure 1: Top-1 classification accuracy on the ImageNet [deng2009imagenet] validation set with respect to the number of parameters, and memory usage with respect to token sequence length, compared to other methods. (a) Comparison with CNN models, ResNet [he2016deep] and CoAtNet [dai2021coatnet], and with Transformer models: PVT [wang2021pyramid], Swin [liu2021swin], DeiT [touvron2021training], ViT [dosovitskiy2020image], T2T-ViT [yuan2021tokens], Twins-SVT [chu2021twins] and SAN10 [zhao2020exploring]; (b) Comparison with Transformer [vaswani2017attention], Linformer [wang2020linformer], Nyströmformer [xiong2021nystr] and Performer [choromanski2020rethinking]. The memory usage is measured with a batch size of 1 on a 16GB Tesla V100.
  • Figure 2: Schematic illustration of the proposed softmax-free self-attention (SOFT) method. P.E.: position embedding. Dashed lines: linear projection. $d_h$: the hidden dimension of each attention head. $\circ$ denotes the matrix dot product.
  • Figure 3: A comparison of Top-1 classification accuracy on the ImageNet [deng2009imagenet] validation set with respect to inference throughput for various models. The comparison includes CNN models such as ConvNeXt [liu2022convnet] and Transformer models such as Swin [liu2021swin]. Models positioned closer to the top-right offer a better balance of accuracy and throughput. Inference throughput is measured on a V100 GPU, following [liu2021swin, liu2022convnet].
  • Figure 4: Comparison of attention heatmaps for a selected query patch (indicated by a cross "+") against all patches in an image. Heatmaps are derived from the first head's corresponding row in the attention maps, as calculated by Equation \ref{eq:reg_norm_attn}. The heatmaps are normalized to a 0-1 scale, with warmer colors indicating higher relevance. The model variants compared are: (a) Transformer [vaswani2017attention], (b) Performer [choromanski2020rethinking], (c) Nyströmformer [xiong2021nystr], and (d) our SOFT approach. For additional examples, refer to Appendix \ref{sec:attn_vis}.
  • Figure 5: Convergence analysis for the approximation of the Moore-Penrose inverse on ImageNet, COCO and ADE20K separately. SOFT-Tiny is used. We measure $\|AA_kA-A\|_p/\|A\|_2$ for 100 input images on each dataset. The solid line shows the average convergence metric, while the shaded area indicates the upper and lower bounds.
  • ...and 2 more figures
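The convergence metric in the Figure 5 caption can be tracked per iteration with a short NumPy sketch. This is an illustration, not the authors' evaluation code: the function name is hypothetical, the spectral norm is assumed for both numerator and denominator, and the initialisation $A_0=\alpha A^{\top}$ with $\alpha = 1/(\|A\|_1\|A\|_\infty)$ is an assumption.

```python
import numpy as np

def pinv_residuals(A, num_iters=20):
    """Return ||A A_k A - A||_2 / ||A||_2 after each Newton-Raphson step."""
    # Assumed initialisation: A_0 = alpha * A^T with a conservative alpha.
    alpha = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    A_k = alpha * A.T
    norm_A = np.linalg.norm(A, 2)  # spectral norm of A
    residuals = []
    for _ in range(num_iters):
        A_k = 2.0 * A_k - A_k @ A @ A_k
        residuals.append(np.linalg.norm(A @ A_k @ A - A, 2) / norm_A)
    return residuals

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
residuals = pinv_residuals(A)
```

Plotting `residuals` against the iteration index for many inputs gives a curve of the kind the caption describes.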

Theorems & Definitions (17)

  • Proposition
  • Proposition
  • Proof
  • Proposition
  • Proof
  • Proposition
  • Proof
  • Proposition
  • Proposition
  • Proof
  • ...and 7 more