S2AFormer: Strip Self-Attention for Efficient Vision Transformer

Guoan Xu, Wenfeng Huang, Wenjing Jia, Jiamao Li, Guangwei Gao, Guo-Jun Qi

TL;DR

S2AFormer tackles the quadratic complexity of Vision Transformers by introducing Strip Self-Attention (SSA) and Hybrid Perception Blocks (HPBs) that fuse CNN local perception with global context. The architecture uses a four-stage hierarchy with a Local Interaction Module (LIM) to preserve boundary details and rotation/translation robustness, achieving substantial efficiency gains. Across ImageNet-1K, ADE20K, and COCO benchmarks, S2AFormer demonstrates competitive or superior accuracy with lower MACs and faster inference on GPUs and non-GPU platforms. Ablation studies confirm the value of LIM and convolution-based spatial reduction, and the work highlights practical deployment potential for efficient vision transformers.

Abstract

Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's ability to capture global dependencies between all tokens. However, its computational cost grows quadratically with the number of tokens, which limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs, the expensive pairwise token affinity and complex matrix operations inherent in self-attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self-Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) that integrate the local perception capabilities of CNNs with the global context modeling of the Transformer's attention mechanism. A key innovation of SSA lies in its reduction of the spatial dimensions of $K$ and $V$, while compressing the channel dimensions of $Q$ and $K$. This design significantly reduces computational overhead while preserving accuracy, balancing efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet-1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non-GPU environments, making it a strong candidate for efficient vision Transformers.
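
To make the cost argument in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the paper's own accounting. It counts multiply-accumulates (MACs) for the two matrix products in dot-product attention, first with full-size $Q$, $K$, $V$ and then with the spatial size of $K$ and $V$ and the channel size of $Q$ and $K$ both reduced. The token count, channel width, and the reduction factors `r` and `c` are hypothetical values chosen for illustration. Projection layers and the cost of the reduction itself are ignored.

```python
def attention_macs(n_q, n_kv, d_qk, d_v):
    """MACs for the two matrix products in dot-product attention.

    n_q  : number of query tokens
    n_kv : number of key/value tokens
    d_qk : channel width shared by Q and K
    d_v  : channel width of V
    """
    score_macs = n_q * n_kv * d_qk   # Q @ K^T
    output_macs = n_q * n_kv * d_v   # softmax(scores) @ V
    return score_macs + output_macs


# Hypothetical setting: a 56x56 feature map with 64 channels.
n, d = 56 * 56, 64
# Hypothetical reduction factors (assumptions, not values from the paper):
r = 4  # K and V keep n / r tokens
c = 4  # Q and K keep d / c channels

full = attention_macs(n, n, d, d)
reduced = attention_macs(n, n // r, d // c, d)

print("full attention MACs:   ", full)
print("reduced attention MACs:", reduced)
print("ratio:", full / reduced)
```

Under these assumptions the full cost is $2N^2d$, while the reduced cost is $(N^2/r)(d/c + d)$. The term in $N$ is still quadratic, but the constant in front of it shrinks with both reduction factors.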

Paper Structure

This paper contains 21 sections, 13 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of our proposed S2AFormer model with SOTA methods, including EMOv2 [zhang2025emov2], SwiftFormer [shaker2023swiftformer], and iFormer [zheng2025iformer], in terms of top-1 accuracy vs. MACs on ImageNet-1k [deng2009imagenet]. The best trade-off lies in the upper-left of the plot: higher top-1 accuracy with fewer MACs.
  • Figure 2: Visualization of perception fields in different strategies. Red stars represent the current positions, and the black areas represent the regions that the current position cannot perceive. Convolutional networks cannot model long-range contexts. Global self-attention establishes global receptive fields at the cost of quadratic computational complexity. SSA integrated with local interaction prioritizes semantically salient regions globally while suppressing redundant spatial correlations.
  • Figure 3: Overview of our proposed S2AFormer. Similar to [liu2021swin, wang2021pyramid], we employ a hierarchical architecture with four stages, each containing $L_i$ Hybrid Perception Blocks. Table \ref{network_config} provides the detailed network configurations of S2AFormer variants.
  • Figure 4: Comparison of different self-attention mechanisms: (a) Vanilla self-attention in ViTs [dosovitskiy2020image], which computes global attention using standard dot-product operations. (b) Separable self-attention in MobileViT-v2 [mehta2022separable], which applies element-wise operations on query ($Q$) and key ($K$) to form a context vector. (c) Swift self-attention in SwiftFormer [shaker2023swiftformer], where $Q$ is weighted and pooled into global queries, broadcast, and multiplied element-wise with $K$ to generate global context. (d) Convolutional additive self-attention in CAS-ViT [zhang2024cas], which replaces the global dot-product with a cascaded design applying spatial attention followed by channel attention. (e) Our proposed strip self-attention, which jointly compresses spatial and channel dimensions to eliminate redundant information, achieving a lightweight design while preserving dense global dependencies.
  • Figure 5: Visualization of effective receptive fields (ERFs) across different models. Convolution-based models (a), (b), and (c) exhibit highly localized receptive fields, while vanilla ViT (d) distributes attention broadly across all spatial positions. Model (e) shows more limited receptive fields than ours. In contrast, our method (f) achieves a balanced pattern, capturing key local regions and progressively expanding outward in a strip-like manner.
  • ...and 3 more figures
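
The Figure 4 caption contrasts vanilla dot-product attention with a design that compresses both spatial and channel dimensions. The NumPy sketch below illustrates only the resulting tensor shapes and is not the paper's SSA: the strip-specific details are not given in this summary, so the spatial reduction here is a stand-in (averaging groups of `r` consecutive tokens), and the projection matrices `w_q`, `w_k`, `w_v`, the compressed width `d_c`, and the factor `r` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vanilla_attention(q, k, v):
    """Standard dot-product attention; q, k, v have shape (N, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, N) affinity matrix
    return softmax(scores) @ v               # (N, d)


def reduced_attention(x, w_q, w_k, w_v, r):
    """Attention with fewer K/V tokens and narrower Q/K channels.

    x        : (N, d) input tokens, with N divisible by r
    w_q, w_k : (d, d_c) projections that compress Q and K channels
    w_v      : (d, d) projection for V
    r        : spatial reduction factor (stand-in: token averaging)
    """
    n, d = x.shape
    # Stand-in spatial reduction: average every r consecutive tokens.
    x_small = x.reshape(n // r, r, d).mean(axis=1)  # (N/r, d)
    q = x @ w_q                                     # (N, d_c)
    k = x_small @ w_k                               # (N/r, d_c)
    v = x_small @ w_v                               # (N/r, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (N, N/r)
    return softmax(scores) @ v, scores.shape        # output is (N, d)


rng = np.random.default_rng(0)
n_tokens, d, d_c, r = 64, 32, 8, 4  # hypothetical sizes
x = rng.standard_normal((n_tokens, d))
out, score_shape = reduced_attention(
    x,
    rng.standard_normal((d, d_c)),
    rng.standard_normal((d, d_c)),
    rng.standard_normal((d, d)),
    r,
)
print("vanilla affinity shape:", (n_tokens, n_tokens))
print("reduced affinity shape:", score_shape)
print("output shape:", out.shape)
```

Every query token still attends to a summary of the whole input, so the output keeps one vector per token, while the affinity matrix shrinks from $N \times N$ to $N \times N/r$ and each affinity entry is computed over `d_c` rather than `d` channels.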