SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition

Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang

TL;DR

SeaFormer introduces squeeze-enhanced Axial Transformer blocks (SEA attention) that sharply reduce the self-attention cost of traditional Vision Transformers, achieving linear-like scaling with input size for mobile visual recognition. The architecture employs a two-branch design (context and spatial) with a lightweight fusion head, enabling high-resolution segmentation on ARM devices while maintaining competitive accuracy. A multi-resolution distillation framework aligns high-resolution teacher features with a lightweight student via feature upsampling and a four-term loss, further reducing latency without sacrificing performance. Beyond segmentation, SeaFormer++ extends to image classification and object detection, with public code and models, offering a versatile, mobile-friendly backbone for diverse vision tasks.

Abstract

Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has been significantly reshaped. However, the computational cost and memory requirements render these methods unsuitable for mobile devices. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating its potential to serve as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.
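To make the scaling argument concrete, the following sketch counts the query-key pairs scored by global self-attention over an H×W feature map and compares that with attention applied to one squeezed row sequence and one squeezed column sequence. The counting model (one score per token pair, with the channel dimension and the cost of squeezing ignored) and the function names are illustrative assumptions, not figures or code from the paper.

```python
def full_attention_pairs(h: int, w: int) -> int:
    """Token pairs scored by global self-attention over h * w tokens."""
    n = h * w
    return n * n


def squeezed_axial_pairs(h: int, w: int) -> int:
    """Token pairs scored when attention runs on squeezed axes only.

    Assumption: the map is squeezed into one sequence of length h
    and one sequence of length w, each attended independently.
    """
    return h * h + w * w


if __name__ == "__main__":
    for side in (16, 32, 64):
        full = full_attention_pairs(side, side)
        axial = squeezed_axial_pairs(side, side)
        print(f"{side}x{side}: full={full}, squeezed axial={axial}")
```

Under this counting model, a square map of side s needs 2s² scores on the squeezed axes, which grows linearly in the pixel count s², whereas global attention needs s⁴.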

Paper Structure (63 sections, 8 equations, 10 figures, 18 tables)


Figures (10)

  • Figure 1: Left: Latency comparison with Transformer [vaswani2017attention], MixFormer [chen2022mixformer], ACmix [pan2022integration], Axial attention [ho2019axial] and local attention [luong2015effective]. Latency is measured with a single module of channel dimension 64 on a Qualcomm Snapdragon 865 processor. Right: The mIoU versus latency on the ADE20K val set. MV2 means MobileNetV2 [sandler2018mobilenetv2]. MV3-L means MobileNetV3-Large [howard2019searching]. MV3-Lr denotes MobileNetV3-Large-reduce [howard2019searching]. The latency is measured on a single Qualcomm Snapdragon 865, and only an ARM CPU core is used for speed testing. No other means of acceleration, e.g., GPU or quantization, is used. In the right panel, the input size is 512×512. SeaFormer achieves a superior trade-off between mIoU and latency.
  • Figure 2: The overall architecture of SeaFormer. It contains a shared STEM, a context branch (red), a spatial branch (blue), a fusion block and a light segmentation head. MV2 block means MobileNetV2 block and MV2$\downarrow$2 means MobileNetV2 block with downsampling. The SeaFormer layers and fusion block shown in dashed boxes exist only in SeaFormer-L. The symbol $\bigotimes$ denotes element-wise multiplication.
  • Figure 3: Right: the schematic illustration of the proposed squeeze-enhanced Axial Transformer layer, including a squeeze-enhanced Axial attention and a Feed-Forward Network (FFN). Left: the squeeze-enhanced Axial attention, including the detail enhancement kernel and the squeeze Axial attention. The symbol $\bigoplus$ indicates an element-wise addition operation. Mul means multiplication.
  • Figure 4: Left: the schematic diagram of the proposed adaptive squeezing. Right: the adaptive expanding operation. Mat mul means matrix multiplication. Attn out is the output of the multi-head attention.
  • Figure 5: The overall pipeline of multi-resolution distillation based on feature up-sampling. MV2(E=4) denotes MobileNetV2 block with an expansion ratio of 4, and the default kernel size for depth-wise convolution is 5.
  • ...and 5 more figures
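The squeeze Axial attention named in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration under loud assumptions: the squeeze is plain mean pooling along each axis (Figure 4 indicates the paper's adaptive squeezing uses matrix multiplication instead), attention is single-head with no learned query/key/value projections, and the detail enhancement kernel and FFN are omitted. The function names `attend` and `squeeze_axial_attention` are hypothetical and do not come from the released code.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention over a (length, channels) sequence.

    Assumption: queries, keys and values are the tokens themselves
    (no learned projections), purely to keep the sketch short.
    """
    scale = tokens.shape[-1] ** -0.5
    weights = softmax(tokens @ tokens.T * scale, axis=-1)
    return weights @ tokens


def squeeze_axial_attention(x: np.ndarray) -> np.ndarray:
    """Map an (H, W, C) feature map to an (H, W, C) output."""
    rows = x.mean(axis=1)   # squeeze the width:  (H, C)
    cols = x.mean(axis=0)   # squeeze the height: (W, C)
    row_out = attend(rows)  # H x H attention scores
    col_out = attend(cols)  # W x W attention scores
    # Broadcast both axial results back onto the full H x W grid.
    return row_out[:, None, :] + col_out[None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.standard_normal((8, 12, 16))
    out = squeeze_axial_attention(feature_map)
    print(out.shape)
```

Because attention only ever sees sequences of length H or W, no H·W by H·W score matrix is formed; the trade-off is that squeezing discards local detail, which is the role the caption assigns to the detail enhancement kernel.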