SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition
Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang
TL;DR
SeaFormer introduces squeeze-enhanced Axial Transformer blocks (SEA attention) that sharply reduce the self-attention cost of conventional Vision Transformers, achieving near-linear scaling with input size for mobile visual recognition. The architecture employs a two-branch design (context and spatial) with a lightweight fusion head, enabling high-resolution segmentation on ARM devices while maintaining competitive accuracy. A multi-resolution distillation framework aligns high-resolution teacher features with a lightweight student via feature upsampling and a four-term loss, further reducing latency without sacrificing performance. Beyond segmentation, SeaFormer++ extends to image classification and object detection, with public code and models, offering a versatile, mobile-friendly backbone for diverse vision tasks.
Abstract
Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), long dominated by CNNs, has been significantly reshaped. However, their computational cost and memory requirements render these methods unsuitable for mobile devices. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating its potential to serve as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.
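To give a rough sense of why squeezing along an axis makes attention cheap, the following is a minimal NumPy sketch, not the released implementation. It assumes that "squeeze" means averaging the query, key and value maps along one spatial axis, that attention is then run over the H row descriptors and the W column descriptors separately, and that the two results are broadcast back and summed. The function names (`attend_1d`, `squeeze_axial_attention`) and the projection matrices are hypothetical, and the detail enhancement branch and positional embeddings are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_1d(q, k, v):
    # Plain scaled dot-product attention over a 1-D sequence.
    # q, k: (L, d), v: (L, dv) -> (L, dv)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def squeeze_axial_attention(x, wq, wk, wv):
    # x: (H, W, C) feature map; wq, wk: (C, d); wv: (C, dv).
    q, k, v = x @ wq, x @ wk, x @ wv
    # Assumed squeeze: mean over width gives H row descriptors,
    # mean over height gives W column descriptors.
    rows = attend_1d(q.mean(axis=1), k.mean(axis=1), v.mean(axis=1))  # (H, dv)
    cols = attend_1d(q.mean(axis=0), k.mean(axis=0), v.mean(axis=0))  # (W, dv)
    # Broadcast both axial results back to the full (H, W, dv) grid.
    return rows[:, None, :] + cols[None, :, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8))
wq, wk, wv = (rng.standard_normal((8, 5)) for _ in range(3))
out = squeeze_axial_attention(x, wq, wk, wv)
print(out.shape)
```

Under these assumptions the attention matrices are only H×H and W×W, so the attention step costs on the order of H² + W² rather than (HW)² for full self-attention; the projections and the final broadcast remain linear in the number of pixels.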
