Lightweight Vision Transformer with Bidirectional Interaction

Qihang Fan, Huaibo Huang, Xiaoqiang Zhou, Ran He

TL;DR

This work tackles the efficiency gap in Vision Transformers by introducing Fully Adaptive Self-Attention (FASA), a module that models local and global contexts with explicit bidirectional interaction. Built upon FASA, the Fully Adaptive Transformer (FAT) family delivers lightweight, hierarchical backbones that achieve competitive or state-of-the-art accuracy across image classification, semantic segmentation, and object detection with minimal parameters and FLOPs. The key innovations are context-aware feature aggregation (CAFA), fine-grained downsampling for global perception, and cross-modulation-based bidirectional interaction that fuses local and global information effectively. The results demonstrate that FAT delivers strong performance and speed on GPUs, highlighting its practical value for resource-constrained vision tasks, with code to be released publicly.

Abstract

Recent advancements in vision backbones have significantly improved their performance by simultaneously modeling images' local and global contexts. However, the bidirectional interaction between these two contexts, which is important in the human visual system, has not been well explored or exploited. This paper proposes a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers that models local information, global information, and the bidirectional interaction between them in a context-aware way. Specifically, FASA employs self-modulated convolutions to adaptively extract local representations while using self-attention in a down-sampled space to extract global representations. It then conducts a bidirectional adaptation process between the local and global representations to model their interaction. In addition, we introduce a fine-grained downsampling strategy that gives the down-sampled self-attention mechanism finer-grained global perception. Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family. Extensive experiments on multiple vision tasks demonstrate that FAT achieves impressive performance. Notably, FAT achieves 77.6% accuracy on ImageNet-1K using only 4.5M parameters and 0.7G FLOPs, surpassing the most advanced ConvNets and Transformers with similar model size and computational cost. Moreover, our model runs faster on modern GPUs than other models. Code will be available at https://github.com/qhfan/FAT.
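
The three stages named in the abstract (local extraction, global extraction in a down-sampled space, and bidirectional adaptation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameter dictionary, the sigmoid gates, the 3x3 depthwise kernel, and the use of plain average pooling in place of the fine-grained downsampling strategy are all hypothetical choices made to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(x, w):
    """Depthwise 3x3 convolution. x: (H, W, C), w: (3, 3, C); zero padding, stride 1."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * w[i, j, :]
    return out

def fasa_sketch(x, params, pool=2):
    """Hypothetical FASA-like block. x: (H, W, C); H and W must be divisible by pool."""
    H, W, C = x.shape
    q = x @ params["wq"]  # per-pixel queries, shape (H, W, C)

    # (a) Local branch: convolution output gated by the input-dependent query
    #     (an assumed stand-in for the "self-modulated convolution").
    local = depthwise_conv3x3(x, params["w_dw"]) * sigmoid(q)

    # (b) Global branch: full-resolution queries attend to down-sampled keys/values.
    #     Average pooling is an assumption, not the fine-grained downsampling.
    pooled = x.reshape(H // pool, pool, W // pool, pool, C).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, C)
    k = tokens @ params["wk"]
    v = tokens @ params["wv"]
    attn = softmax(q.reshape(-1, C) @ k.T / np.sqrt(C), axis=-1)
    glob = (attn @ v).reshape(H, W, C)

    # (c) Bidirectional interaction: each branch is modulated by the other.
    local_mod = local * sigmoid(glob)
    glob_mod = glob * sigmoid(local)
    return (local_mod * glob_mod) @ params["wo"]

rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
x = rng.standard_normal((H, W, C))
params = {
    "wq": rng.standard_normal((C, C)),
    "wk": rng.standard_normal((C, C)),
    "wv": rng.standard_normal((C, C)),
    "wo": rng.standard_normal((C, C)),
    "w_dw": rng.standard_normal((3, 3, C)),
}
y = fasa_sketch(x, params)
print(y.shape)  # (8, 8, 4)
```

With `pool=2`, each of the 64 query positions attends to only 16 pooled tokens instead of 64, which is where the saving from attention in a down-sampled space comes from in this toy setting.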

Paper Structure (32 sections, 12 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Illustration of the human visual system (top) and our FASA (bottom). The human visual system can perceive both local and global contexts and model the bidirectional interaction between them. Our FASA follows this mechanism and consists of three parts: (a) local adaptive aggregation, (b) global adaptive aggregation, and (c) bidirectional adaptive interaction. It models local information, global information, and local-global bidirectional interaction in a context-aware manner.
  • Figure 2: Top-1 accuracy vs. FLOPs on ImageNet-1K for recent SOTA CNN and transformer models. The proposed Fully Adaptive Transformer (FAT) outperforms all counterparts in all settings.
  • Figure 3: Illustration of the FAT. FAT is composed of multiple FAT blocks. A FAT block consists of CPE, FASA and ConvFFN.
  • Figure 4: Spectral analysis of 8 output channels of FASA. Lighter colors indicate larger magnitudes, and pixels closer to the center correspond to lower frequencies. From top to bottom, the results are from (a) local adaptive aggregation, (b) global adaptive aggregation, (c) add+linear fusion, and (d) bidirectional adaptive interaction.
  • Figure 5: Comparison between our bidirectional adaptive interaction and traditional fusion method.
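
The reading convention stated for Figure 4 (low frequencies at the center, larger magnitudes shown lighter) matches a centered 2-D Fourier magnitude spectrum. The sketch below shows one common way to compute such a view for a single feature channel. It is an assumption about the general technique, not the authors' plotting code; the helper names and the log scaling are hypothetical.

```python
import numpy as np

def centered_log_spectrum(feature_map):
    """Log-magnitude spectrum of one (H, W) channel, with the zero frequency at the center."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    return np.log1p(np.abs(spectrum))  # log scaling keeps weak components visible

def peak_location(spec):
    """Return the (row, col) of the largest magnitude as plain Python ints."""
    idx = np.unravel_index(np.argmax(spec), spec.shape)
    return tuple(int(i) for i in idx)

# A constant map contains only the zero frequency, so its peak sits at the center.
smooth = np.ones((8, 8))
print(peak_location(centered_log_spectrum(smooth)))   # (4, 4)

# A checkerboard alternates every pixel, so its peak sits at the highest
# frequency, which lands in the corner after the shift.
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
print(peak_location(centered_log_spectrum(checker)))  # (0, 0)
```

Under this convention, a branch dominated by smooth, global content concentrates its energy near the center of the plot, while a branch that keeps fine local detail spreads energy toward the edges.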