DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

Nguyen Huu Bao Long, Chenyu Zhang, Yuzhi Shi, Tsubasa Hirakawa, Takayoshi Yamashita, Tohgoroh Matsui, Hironobu Fujiyoshi

TL;DR

DeBiFormer (Deformable Bi-level Routing Attention Transformer) is a general-purpose vision transformer built on the DBRA module, which uses agent queries to optimize the selection of key-value pairs and improves the interpretability of queries in attention maps.

Abstract

Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer

Paper Structure

This paper contains 28 sections, 14 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Vanilla attention and its sparse variants. (a) Vanilla attention operates globally, which leads to high computational complexity and a large memory footprint. (b)-(c) Several strategies reduce complexity by using sparse attention with handcrafted patterns, such as local windows [LOCAL] and dilated windows [CROSS, MAXVIT, DILATEDFORMER]. (d) Deformable attention [DEF] provides image-adaptive sparsity by deforming the regular grid. (e) Bi-level routing attention [BIFORMER] first searches for the top-k (k = 3 in this case) relevant regions and then attends to the union of these regions. (f) Our approach also uses bi-level routing: it first searches for the top-k (k = 1 in this case) relevant regions, then attends to the union of these regions, with the regular attention grid deformed via the top-k relevant regions.
  • Figure 2: Detailed architecture of Deformable Bi-level Routing Attention. In the top-left part, the set of reference points is uniformly distributed across the feature map. Offsets for these points are learned from queries through the offset network. Then, in the top-middle part, deformed features are projected from sampled features based on the locations of deformed points. In the bottom-left-middle part, we attend to projected deformed features by utilizing gathered key-value pairs in top-k-related windows.
  • Figure 3: Overall model architecture of our DeBiFormer. Left: Network architecture of DeBiFormer. $N_{1}$ to $N_{4}$ represent the numbers of stacked successive local and Deformable Bi-level Routing Attention blocks. See Table \ref{table-architecture} for specific configurations. Right: Details of the DeBiFormer block.
  • Figure 4: Grad-CAM Visualization of BiFormer-Base and DeBiFormer-Base. These images are sampled from the validation set of ImageNet-1K.
  • Figure 5: ERF visualization of models incorporating various local operators and SOTA methods. The results are obtained by averaging over 100 images (resized to 224×224) from ImageNet.
  • ...and 1 more figure
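
The region-level routing step described in the Figure 1(e) caption (search for the top-k relevant regions, then attend to their union) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each region is summarized by the mean of its token vectors and that region affinity is a plain dot product. It also treats regions as contiguous 1-D groups of tokens rather than 2-D windows. The names `region_means` and `topk_routing` are hypothetical and chosen only for this example.

```python
# Hedged sketch of region-level top-k routing; NOT the DeBiFormer code.
# Assumptions: regions are contiguous, equally sized 1-D token groups,
# each region is summarized by its mean vector, affinity is a dot product.

def region_means(tokens, num_regions):
    """Average each contiguous group of token vectors into one vector per region."""
    per_region = len(tokens) // num_regions
    means = []
    for r in range(num_regions):
        group = tokens[r * per_region:(r + 1) * per_region]
        dim = len(group[0])
        means.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return means


def topk_routing(queries, keys, num_regions, k):
    """For each query region, return indices of the k key regions with the highest affinity."""
    q_regions = region_means(queries, num_regions)
    k_regions = region_means(keys, num_regions)
    routed = []
    for q in q_regions:
        # Region-to-region affinity: dot product of the two region summaries.
        scores = [sum(a * b for a, b in zip(q, kr)) for kr in k_regions]
        order = sorted(range(num_regions), key=lambda j: scores[j], reverse=True)
        routed.append(order[:k])
    return routed


# Toy input: four regions of two tokens each, with 2-dimensional features.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
           [1.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
keys = [[2.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 3.0],
        [1.0, 1.0], [1.0, 1.0], [-4.0, 0.0], [-4.0, 0.0]]

routes = topk_routing(queries, keys, num_regions=4, k=1)
print(routes)
```

In a full attention layer, the tokens of each query region would then attend only to the keys and values gathered from its routed regions. That gathering step, and the deformable sampling described in the Figure 2 caption, are omitted here.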