Fusion of regional and sparse attention in Vision Transformers

Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara

TL;DR

This work addresses the trade-off between local hierarchical interactions and global context in Vision Transformers by introducing Atrous Attention, a fusion of regional and sparse attention inspired by dilated convolution. The authors build ACC-ViT, a hybrid backbone that uses multi-dilation windows, adaptive gating, a shared MLP, and parallel atrous convolutions to maintain hierarchy while capturing global information. Empirically, ACC-ViT reaches ≈84% top-1 accuracy on ImageNet-1K with fewer than 28.5M parameters and outperforms MaxViT at similar FLOPs with fewer parameters; transfer experiments on medical datasets further demonstrate strong generalization. Grad-CAM visualizations suggest that the model attends more coherently to objects and their context. The approach has potential for broad applicability across vision tasks that require efficient, multi-scale context integration.

Abstract

Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions, in contrast to the global attention employed in the original ViT. Regional attention restricts pixel interactions to specific regions, while sparse attention disperses them across sparse grids. These differing approaches create a tension between maintaining hierarchical relationships and capturing global context. In this study, drawing inspiration from atrous convolution, we propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information while preserving hierarchical structures. Based on this, we introduce a versatile, hybrid vision transformer backbone called ACC-ViT, tailored for standard vision tasks. Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42% while requiring 8.4% fewer parameters.
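
The difference between regional, sparse, and dilated token grouping can be illustrated with a small NumPy sketch that only tracks which pixels end up in the same attention group. The function names (`window_partition`, `grid_partition`, `atrous_partition`), the toy 8 x 8 map, and the choice to window each dilated sub-map are illustrative assumptions, not the authors' implementation; the abstract does not specify these details.

```python
import numpy as np


def window_partition(x, p):
    """Regional pattern: split an (H, W) map into contiguous p x p windows."""
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3).reshape(-1, p, p)


def grid_partition(x, g):
    """Sparse pattern: g x g groups whose members are spread evenly over the map."""
    H, W = x.shape
    return x.reshape(g, H // g, g, W // g).transpose(1, 3, 0, 2).reshape(-1, g, g)


def atrous_partition(x, rate):
    """Dilated sampling (assumed form): keep every `rate`-th pixel per offset."""
    return np.stack([x[i::rate, j::rate] for i in range(rate) for j in range(rate)])


x = np.arange(64).reshape(8, 8)      # toy 8 x 8 "feature map" holding pixel ids
regional = window_partition(x, 4)    # 4 windows of neighbouring pixels
sparse = grid_partition(x, 4)        # 4 groups, each spanning the whole map
dilated = atrous_partition(x, 2)     # 4 sub-maps sampled with dilation 2

# Windowing a dilated sub-map mixes both ideas: grouped pixels are 2 apart,
# yet each group still covers a bounded neighbourhood of the original map.
blended = window_partition(dilated[0], 2)
print(blended[0])                    # pixel ids 0, 2, 16, 18
```

On this toy map the sparse grid and the dilation-2 sub-maps happen to coincide. They differ in general, because the grid pattern fixes the group size while the dilated pattern fixes the sampling rate.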

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Simplified illustrations of different types of windowed attention mechanisms.
  • Figure 2: ACC-ViT model architecture and its components.
  • Figure 3: ACC-ViT performs competitively against state-of-the-art models on ImageNet-1K.
  • Figure 4: Model Interpretation using Grad-CAM.
  • Figure 5: Confusion Matrix of the different models on the three datasets.