ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

Nabil Ibtehaz; Ning Yan; Masood Mortazavi; Daisuke Kihara

TL;DR

This work introduces Atrous Attention, a fusion of regional and sparse attention that adaptively consolidates local and global information while maintaining hierarchical relations. It also proposes ACC-ViT, a generalized hybrid vision transformer backbone that follows conventional practices for standard vision tasks.

Abstract

Transformers have risen to become the state-of-the-art vision architectures through innovations in attention mechanisms inspired by visual perception. At present, two classes of attention prevail in vision transformers: regional and sparse attention. The former confines pixel interactions to a region; the latter spreads them across sparse grids. Their opposing natures force a choice between preserving hierarchical relations and attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention that adaptively consolidates both local and global information while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized, hybrid vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks. Our tiny model achieves $\sim 84\%$ accuracy on ImageNet-1K with fewer than $28.5$ million parameters, a $0.42\%$ improvement over the state-of-the-art MaxViT with $8.4\%$ fewer parameters. In addition, we investigate the efficacy of the ACC-ViT backbone under different evaluation settings, namely finetuning, linear probing, and zero-shot learning, on tasks involving medical image analysis, object detection, and language-image contrastive learning. ACC-ViT is therefore a strong vision backbone that also remains competitive in its mobile-scale versions, making it well suited to niche applications with small datasets.
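To make the contrast between regional and sparse attention concrete, the following is a minimal NumPy sketch of how pixels might be grouped before attention is applied within each group. It is an illustration only, not the authors' implementation: the function names (`regional_windows`, `sparse_grids`, `atrous_windows`), the single-channel 8×8 toy map, and the choice to realize dilation by strided subsampling followed by ordinary windowing are all assumptions made here for clarity.

```python
import numpy as np

def regional_windows(x, w):
    """Group pixels into non-overlapping w x w windows (regional attention)."""
    H, W = x.shape
    # Each output row lists the pixels that would attend to one another.
    return x.reshape(H // w, w, W // w, w).transpose(0, 2, 1, 3).reshape(-1, w * w)

def sparse_grids(x, g):
    """Group pixels into g x g grids whose members are spread over the map."""
    H, W = x.shape
    # Members of a group are H // g rows and W // g columns apart.
    return x.reshape(g, H // g, g, W // g).transpose(1, 3, 0, 2).reshape(-1, g * g)

def atrous_windows(x, w, rate):
    """Hypothetical dilated grouping: subsample with stride `rate`, then window."""
    groups = [
        regional_windows(x[i::rate, j::rate], w)
        for i in range(rate)
        for j in range(rate)
    ]
    return np.concatenate(groups, axis=0)

# Toy feature map whose values are the flattened pixel indices.
x = np.arange(64).reshape(8, 8)

print(regional_windows(x, 2)[0])   # neighbouring pixels
print(sparse_grids(x, 2)[0])       # pixels spread across the map
print(atrous_windows(x, 2, 2)[0])  # pixels two steps apart
```

In this toy setting a dilation rate of 1 reproduces the regional grouping, and a rate of 4 coincides with the sparse grid grouping, which is one way to see why a dilation-based scheme can sit between the two extremes. How ACC-ViT fuses several such branches (the gating operation listed in the outline) is not shown here.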

Paper Structure (21 sections, 12 equations, 5 figures, 6 tables)

Introduction
Related Works
CNN vs ViT vs Hybrid models
Attention mechanisms for ViTs
Atrous Convolution
Methods
Atrous Attention
Gating Operation for Adaptive Fusion of Different Branches
Shared MLP Layer across Parallel Attentions
Parallel Atrous Inverted Residual Convolution
ACC-ViT
Experiments and Results
ImageNet Image Classification
Transfer Learning Experiment on Medical Image Datasets
Frozen Backbone Feature Extraction performance assessment on Object Detection
...and 6 more sections

Figures (5)

Figure 1: Simplified illustrations of different types of windowed attention mechanisms.
Figure 2: ACC-ViT performs competitively against both standard and small-scale models on ImageNet-1K.
Figure 3: ACC-ViT model architecture and its components.
Figure 4: Zero-shot performance assessment of the tiny size models on language-vision contrastive learning.
Figure 5: Model Interpretation using Grad-CAM.
