FasterViT: Fast Vision Transformers with Hierarchical Attention

Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

TL;DR

FasterViT addresses the high compute cost of global self-attention in vision transformers by introducing Hierarchical Attention (HAT), a multi-level, carrier-token–assisted mechanism that enables efficient cross-window communication. The architecture fuses CNN blocks in early stages with transformer blocks later, and uses HAT to capture long-range dependencies without sacrificing image throughput. Across ImageNet-1K, MS COCO, and ADE20K, FasterViT achieves state-of-the-art Pareto front performance and scales well with ImageNet-21K pretraining. The work also demonstrates HAT’s plug-and-play utility and provides extensive ablations on token size and architectural variants, underscoring its practical impact for high-resolution vision tasks.

Abstract

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention. Each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attention enables efficient cross-window communication at lower cost. FasterViT achieves a SOTA Pareto-front in terms of accuracy and image throughput. We have extensively validated its effectiveness on various CV tasks including classification, object detection and segmentation. We also show that HAT can be used as a plug-and-play module to enhance existing networks. We further demonstrate significantly faster and more accurate performance than competitive counterparts for images with high resolution. Code is available at https://github.com/NVlabs/FasterViT.
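
The three steps the abstract describes (window-based self-attention, a dedicated carrier token per window, and global attention among carrier tokens) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names are hypothetical, the carrier tokens are initialized by mean pooling as an assumption, and learned Q/K/V projections, multiple heads, MLPs, normalization, and residual connections are all omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    d = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, tokens))
                    for j in range(d)])
    return out

def hierarchical_attention(windows):
    """windows: list of windows, each a list of d-dimensional tokens."""
    # Step 1 (assumption: mean pooling): one carrier token summarizes each window.
    carriers = [[sum(col) / len(win) for col in zip(*win)] for win in windows]
    # Step 2: global attention among carrier tokens only (cross-window exchange).
    carriers = self_attention(carriers)
    # Step 3: local attention over each window's tokens plus its own carrier token.
    new_windows, new_carriers = [], []
    for win, carrier in zip(windows, carriers):
        mixed = self_attention(win + [carrier])
        new_windows.append(mixed[:-1])   # updated local tokens
        new_carriers.append(mixed[-1])   # updated carrier token
    return new_windows, new_carriers

# Two windows, two tokens each, feature dimension 2 (illustrative values).
windows = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
out_windows, out_carriers = hierarchical_attention(windows)
```

In this sketch, tokens in one window never attend directly to tokens in another. Information crosses windows only through the carrier tokens, which is why changing the contents of the second window also changes the outputs of the first.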

Paper Structure

This paper contains 39 sections, 8 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Comparison of image throughput and ImageNet-1K Top-1 accuracy. Throughput is measured on an A100 GPU with a batch size of 128.
  • Figure 2: Visualization of the proposed Hierarchical Attention in the feature space. By performing local window attention and hierarchical attention, we can achieve global information propagation at reduced cost.
  • Figure 3: Overview of the FasterViT architecture. We use a multi-scale architecture with CNN-based blocks in stages 1 and 2 and transformer-based blocks in stages 3 and 4. Best viewed in color.
  • Figure 4: Proposed Hierarchical Attention block. Carrier tokens (CT) learn a summary of each local window and facilitate global information exchange between local windows. Local window tokens only have access to a dedicated subset of CT for efficient attention. CT undergo full self-attention to enable cross-window attention. "Attention" stands for multi-head self-attention (MHSA; Vaswani et al., 2017) and "MLP" for multi-layer perceptron. Best viewed in color.
  • Figure 5: Attention map comparison for a feature map of size $H\times H \times d$. The legend distinguishes no attention, normal token attention, carrier token attention, and random token attention. Full attention (a) has a complexity of $O(H^4d)$; windowed attention significantly reduces it to $O(k^2H^2d)$ but lacks global context.
  • ...and 9 more figures
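
The complexity figures in the Figure 5 caption follow from counting token pairs: an $H\times H$ feature map has $H^2$ tokens, so full attention compares $H^2 \cdot H^2 = H^4$ pairs of $d$-dimensional vectors, while windowed attention with $k\times k$ windows lets each of the $H^2$ tokens attend to only $k^2$ others. The short sketch below evaluates both expressions with constant factors dropped. The function names and the values $H=56$, $k=7$, $d=64$ are illustrative assumptions, not settings taken from the paper.

```python
def full_attention_cost(H, d):
    """O(H^4 d): every one of the H^2 tokens attends to all H^2 tokens."""
    return H**4 * d

def windowed_attention_cost(H, k, d):
    """O(k^2 H^2 d): every one of the H^2 tokens attends to k^2 tokens."""
    return k**2 * H**2 * d

# Illustrative values (assumptions, not taken from the paper).
H, k, d = 56, 7, 64
full = full_attention_cost(H, d)
windowed = windowed_attention_cost(H, k, d)
ratio = full // windowed  # equals (H / k) ** 2 = 64 for these values
```

The saving is the factor $H^2/k^2$, so it grows quadratically with resolution for a fixed window size. This is the cost reduction that windowed attention buys at the expense of global context.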