ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention

Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang

TL;DR

ViG introduces Gated Linear Attention (GLA) for vision to achieve global receptive fields with linear complexity, addressing the inefficiency of traditional Transformers on high-resolution imagery. It adds Bidirectional GLA (BiGLA) with direction-wise gating and a 2D gating locality injection to fuse 1D global context with 2D local details, all implemented with a hardware-conscious single-kernel design. The approach yields strong accuracy with lower parameters and FLOPs across ImageNet, COCO, and ADE20K, outperforming both Transformer- and CNN-based baselines, especially at larger resolutions. While ViG shows clear practical advantages, it acknowledges a small gap to DeiT on very small inputs and points to future optimizations to further close this gap.
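To make the linear-versus-quadratic contrast concrete, the short sketch below counts tokens at two input sizes and compares how a term that grows with N (linear attention) and a term that grows with N^2 (softmax attention) scale between them. The patch size of 16 and the helper names are assumptions made for illustration; the ratios are pure scaling arithmetic and are not meant to reproduce the FLOPs figures reported in the paper.

```python
def num_tokens(side, patch=16):
    """Number of patch tokens for a square image (patch=16 is an assumed value)."""
    return (side // patch) ** 2


def growth(side_small, side_large, patch=16):
    """Return how much a linear-in-N and a quadratic-in-N cost grow between two sizes."""
    n_small = num_tokens(side_small, patch)
    n_large = num_tokens(side_large, patch)
    linear = n_large / n_small
    quadratic = linear ** 2
    return linear, quadratic


if __name__ == "__main__":
    linear, quadratic = growth(224, 1024)
    print(f"tokens: {num_tokens(224)} -> {num_tokens(1024)}")
    print(f"linear term grows {linear:.1f}x, quadratic term grows {quadratic:.1f}x")
```

Under these assumptions the sequence grows from 196 to 4096 tokens, so the gap between the two cost terms widens quickly as resolution increases.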

Abstract

Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at https://github.com/hustvl/ViG.
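As a rough illustration of the idea behind gated linear attention with bidirectional modeling, the sketch below implements the standard gated recurrence (decay a fixed-size state with a gate, add the key-value outer product, read out with the query) and runs it once in each direction. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the two directions are merged by a plain sum (an assumption), the 2D locality injection is omitted, and the two passes run separately rather than in the single fused kernel the abstract describes.

```python
def gla_scan(q, k, v, g):
    """Causal gated linear attention over one sequence (illustrative sketch).

    q, k, g: lists of T vectors of size d_k; v: list of T vectors of size d_v.
    g holds per-step decay gates in (0, 1], one per key channel.
    """
    d_k, d_v = len(k[0]), len(v[0])
    # Fixed-size memory: its shape does not depend on the sequence length T.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory, then write the new key-value outer product.
                state[i][j] = g_t[i] * state[i][j] + k_t[i] * v_t[j]
        # Read out: o_t = q_t @ state.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def bidirectional_gla(q, k, v, g_fwd, g_bwd):
    """Run the scan in both directions with separate gates and sum the results.

    Summing the two directions is an assumption made for this sketch.
    """
    fwd = gla_scan(q, k, v, g_fwd)
    bwd = gla_scan(q[::-1], k[::-1], v[::-1], g_bwd[::-1])[::-1]
    return [[a + b for a, b in zip(f, r)] for f, r in zip(fwd, bwd)]


if __name__ == "__main__":
    q = k = [[1.0], [1.0], [1.0]]
    v = [[1.0], [2.0], [3.0]]
    ones = [[1.0], [1.0], [1.0]]
    print(bidirectional_gla(q, k, v, ones, ones))
```

Each step touches only the fixed-size state, so the cost grows linearly with the number of tokens. With every gate set to 1, each output in this toy example aggregates all values in the sequence (the current token is counted once per direction), which shows how two 1D scans together give every token a global view.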

Paper Structure (20 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Performance comparisons of (a) non-hierarchical architectures [vrwkv, deit, vim, nguyen2022s4nd] and (b) hierarchical architectures [vmamba, convnext, swin, nguyen2022s4nd, regnet, resnet] on ImageNet-1K. Our proposed non-hierarchical ViG and hierarchical ViG-H demonstrate superior performance compared to the popular models in terms of parameters and accuracy. Particularly, the proposed basic ViG block achieves global receptive field with linear complexity, while the CNN [resnet, regnet, convnext], vanilla softmax attention [deit], and window-attention-based [swin] blocks cannot.
  • Figure 2: The overall architecture of ViG. We follow ViT [vit] to build the architecture by first transforming the input image into a sequence of patches and then feeding it into $N$ basic ViG blocks. The proposed ViG block consists of RMSNorm [zhang2019root], the proposed linear complexity spatial mixing layer, and a SwiGLU Feed Forward Network [shazeer2020glu].
  • Figure 3: Illustration of BiGLA.
  • Figure 4: Comparison among ViG, Vim [vim], VRWKV [vrwkv], and ViT [vit, deit] in (a) FLOPs, (b) memory, (c) latency, and (d) accuracy with respect to increasing image resolution during inference on ImageNet-1K val set. The blue dashed line indicates the estimated values when the GPU memory has run out. We benchmark the latency with the maximum batch size that can make models runnable on the GPU to ensure full GPU utilization and provide available results at high resolutions.
  • Figure 5: Visualization of attention maps.
  • ...and 1 more figure
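The block composition named in the Figure 2 caption (RMSNorm, a spatial mixing layer, and a SwiGLU feed-forward network) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the pre-norm residual layout, the parameter dictionary keys, and every function name are hypothetical, and the spatial mixing layer is passed in as a placeholder callable rather than implemented.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide a vector by its root-mean-square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])


def vig_block(tokens, mix, p):
    """One block over a list of token vectors.

    mix: placeholder callable standing in for the spatial mixing layer.
    The residual connections around each sub-layer are an assumption.
    """
    mixed = mix([rms_norm(t, p["norm1"]) for t in tokens])
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    out = []
    for t in tokens:
        h = swiglu_ffn(rms_norm(t, p["norm2"]), p["W_gate"], p["W_up"], p["W_down"])
        out.append([a + b for a, b in zip(t, h)])
    return out
```

Keeping the mixing layer as a callable makes it easy to swap in any sequence mixer of the same shape when experimenting with the block layout.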