SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun, Youngmin Ro

TL;DR

This paper tackles the memory and computation bottlenecks of Vision Transformers by identifying redundancies in both macro design (patch embedding and stage count) and micro design (multi-head attention). It proposes SHViT, which combines a memory-efficient macro design using a large-stride $16 \times 16$ patch embedding and a three-stage hierarchy with a single-head self-attention (SHSA) module applied to partial channels, complemented by depthwise convolutions. The resulting SHViT family achieves superior speed-accuracy tradeoffs on ImageNet-1K and delivers competitive or faster backbones for COCO object detection and segmentation across GPUs, CPUs, and mobile devices. The work demonstrates substantial gains in throughput and reduced memory access costs while maintaining high accuracy, suggesting broad practical impact for real-time vision tasks on constrained hardware.
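The idea of single-head self-attention on partial channels can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the random projection matrices `w_q`, `w_k`, `w_v`, and all sizes (196 tokens, 64 channels, 16 attended channels, 8 query/key dimensions) are assumptions chosen for illustration, and any normalization, output projection, or depthwise convolution the actual module uses is omitted.

```python
import numpy as np


def single_head_partial_attention(x, w_q, w_k, w_v, partial_dim):
    """Apply one attention head to the first `partial_dim` channels only.

    x: array of shape (num_tokens, channels).
    The remaining channels are passed through unchanged and concatenated back.
    """
    x_attn, x_rest = x[:, :partial_dim], x[:, partial_dim:]

    q = x_attn @ w_q  # (num_tokens, qk_dim)
    k = x_attn @ w_k  # (num_tokens, qk_dim)
    v = x_attn @ w_v  # (num_tokens, partial_dim)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    attended = weights @ v  # (num_tokens, partial_dim)
    return np.concatenate([attended, x_rest], axis=-1)


# Hypothetical sizes, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
num_tokens, channels, partial_dim, qk_dim = 196, 64, 16, 8

x = rng.standard_normal((num_tokens, channels))
w_q = 0.1 * rng.standard_normal((partial_dim, qk_dim))
w_k = 0.1 * rng.standard_normal((partial_dim, qk_dim))
w_v = 0.1 * rng.standard_normal((partial_dim, partial_dim))

out = single_head_partial_attention(x, w_q, w_k, w_v, partial_dim)
print(out.shape)
```

Because only `partial_dim` of the channels enter the attention computation, the query, key, and value projections shrink accordingly, while the untouched channels cost nothing beyond the concatenation.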

Abstract

Recently, efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally, they use 4×4 patch embeddings and a 4-stage structure at the macro level, while utilizing sophisticated attention with a multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using a larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore, our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions, and several attention heads in the latter stages are computationally redundant. To handle this, we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by combining global and local information in parallel. Building upon our solutions, we introduce SHViT, a Single-Head Vision Transformer that obtains a state-of-the-art speed-accuracy tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3×, 8.1×, and 2.4× faster than MobileViTv2 ×1.0 on GPU, CPU, and an iPhone 12 mobile device, respectively, while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using a Mask R-CNN head, our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8× and 2.0× lower backbone latency on GPU and mobile device, respectively.
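To make the effect of the larger-stride stem concrete, the short sketch below counts the tokens produced by a stride-4 versus a stride-16 patch embedding. The 224×224 input resolution and the helper names are assumptions for illustration, and the sketch assumes the output side length is simply the image side divided by the stride (that is, padding does not change the token grid).

```python
def token_count(image_size: int, stride: int) -> int:
    """Number of tokens for a square image, assuming side = image_size // stride."""
    side = image_size // stride
    return side * side


def attention_map_entries(num_tokens: int) -> int:
    """Entries in one full self-attention map (num_tokens x num_tokens)."""
    return num_tokens * num_tokens


image_size = 224  # assumed input resolution, not stated in this section

tokens_stride_4 = token_count(image_size, 4)    # conventional 4x4 patchify stem
tokens_stride_16 = token_count(image_size, 16)  # larger-stride 16x16 stem

token_ratio = tokens_stride_4 // tokens_stride_16
attention_ratio = (
    attention_map_entries(tokens_stride_4) // attention_map_entries(tokens_stride_16)
)

print(tokens_stride_4, tokens_stride_16, token_ratio, attention_ratio)
```

Under these assumptions the stride-16 stem starts from 16 times fewer tokens, and a full attention map over those tokens would be 256 times smaller, which gives a rough sense of why the first-stage memory access cost drops.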

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Comparison of throughput and accuracy between our SHViT and other recent methods.
  • Figure 2: Macro design analysis. All stages are composed of MetaFormer blocks [yu2022metaformer]. The stages depicted in blue and red use depthwise convolution and attention layers as the token mixer, respectively. In the table below, the macro design numbers represent the number of channels, while the numbers in parentheses indicate the number of blocks.
  • Figure 3: Multi-head redundancy analysis on DeiT [deit]. To better analyze head redundancy, we increase the number of heads in DeiT-T from 3 to 6 and retrain the model. We compute the attention maps and calculate the average cosine similarity between each head in different layers across 128 test samples from ImageNet. The importance of each head is determined by its score when it is removed and when it is left alone. Zoom in for better visibility.
  • Figure 4: Multi-head redundancy analysis on Swin [liu2021swin]. We scale down Swin-T by halving its width. Left: the average cosine similarity. Right: head masking results. The results are derived in the same way as in the DeiT experiment (Figure 3).
  • Figure 5: Overview of Single-Head Vision Transformer (SHViT). The model starts with a $16 \times 16$ overlapping patch embedding layer and uses single-head attention layers in the latter stages to efficiently compute global dependencies. See text for details.
  • ...and 5 more figures