Separable Self-attention for Mobile Vision Transformers

Sachin Mehta; Mohammad Rastegari

TL;DR

The paper addresses the quadratic self-attention bottleneck in mobile vision transformers by introducing separable self-attention, which has linear complexity $O(k)$ in the number of tokens $k$ and relies on hardware-friendly element-wise operations. Used as a replacement for MHA, it yields MobileViTv2, a lightweight hybrid CNN-ViT model that is substantially faster on mobile devices while keeping competitive ImageNet accuracy (75.6% top-1 with about 3M parameters) and strong performance on segmentation and detection. The approach is evaluated with ImageNet-1k training from scratch, ImageNet-21k-P pre-training, and downstream tasks; MobileViTv2 outperforms MobileViTv1 and rivals or surpasses other lightweight architectures, especially on mobile devices. The results suggest that attention built from inexpensive element-wise operations can narrow the latency gap between CNNs and ViTs for mobile applications. The code is publicly released.

Abstract

Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e., $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: https://github.com/apple/ml-cvnets
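To make the $O(k^2)$ versus $O(k)$ contrast concrete, the following minimal sketch counts how many attention scores each scheme computes for $k$ tokens. It assumes, following the Figure 4 caption, that standard self-attention scores every query token against every key token, while separable self-attention scores each token against a single latent token. The function name `attention_score_counts` is hypothetical and used only for illustration; the counts ignore the token dimensionality and the number of heads.

```python
def attention_score_counts(k: int) -> tuple[int, int]:
    """Return (scores in standard self-attention, scores in separable self-attention).

    Illustrative count only: it ignores the token dimensionality d and the
    number of heads, which scale both schemes by constant factors.
    """
    pairwise = k * k   # every query token is compared with every key token
    separable = k      # every token is compared with one latent token
    return pairwise, separable


if __name__ == "__main__":
    for k in (64, 256, 1024):
        pairwise, separable = attention_score_counts(k)
        print(f"k={k}: pairwise={pairwise}, separable={separable}, "
              f"ratio={pairwise // separable}")
```

The ratio between the two counts is $k$ itself, so the gap widens as the number of tokens (for example, the input resolution) grows.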

Paper Structure (35 sections, 3 equations, 9 figures, 10 tables)

Introduction
Related work
Improving self-attention
Improving transformer-based models
Other methods
MobileViTv2
Overview of multi-headed self-attention
Separable self-attention
Comparison with self-attention methods
Relationship with additive attention
MobileViTv2 architecture
Experimental results
Object classification on the ImageNet dataset
Training on ImageNet-1k from scratch
Pre-training on ImageNet-21k-P and finetuning on ImageNet-1k
...and 20 more sections

Figures (9)

Figure 1: Comparison between different attention units. Transformer and Linformer use costly operations (batch-wise matrix multiplication) for computing self-attention. Such operations are a bottleneck for efficient inference on resource-constrained devices. The proposed method does not use such operations, thus accelerating inference on resource-constrained devices. Left: the top-5 operations (sorted by CPU time) in a single layer of different attention units for $k=256$ tokens. Top right: the complexity of different attention units. Bottom right: the latency of different attention units as a function of the number of tokens $k$. These results are computed on a single CPU core of a machine with a 2.4 GHz 8-core Intel Core i9 processor, with $d=512$ (token dimensionality), $h=8$ (number of heads; for Transformer and Linformer), and $p=256$ (projected tokens in Linformer), using a publicly available profiler in PyTorch (Paszke et al., 2019).
Figure 2: MobileViTv2 models are faster and better than MobileViTv1 models (Mehta and Rastegari, 2022) across different tasks. MobileViTv2 models are constructed by replacing multi-headed self-attention in MobileViTv1 with the proposed separable self-attention (see the section "Separable self-attention"). Here, inference time is measured on an iPhone 12 for an input resolution of $256 \times 256$, $512 \times 512$, and $320 \times 320$ for classification, segmentation, and detection respectively.
Figure 3: Different self-attention units. (a) is a standard multi-headed self-attention (MHA) in transformers. (b) extends MHA in (a) by introducing token projection layers, which project $k$ tokens to a pre-defined number of tokens $p$, thus reducing the complexity from $O(k^2)$ to $O(k)$. However, it still uses costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices (Figure 1). (c) is the proposed separable self-attention layer that is linear in complexity, i.e., $O(k)$, and uses element-wise operations for faster inference.
Figure 4: Example illustrating the interaction between tokens to learn global representations in different attention layers. In (a), each query token computes the distance with all key tokens via dot-product. These distances are then normalized using softmax to produce an attention matrix $\mathbf{a}$, which encodes contextual relationships. In (b), the inner product between input tokens and latent token $L$ is computed. The resultant vector is normalized using softmax to produce context scores $\mathbf{c_s}$. These context scores are used to weight key tokens and produce a context vector $\mathbf{c_v}$, which encodes contextual information.
Figure 5: Context score maps at different output strides (OS) of MobileViTv2 model. Observe how context scores pay attention to semantically relevant image regions. (Left to right: input image, context scores at OS=8, context scores at OS=16, and context scores at OS=32). For more examples and details about context score map generation, see the appendix on visualization.
...and 4 more figures
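
The computation described in the Figure 4(b) caption can be sketched in NumPy as follows. This is a minimal, single-example illustration and not the authors' implementation (which is in the linked repository). The function and parameter names (`separable_context`, `latent`, `w_key`, `w_value`, `w_out`) are hypothetical. The first function follows the caption: inner products with a latent token, softmax to obtain context scores, then a weighted sum of key tokens to obtain the context vector. The second function, which applies a ReLU value branch, a broadcast element-wise product, and an output projection, is not described in the captions above and is an assumption of this sketch.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = 0) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def separable_context(x: np.ndarray, latent: np.ndarray, w_key: np.ndarray):
    """Context scores and context vector, as described for Figure 4(b).

    x:      (k, d) input tokens
    latent: (d,)   latent token L
    w_key:  (d, d) key projection (hypothetical parameter name)
    """
    scores = softmax(x @ latent, axis=0)               # (k,) one score per token
    keys = x @ w_key                                   # (k, d) key tokens
    context_vector = (scores[:, None] * keys).sum(0)   # (d,) weighted sum of keys
    return scores, context_vector


def separable_self_attention(x, latent, w_key, w_value, w_out):
    """Assumed final step: share the context vector with every token.

    The ReLU value branch and output projection are assumptions of this
    sketch; they are not stated in the figure captions.
    """
    _, context_vector = separable_context(x, latent, w_key)
    values = np.maximum(x @ w_value, 0.0)              # (k, d) ReLU value branch
    return (values * context_vector) @ w_out           # broadcast over the k tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d = 256, 64
    x = rng.standard_normal((k, d))
    latent = rng.standard_normal(d)
    w_key, w_value, w_out = (rng.standard_normal((d, d)) for _ in range(3))
    out = separable_self_attention(x, latent, w_key, w_value, w_out)
    print(out.shape)
```

Note that no $k \times k$ matrix is ever formed: the only quantities that scale with the number of tokens are the length-$k$ score vector and the $k \times d$ key and value matrices, which is where the linear complexity comes from.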
