Separable Self-attention for Mobile Vision Transformers

Sachin Mehta, Mohammad Rastegari

TL;DR

The paper addresses the quadratic-cost self-attention bottleneck in mobile vision transformers by introducing separable self-attention, which has linear complexity $O(k)$ in the number of tokens $k$ and relies on hardware-friendly element-wise operations. Replacing MHA in MobileViT with this layer gives MobileViTv2, a lightweight hybrid CNN-ViT model that runs $3.2\times$ faster than MobileViT on a mobile device while improving ImageNet top-1 accuracy by about 1% (75.6% with ~3M parameters) and performing strongly on segmentation and detection. The approach is evaluated with ImageNet-1k training, ImageNet-21k-P pretraining, and downstream tasks; MobileViTv2 outperforms MobileViTv1 and rivals or surpasses other lightweight architectures, especially on mobile devices. The results indicate that attention mechanisms that avoid costly operations such as batch-wise matrix multiplication can narrow the latency gap between CNNs and ViTs for mobile applications. The code is publicly released.

Abstract

Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e., $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: https://github.com/apple/ml-cvnets
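
To make the $O(k^2)$ versus $O(k)$ contrast in the abstract concrete, the sketch below counts only the attention scores each approach has to produce: one $k \times k$ matrix per head for MHA, and one score per token for separable self-attention. The function names are illustrative and not taken from the paper's code, and the counts are not FLOP or latency estimates. The values $k=256$ and $h=8$ match the setting quoted in the Figure 1 caption.

```python
def mha_score_count(k, heads=1):
    """Number of attention scores in MHA: one k x k matrix per head."""
    return heads * k * k


def separable_score_count(k):
    """Number of context scores in separable self-attention: one per token."""
    return k


k = 256  # number of tokens
ratio = mha_score_count(k, heads=8) // separable_score_count(k)
print(mha_score_count(k, heads=8), separable_score_count(k), ratio)
```

Doubling $k$ quadruples the first count and only doubles the second, which is the scaling difference the abstract refers to.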

Paper Structure (35 sections, 3 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Comparison between different attention units. Transformer and Linformer use costly operations (batch-wise matrix multiplication) for computing self-attention. Such operations are a bottleneck for efficient inference on resource-constrained devices. The proposed method does not use such operations, thus accelerating inference on resource-constrained devices. Left compares the top-5 operations (sorted by CPU time) in a single layer of different attention units for $k=256$ tokens. Top Right compares the complexity of different attention units. Bottom Right compares the latency of different attention units as a function of the number of tokens $k$. These results are computed on a single CPU core of a machine with a 2.4 GHz 8-core Intel Core i9 processor, with $d=512$ (token dimensionality), $h=8$ (number of heads; for Transformer and Linformer), and $p=256$ (projected tokens in Linformer), using a publicly available profiler in PyTorch (paszke2019pytorch).
  • Figure 2: MobileViTv2 models are faster and better than MobileViTv1 models (mehta2022mobilevit) across different tasks. MobileViTv2 models are constructed by replacing multi-headed self-attention in MobileViTv1 with the proposed separable self-attention (see the section on separable self-attention). Here, inference time is measured on an iPhone12 for an input resolution of $256 \times 256$, $512 \times 512$, and $320 \times 320$ for classification, segmentation, and detection respectively.
  • Figure 3: Different self-attention units. (a) is a standard multi-headed self-attention (MHA) in transformers. (b) extends MHA in (a) by introducing token projection layers, which project $k$ tokens to a pre-defined number of tokens $p$, thus reducing the complexity from $O(k^2)$ to $O(k)$. However, it still uses costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices (Figure 1). (c) is the proposed separable self-attention layer that is linear in complexity, i.e., $O(k)$, and uses element-wise operations for faster inference.
  • Figure 4: Example illustrating the interaction between tokens to learn global representations in different attention layers. In (a), each query token computes the distance with all key tokens via dot-product. These distances are then normalized using softmax to produce an attention matrix $\mathbf{a}$, which encodes contextual relationships. In (b), the inner product between input tokens and latent token $L$ is computed. The resultant vector is normalized using softmax to produce context scores $\mathbf{c_s}$. These context scores are used to weight key tokens and produce a context vector $\mathbf{c_v}$, which encodes contextual information.
  • Figure 5: Context score maps at different output strides (OS) of MobileViTv2 model. Observe how context scores pay attention to semantically relevant image regions. (Left to right: input image, context scores at OS=8, context scores at OS=16, and context scores at OS=32). For more examples and details about context score map generation, see the visualization appendix.
  • ...and 4 more figures
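
The computation described in the Figure 4(b) caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `latent` vector standing in for the latent token $L$, and the `w_key` matrix standing in for the key projection are all hypothetical. The remaining steps of the layer (how $\mathbf{c_v}$ is combined with the tokens afterwards) are not described in the caption and are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(x, w):
    """Multiply each token (row of x, length d) by w (d x d_out)."""
    d_out = len(w[0])
    return [
        [sum(tok[i] * w[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in x
    ]


def context_vector(x, latent, w_key):
    """Return (context scores c_s, context vector c_v) for k tokens.

    x      : k tokens, each a list of d floats
    latent : hypothetical latent token L, a list of d floats
    w_key  : hypothetical d x d_out key projection
    """
    # c_s: inner product of every token with L, softmax over the k tokens.
    c_s = softmax([sum(a * b for a, b in zip(tok, latent)) for tok in x])
    # Key tokens, then c_v as their c_s-weighted sum.
    keys = project(x, w_key)
    d_out = len(keys[0])
    c_v = [sum(s * key[j] for s, key in zip(c_s, keys)) for j in range(d_out)]
    return c_s, c_v


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_s, c_v = context_vector(tokens, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(c_s, c_v)
```

Each token is visited a constant number of times and no $k \times k$ matrix is formed, so the cost of this step grows linearly with the number of tokens $k$.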