EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

TL;DR

EfficientViT addresses the high computational cost of high-resolution dense prediction by introducing a multi-scale linear attention mechanism that yields a global receptive field and multi-scale learning with hardware-friendly operations. By substituting heavy softmax attention with ReLU linear attention and augmenting it with local information via depthwise convolutions and multi-scale token aggregation, the approach achieves substantial speedups across semantic segmentation, super-resolution, and Segment Anything on mobile, edge, and cloud hardware while maintaining or improving accuracy. The paper presents EfficientViT as a backbone with flexible sizes and demonstrates strong results on Cityscapes, ADE20K, DIV2K, FFHQ, and ImageNet, including significant latency and throughput gains on diverse platforms. Overall, EfficientViT offers a practical, scalable solution for deploying high-resolution vision models in real-world settings.

Abstract

High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performance, our multi-scale linear attention achieves the global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.

Paper Structure (24 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Latency/Throughput vs. Performance. All performance results are obtained with the single model and single-scale inference. The GPU latency/throughput results are obtained on one edge GPU (Jetson AGX Orin) and one cloud GPU (A100) using TensorRT and fp16. EfficientViT consistently achieves a remarkable boost in speed on diverse hardware platforms while providing the same/higher performances on Cityscapes, ADE20K, and ImageNet than prior segmentation/classification models.
  • Figure 2: EfficientViT's Building Block (left) and Multi-Scale Linear Attention (right). Left: EfficientViT's building block consists of a multi-scale linear attention module and an FFN with depthwise convolution (FFN+DWConv). Multi-scale linear attention is responsible for capturing context information, while FFN+DWConv captures local information. Right: After getting Q/K/V tokens via the linear projection layer, we generate multi-scale tokens by aggregating nearby tokens via lightweight small-kernel convolutions. ReLU linear attention is applied to multi-scale tokens, and the outputs are concatenated and fed to the final linear projection layer for feature fusing.
  • Figure 3: Softmax Attention vs. ReLU Linear Attention. Unlike softmax attention, ReLU linear attention cannot produce sharp attention distributions because it lacks a non-linear similarity function. Its ability to extract local information is therefore weaker than that of softmax attention.
  • Figure 4: Latency Comparison Between Softmax Attention and ReLU Linear Attention. ReLU linear attention is 3.3-4.5$\times$ faster than softmax attention with similar computation, thanks to removing hardware-unfriendly operations (e.g., softmax). Latency is measured on the Qualcomm Snapdragon 855 CPU with TensorFlow-Lite, batch size 1, and fp32.
  • Figure 5: Macro Architecture of EfficientViT. We adopt the standard backbone-head/encoder-decoder design. We insert our EfficientViT modules in Stages 3 and 4 in the backbone. Following the common practice, we feed the features from the last three stages (P2, P3, and P4) to the head. We use addition to fuse these features for simplicity and efficiency. We adopt a simple head design that consists of several MBConv blocks and output layers.
  • ...and 2 more figures
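
The ReLU linear attention mentioned in the captions for Figures 2 to 4 can be illustrated with a small pure-Python sketch. It assumes the commonly used formulation in which the similarity between a query and a key is ReLU(q) · ReLU(k), normalized by the sum of similarities over all keys. The function names, the `eps` stabilizer, and the toy inputs are illustrative assumptions, not the authors' implementation. The sketch shows why the cost can be linear in the number of tokens: the key/value summary is built once and reused for every query, and the result matches a reference version that builds the full token-by-token similarity matrix.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear-cost form: accumulate sum_j relu(K_j)^T V_j once, reuse per query.

    Q, K are lists of N rows with d entries; V is a list of N rows with d_v entries.
    eps is an assumed stabilizer that avoids division by zero.
    """
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # d x d_v summary of keys and values
    k_sum = [0.0] * d                     # sum of relu(K_j) over all tokens
    for k_row, v_row in zip(K, V):
        for a in range(d):
            ka = relu(k_row[a])
            k_sum[a] += ka
            for b in range(d_v):
                kv[a][b] += ka * v_row[b]
    out = []
    for q_row in Q:
        q = [relu(x) for x in q_row]
        denom = sum(q[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def relu_attention_quadratic(Q, K, V, eps=1e-6):
    """Reference form with explicit per-pair similarities (quadratic in N)."""
    d_v = len(V[0])
    out = []
    for q_row in Q:
        q = [relu(x) for x in q_row]
        sims = [sum(qa * relu(ka) for qa, ka in zip(q, k_row)) for k_row in K]
        denom = sum(sims) + eps
        out.append([sum(s * v_row[b] for s, v_row in zip(sims, V)) / denom
                    for b in range(d_v)])
    return out


# Toy inputs (hypothetical values): 3 tokens, d = 2, d_v = 2.
Q = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
K = [[0.2, 1.0], [-0.5, 0.3], [2.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

fast = relu_linear_attention(Q, K, V)
slow = relu_attention_quadratic(Q, K, V)
max_diff = max(abs(x - y) for r1, r2 in zip(fast, slow) for x, y in zip(r1, r2))
```

Both functions compute the same quantity; only the order of the multiplications differs. Because the weights are plain non-negative dot products with no exponential, they tend to be spread more evenly across tokens than softmax weights, which is consistent with the Figure 3 caption's remark that this attention cannot produce sharp distributions.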