Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Jiashi Li, Xin Xia, Wei Li, Huixia Li, Xing Wang, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan

TL;DR

Next-ViT introduces deployment-friendly CNN-Transformer hybrids by designing the Next Convolution Block (NCB) and Next Transformer Block (NTB) and combining them with a novel Next Hybrid Strategy (NHS) to achieve a superior latency/accuracy trade-off on industrial hardware. The hierarchical architecture, the frequency-aware NTB, and the efficient multi-head convolutional attention (MHCA) in the NCB enable strong local/global feature fusion while maintaining low latency on TensorRT and CoreML. Extensive experiments across ImageNet-1K, ADE20K, and COCO demonstrate notable gains over CNNs, ViTs, and prior hybrids, highlighting practical impact for real-world deployment. The work also provides ablations, visualizations, and code to facilitate adoption and further research in hardware-efficient vision models.

Abstract

Due to their complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves performance comparable to CSWin while running inference 3.6x faster. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are publicly available at: https://github.com/bytedance/Next-ViT
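
To make the idea of stacking NCB and NTB more concrete, here is a minimal sketch of one way a hybrid stage layout could be written down. It assumes each stage repeats a group of several NCBs followed by a single NTB. The helper `hybrid_stage` and the per-stage counts are hypothetical and chosen only for illustration; they are not taken from the released code or from any reported model configuration.

```python
from typing import List


def hybrid_stage(num_ncb: int, num_groups: int = 1) -> List[str]:
    """Return block names for one stage.

    Assumed pattern (illustrative): a group of `num_ncb` NCBs followed by
    one NTB, repeated `num_groups` times.
    """
    if num_ncb < 0 or num_groups < 1:
        raise ValueError("num_ncb must be >= 0 and num_groups must be >= 1")
    group = ["NCB"] * num_ncb + ["NTB"]
    return group * num_groups


# Hypothetical (num_ncb, num_groups) per stage; not an official configuration.
stage_specs = [(2, 1), (3, 1), (4, 2), (2, 1)]
layout = [hybrid_stage(n, g) for n, g in stage_specs]

for index, blocks in enumerate(layout, start=1):
    print(f"stage {index}: {' -> '.join(blocks)}")
```

In this sketch every stage ends with a global (NTB) block after a run of local (NCB) blocks, so global information is mixed in at each resolution rather than only in the last stages.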

Paper Structure

This paper contains 26 sections, 7 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Comparison between Next-ViT and efficient networks in terms of the accuracy-latency trade-off.
  • Figure 2: The left column shows the overall hierarchical architecture of Next-ViT. The middle column shows the Next Convolution Block (NCB) and the Next Transformer Block (NTB). The right column shows detailed visualizations of multi-head convolutional attention (MHCA), efficient multi-head self-attention (E-MHSA) and the optimized MLP modules.
  • Figure 3: Comparison of different Transformer-based and convolution-based blocks.
  • Figure 4: Comparison of traditional hybrid strategies and NHS.
  • Figure 5: (a) Fourier spectrum of ResNet, Swin and Next-ViT. (b) Heat maps of the output features from ResNet, Swin and Next-ViT.