RepViT: Revisiting Mobile CNN From ViT Perspective

Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

TL;DR

RepViT revisits lightweight CNN design by importing ViT-inspired block, macro, and micro design choices into MobileNetV3-L, producing a pure lightweight CNN optimized for mobile latency. Through latency-aware changes (separating the token and channel mixers, reducing the FFN expansion ratio, using a deeper but still efficient downsampling layer, adopting early convolutions in the stem, and placing SE layers across blocks), RepViT reaches over 80% top-1 accuracy on ImageNet at about 1 ms latency on an iPhone 12 and performs strongly on downstream tasks. The work also reports that RepViT-SAM runs nearly 10x faster than MobileSAM with competitive zero-shot transfer, which matters for edge deployment. Overall, RepViT shows that carefully adapted ViT design principles can yield mobile-friendly CNNs that outperform existing lightweight ViTs and CNNs across classification, detection, and segmentation.

Abstract

Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance and lower latency than lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities between them in block structure, macro design, and micro design have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This results in a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is, to the best of our knowledge, a first for a lightweight model. In addition, when combined with SAM, our RepViT-SAM achieves nearly 10$\times$ faster inference than the advanced MobileSAM. Code and models are available at https://github.com/THU-MIG/RepViT.

Paper Structure

This paper contains 32 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Comparison of latency and accuracy between RepViT (Ours) and other models. Top-1 accuracy is evaluated on ImageNet-1K, and latency is measured on an iPhone 12 running iOS 16. RepViT achieves the best trade-off between performance and latency.
  • Figure 2: We modernize MobileNetV3-L at various granularities, mainly considering latency on mobile devices and top-1 accuracy on ImageNet-1K. The result is a new family of pure lightweight CNNs, namely RepViT, which achieves lower latency and higher performance. Note that these results are obtained without distillation.
  • Figure 3: Block design. (a) is a MobileNetV3 block with an optional squeeze-and-excitation (SE) layer. (b) is the designed RepViT block, which separates the token mixer and channel mixer through the structural re-parameterization technique. The SE layer is also optional in the RepViT block. The norm layer and nonlinearity are omitted for simplicity.
  • Figure 4: Macro design. (a) and (b), (c) and (d), and (e) and (f) show the designs for the stem, downsampling layer, and classifier, respectively. RepViT has four stages with $\frac{H}{4}\times \frac{W}{4}$, $\frac{H}{8}\times \frac{W}{8}$, $\frac{H}{16}\times \frac{W}{16}$, and $\frac{H}{32}\times \frac{W}{32}$ resolutions, respectively, where $H$ and $W$ denote the height and width of the input image. $C$ represents the channel dimension and $B$ denotes the batch size. The norm layer and nonlinearity are omitted.
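
The Figure 3 caption refers to structural re-parameterization. As a rough illustration of the general idea, the sketch below merges a 3x3 branch, a 1x1 branch, and an identity shortcut into a single 3x3 kernel, so that a multi-branch block used during training costs only one convolution at inference. This is a minimal single-channel sketch and not the authors' implementation: the branch layout, the function names (conv2d_same, fuse_branches, multi_branch), and the sample values are assumptions made for illustration, and normalization layers are left out, as in the figure.

```python
# Hypothetical sketch of multi-branch kernel fusion (structural re-parameterization).
# Assumptions: one channel (a depthwise conv treats each channel independently),
# stride 1, zero padding, no normalization layer.

def conv2d_same(image, kernel):
    """Cross-correlate a 2D list with a square, odd-sized kernel using zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, kernel3x3, weight1x1):
    """Training-time form: 3x3 branch + 1x1 branch + identity shortcut."""
    a = conv2d_same(image, kernel3x3)
    b = conv2d_same(image, [[weight1x1]])
    h, w = len(image), len(image[0])
    return [[a[i][j] + b[i][j] + image[i][j] for j in range(w)] for i in range(h)]


def fuse_branches(kernel3x3, weight1x1):
    """Inference-time form: fold the 1x1 weight and the identity into the 3x3 center."""
    fused = [row[:] for row in kernel3x3]
    fused[1][1] += weight1x1 + 1.0  # the identity acts as a 1x1 kernel with weight 1
    return fused


# Illustrative values only.
image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
k3 = [[0.25, -0.5, 0.25],
      [0.5, 1.0, -0.5],
      [-0.25, 0.75, 0.5]]
w1 = 0.5

reference = multi_branch(image, k3, w1)
fused_out = conv2d_same(image, fuse_branches(k3, w1))
max_diff = max(abs(r - f) for rr, fr in zip(reference, fused_out) for r, f in zip(rr, fr))
print("max abs difference between the two forms:", max_diff)
```

The fusion works because convolution is linear in its kernel: the three branches see the same input, so their kernels can be summed once the 1x1 weight and the identity are written as 3x3 kernels that are zero everywhere except the center.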