RepViT: Revisiting Mobile CNN From ViT Perspective
Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
TL;DR
RepViT revisits lightweight CNN design by bringing block-level, macro, and micro design choices from lightweight ViTs into MobileNetV3-L, yielding a pure lightweight CNN optimized for mobile latency. Through latency-aware changes (separating the token and channel mixers, reducing the FFN expansion ratio, deeper yet efficient downsampling, early convolutions, and cross-block SE placement), RepViT reaches over 80% ImageNet top-1 accuracy at 1.0 ms latency on an iPhone 12 and performs strongly on downstream tasks. The work also reports that RepViT-SAM runs nearly 10× faster than MobileSAM with competitive zero-shot transfer, which points to practical deployment benefits for edge vision. Overall, RepViT shows that carefully adapted ViT design principles can yield mobile-friendly CNNs that outperform existing lightweight ViTs and CNNs on classification, detection, and segmentation.
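To make the effect of two of these changes concrete, the following is a minimal sketch (not the authors' implementation) that counts the weights in a block whose token mixer is a depthwise 3x3 convolution and whose channel mixer is a pair of 1x1 convolutions. The function names, the stage width of 64, and the expansion ratios 4 and 2 are illustrative assumptions, not values taken from the text above.

```python
def token_mixer_params(channels: int, kernel: int = 3) -> int:
    """Weights in a depthwise kernel x kernel conv: one filter per channel (biases ignored)."""
    return channels * kernel * kernel


def channel_mixer_params(channels: int, expansion: int) -> int:
    """Weights in a 1x1 expand conv followed by a 1x1 project conv (biases ignored)."""
    hidden = channels * expansion
    return channels * hidden + hidden * channels


channels = 64  # hypothetical stage width, chosen only for illustration
for expansion in (4, 2):  # illustrative expansion ratios (assumption)
    total = token_mixer_params(channels) + channel_mixer_params(channels, expansion)
    print(f"expansion={expansion}: {total} weights per block")
```

Because the channel mixer's cost grows linearly with the expansion ratio while the depthwise token mixer stays small, halving the ratio roughly halves the block's weights. Weight count is only a rough proxy here; on-device latency would have to be measured separately.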
Abstract
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance and lower latency than lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However, the notable architectural disparities between them in block structure, macro design, and micro design have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This results in a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which, to the best of our knowledge, is a first for a lightweight model. Besides, when RepViT meets SAM, our RepViT-SAM achieves nearly 10× faster inference than the advanced MobileSAM. Code and models are available at https://github.com/THU-MIG/RepViT.
