RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

Mingshu Zhao, Yi Luo, Yong Ouyang

TL;DR

RepNeXt presents a lightweight, multi-scale CNN backbone that fuses local convolutional processing with multi-scale global representations through chunk and copy convolutions, aided by structural reparameterization (SRP). The method achieves competitive ImageNet accuracy with significantly lower mobile latency and demonstrates strong transfer to object detection and semantic segmentation. Key contributions include a consistent four-stage architecture, a multi-branch SRP design, and a reparameterized medium-kernel convolution that emulates the human fovea, enabling efficient large-receptive-field modeling without heavy attention mechanisms. This work advances practical mobile vision by offering a simple, efficient backbone that rivals more complex architectures with fewer parameters and lower latency.

Abstract

In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at capturing global context through self-attention mechanisms, their deployment in resource-limited environments is hindered by computational complexity and latency. Conversely, lightweight CNNs are favored for their parameter efficiency and low latency. This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications. We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt's superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5's 82.3% accuracy on ImageNet within 1.5 ms on an iPhone 12, outperforms its AP$^{box}$ by 1.3 on MS-COCO, and reduces parameters by 0.7M. Code and models are available at https://github.com/suous/RepNeXt.
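
The parallel form of SRP mentioned in the abstract relies on the linearity of convolution: the outputs of several parallel branches can be summed at training time, and the branch kernels can be summed into one kernel for inference. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. It assumes a single channel, stride 1, zero "same" padding, and two branches (3x3 and 1x1) without bias or normalization; the helper names `conv2d_same`, `pad_kernel`, and `merge_branches` are hypothetical.

```python
# Sketch of parallel structural reparameterization (assumed setup, see lead-in).
# Helper names are illustrative and do not come from the RepNeXt codebase.

def conv2d_same(image, kernel):
    """Cross-correlate a 2-D list with an odd square kernel, zero 'same' padding."""
    k = len(kernel)
    r = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def pad_kernel(kernel, size):
    """Center a small odd kernel inside a size x size kernel of zeros."""
    k = len(kernel)
    off = (size - k) // 2
    padded = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            padded[i + off][j + off] = kernel[i][j]
    return padded


def merge_branches(kernels):
    """Sum parallel branch kernels into one kernel of the largest size."""
    size = max(len(k) for k in kernels)
    merged = [[0.0] * size for _ in range(size)]
    for kern in kernels:
        padded = pad_kernel(kern, size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += padded[i][j]
    return merged


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
k3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]  # 3x3 branch
k1 = [[2.0]]                                                # 1x1 branch

# Training-time view: run both branches and add their outputs.
out3 = conv2d_same(image, k3)
out1 = conv2d_same(image, k1)
multi_branch = [[out3[i][j] + out1[i][j] for j in range(4)] for i in range(4)]

# Inference-time view: one convolution with the merged kernel.
single_branch = conv2d_same(image, merge_branches([k3, k1]))

max_abs_diff = max(abs(multi_branch[i][j] - single_branch[i][j])
                   for i in range(4) for j in range(4))
print(max_abs_diff)
```

Because the two views agree, extra branches widen the network during training while the deployed model still runs a single convolution per layer.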

Paper Structure (17 sections, 8 equations, 3 figures, 5 tables, 2 algorithms)

Figures (3)

  • Figure 1: Latency vs. Accuracy Comparison. The top-1 accuracy is tested on ImageNet-1K and the latency is measured on an iPhone 12 running iOS 16 across 20 experimental sets. RepNeXt consistently achieves the best trade-off between performance and latency.
  • Figure 2: (left) The macro architecture of RepNeXt. RepNeXt adopts a four-stage hierarchical design, starting with two $3\times3$ convolutions with a stride of $2$. Here $C_i$ denotes the channel dimension at stage $i$, while $H$ and $W$ denote the image height and width, respectively. (right) The micro design of the MetaNeXt and Downsampling blocks. The MetaNeXt block (liu2022convnet; yu2023inceptionnext) includes a token mixer for spatial feature extraction, a normalization layer for training stability, and a channel mixer for channel information interaction. The token mixer employs a multi-scale reparameterized depthwise convolution, in which the medium-kernel branch consists of five different kernel patterns that mimic the central-vision enhancement of the human eye. The normalization layer is a Batch Normalization (batch_norm) layer, and the channel mixer is an MLP module consisting of two $1\times1$ pointwise convolution layers with a GELU (hendrycks2016gelu) activation function in between. Additionally, the Downsampling layer is a specialized version of the MetaNeXt block with a simplified token mixer.
  • Figure 3: Grad-CAM on the MS-COCO validation dataset for RepViT-M2.3, SwiftFormer-L3, FastViT-SA24 and RepNeXt-M5. RepNeXt captures local details similar to RepViT while providing a global perspective comparable to FastViT.
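
The Figure 2 caption pairs a reparameterized depthwise convolution with a Batch Normalization layer. A common way to remove such a normalization layer at inference is to fold its affine transform into the preceding convolution. The sketch below illustrates that folding for one channel. It assumes the Batch Normalization layer directly follows the convolution and uses fixed running statistics; whether RepNeXt fuses exactly this pair is not stated in the text above, and the names `fuse_conv_bn`, `conv_at`, and `batch_norm` are hypothetical.

```python
import math

# Sketch of folding Batch Normalization into a preceding convolution
# (single channel; assumed layer order and fixed statistics, see lead-in).


def fuse_conv_bn(kernel, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (kernel, bias) of one convolution equivalent to conv followed by BN."""
    scale = gamma / math.sqrt(var + eps)
    fused_kernel = [[w * scale for w in row] for row in kernel]
    fused_bias = (bias - mean) * scale + beta
    return fused_kernel, fused_bias


def conv_at(patch, kernel, bias):
    """Convolution output at one position: dot product of patch and kernel, plus bias."""
    total = 0.0
    for patch_row, kernel_row in zip(patch, kernel):
        for p, w in zip(patch_row, kernel_row):
            total += p * w
    return total + bias


def batch_norm(value, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization of a scalar with fixed statistics."""
    return gamma * (value - mean) / math.sqrt(var + eps) + beta


# Illustrative values only; they are not taken from any trained model.
patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]
bias, gamma, beta, mean, var = 0.5, 1.5, -0.2, 0.3, 4.0

# Two steps: convolution, then batch normalization.
two_step = batch_norm(conv_at(patch, kernel, bias), gamma, beta, mean, var)

# One step: a single convolution with the folded parameters.
fused_kernel, fused_bias = fuse_conv_bn(kernel, bias, gamma, beta, mean, var)
one_step = conv_at(patch, fused_kernel, fused_bias)

print(abs(two_step - one_step))
```

The folded kernel and bias reproduce the two-step result up to floating-point rounding, so the normalization adds no cost at inference under these assumptions.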