RepVGG: Making VGG-style ConvNets Great Again

Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, Jian Sun

TL;DR

RepVGG tackles the speed-accuracy dilemma by decoupling training-time optimization from inference-time structure using structural re-parameterization. Each training-time multi-branch block (a 3×3 conv plus identity and 1×1 branches) is converted into a single 3×3 conv for deployment, so the inference-time network is a plain stack of 3×3 convs and ReLUs that is simple, fast, and memory-efficient. On ImageNet, RepVGG surpasses ResNets and is competitive with state-of-the-art models such as EfficientNet and RegNet; it also improves semantic segmentation on Cityscapes when used as a backbone. Ablations validate the necessity of the re-parameterization and of the BN placement. The work emphasizes hardware-friendly design and offers a practical path to high-performance plain ConvNets without heavy architecture search.
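
The conversion mentioned above can be written as a short worked identity. This is a sketch under stated assumptions rather than the paper's exact formulation: the symbols are chosen here for illustration, $\sigma_i$ stands for the BN standard deviation of output channel $i$ (in practice $\sqrt{\mathrm{var}_i + \epsilon}$), the convs are assumed bias-free, and the identity branch is treated as a 1×1 conv whose kernel is the identity matrix.

```latex
% Step 1: fold an inference-time BN (mean \mu, std \sigma, scale \gamma, shift \beta)
% into the conv that precedes it, per output channel i.
W'_{i} = \frac{\gamma_i}{\sigma_i}\, W_{i}, \qquad
b'_{i} = \beta_i - \frac{\mu_i \gamma_i}{\sigma_i}

% Step 2: add the three folded branches (3x3, 1x1, identity) into one 3x3 kernel.
% pad_{3x3} places a 1x1 kernel at the centre of an all-zero 3x3 kernel.
W^{\mathrm{merged}} = W'^{(3)} + \operatorname{pad}_{3\times3}\bigl(W'^{(1)}\bigr)
                    + \operatorname{pad}_{3\times3}\bigl(W'^{(0)}\bigr), \qquad
b^{\mathrm{merged}} = b'^{(3)} + b'^{(1)} + b'^{(0)}
```

Both steps rely only on the linearity of convolution, which is why the merged block produces the same outputs as the three-branch block at inference time.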

Abstract

We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet. The code and trained models are available at https://github.com/megvii-model/RepVGG.

Paper Structure

This paper contains 17 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Top-1 accuracy on ImageNet vs. actual speed. Left: lightweight and middleweight RepVGG and baselines trained in 120 epochs. Right: heavyweight models trained in 200 epochs. The speed is tested on the same 1080Ti with a batch size of 128, full precision (fp32), single crop, and measured in examples/second. The input resolution is 300 for EfficientNet-B3 (Tan and Le, 2019) and 224 for the others.
  • Figure 2: Sketch of RepVGG architecture. RepVGG has 5 stages and conducts down-sampling via stride-2 convolution at the beginning of a stage. Here we only show the first 4 layers of a specific stage. As inspired by ResNet (He et al., 2016), we also use identity and $1\times1$ branches, but only for training.
  • Figure 3: Peak memory occupation in residual and plain models. If the residual block maintains the size of the feature map, the peak extra memory occupied by feature maps is $2\times$ the size of the input. The memory occupied by the parameters is small compared to the features and is hence ignored.
  • Figure 4: Structural re-parameterization of a RepVGG block. For ease of visualization, we assume $C_2=C_1=2$, thus the $3\times3$ layer has four $3\times3$ matrices and the kernel of the $1\times1$ layer is a $2\times2$ matrix.
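
The kernel merge described in the Figure 4 caption can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses the caption's setting $C_2=C_1=2$, omits BN entirely for brevity, and relies on two hypothetical helpers (`conv2d`, `merge_branches`) written in pure Python for illustration.

```python
import random


def conv2d(x, w, pad):
    """Naive stride-1 2-D cross-correlation with zero padding.

    x: input as nested lists [C_in][H][W]
    w: kernel as nested lists [C_out][C_in][k][k]
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    k = len(w[0][0])
    out = []
    for o in range(len(w)):
        plane = []
        for i in range(h + 2 * pad - k + 1):
            row = []
            for j in range(wd + 2 * pad - k + 1):
                s = 0.0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            y, xx = i + di - pad, j + dj - pad
                            if 0 <= y < h and 0 <= xx < wd:  # outside = zero padding
                                s += w[o][c][di][dj] * x[c][y][xx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


def merge_branches(w3, w1):
    """Return one 3x3 kernel equivalent to: 3x3 branch + 1x1 branch + identity.

    Assumes no BN and equal input/output channel counts (needed for identity).
    """
    c_out, c_in = len(w3), len(w3[0])
    assert c_out == c_in, "identity branch needs matching channel counts"
    # Deep-copy the 3x3 kernel so the inputs are left untouched.
    merged = [[[r[:] for r in w3[o][c]] for c in range(c_in)] for o in range(c_out)]
    for o in range(c_out):
        for c in range(c_in):
            merged[o][c][1][1] += w1[o][c][0][0]  # 1x1 kernel zero-padded to 3x3
            if o == c:
                merged[o][c][1][1] += 1.0  # identity = 1x1 conv with identity matrix
    return merged


random.seed(0)
C, H, W = 2, 4, 4  # C1 = C2 = 2 as in the Figure 4 caption; H, W are arbitrary


def rnd():
    return random.uniform(-1.0, 1.0)


x = [[[rnd() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w3 = [[[[rnd() for _ in range(3)] for _ in range(3)] for _ in range(C)] for _ in range(C)]
w1 = [[[[rnd()]] for _ in range(C)] for _ in range(C)]

# Training-time view: three parallel branches summed element-wise.
y3 = conv2d(x, w3, pad=1)
y1 = conv2d(x, w1, pad=0)
multi = [[[y3[o][i][j] + y1[o][i][j] + x[o][i][j] for j in range(W)]
          for i in range(H)] for o in range(C)]

# Inference-time view: a single 3x3 conv with the merged kernel.
single = conv2d(x, merge_branches(w3, w1), pad=1)

max_err = max(abs(multi[o][i][j] - single[o][i][j])
              for o in range(C) for i in range(H) for j in range(W))
print("max abs difference:", max_err)
```

The printed difference should be at the level of floating-point rounding, since the merge only exploits the linearity of convolution. With BN present in each branch, each kernel would first have to be folded with its BN statistics before the addition.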