Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

Abstract

This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematic architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4%, but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets, with faster inference speed than vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs. All code and models are publicly available at https://github.com/AILab-CVC/UniRepLKNet to promote further research and development in the community.

Paper Structure

This paper contains 21 sections, 7 equations, 11 figures, 18 tables.

Figures (11)

  • Figure 1: UniRepLKNet models learn universal representations across multiple modalities. In terms of precision and efficiency across the image, audio, point cloud, and time-series modalities, UniRepLKNet scales better in the trade-off between performance and computational cost. Latency is tested on an A100 GPU with a batch size of 128 and full precision (fp32).
  • Figure 2: The Effective Receptive Field (ERF) of ResNet-50/101/152 and the large kernel (K) variants of ResNets, respectively. A more widely distributed dark area indicates a larger ERF. More layers (e.g., from ResNet-101 to ResNet-152) help little in enlarging ERFs. Instead, the large-kernel ConvNets effectively obtain large ERFs.
  • Figure 3: Architectural design of UniRepLKNet. A LarK Block comprises a Dilated Reparam Block proposed in this paper, an SE Block \cite{hu2018squeeze}, an FFN, and Batch Normalization (BN) \cite{ioffe2015batch} layers. The only difference between a SmaK Block and a LarK Block is that the former uses a depth-wise 3$\times$3 conv layer in place of the Dilated Reparam Block in the latter. Stages are connected by down-sampling blocks implemented by stride-2 dense 3$\times$3 conv layers. The blocks may be arranged flexibly in different stages; the details of our provided instances are shown in Table \ref{table-instances}.
  • Figure 4: An example of re-parameterizing a small kernel (e.g., 3$\times$3) in Table \ref{table-mob2-reparam} into a large one (e.g., 7$\times$7). We use structural re-parameterization, following previous practices \cite{ding2019acnet, ding2021repvgg}.
  • Figure 5: Illustration of convolution with a small feature map and a large kernel. Two outputs at adjacent locations share only a part of the kernel weights, so translational equivariance does not strictly hold.
  • ...and 6 more figures
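
The re-parameterization described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a single channel, stride 1, zero padding, and no BN layers (in practice the BN of each branch would be fused into its conv first). The helper names `conv2d_same` and `merge_small_into_large` are hypothetical and are not taken from the UniRepLKNet codebase. The sketch checks that a parallel 7×7 branch and 3×3 branch give the same output as a single 7×7 conv whose kernel is the large kernel plus the zero-padded small kernel.

```python
import random

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:  # positions outside the map count as zero
                        acc += x[r][c] * k[a][b]
            out[i][j] = acc
    return out

def merge_small_into_large(large, small):
    """Zero-pad `small` to the size of `large` (centered) and add element-wise."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for a in range(len(small)):
        for b in range(len(small)):
            merged[a + offset][b + offset] += small[a][b]
    return merged

def rand_matrix(rows, cols, rng):
    # Integer entries keep the comparison below exact (no floating-point error).
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
x = rand_matrix(9, 9, rng)    # toy feature map
k7 = rand_matrix(7, 7, rng)   # large-kernel branch
k3 = rand_matrix(3, 3, rng)   # parallel small-kernel branch

# Training-time structure: two parallel branches whose outputs are summed.
y7, y3 = conv2d_same(x, k7), conv2d_same(x, k3)
two_branch = [[p + q for p, q in zip(r7, r3)] for r7, r3 in zip(y7, y3)]

# Inference-time structure: one conv with the merged kernel.
one_branch = conv2d_same(x, merge_small_into_large(k7, k3))

branches_match = two_branch == one_branch
print(branches_match)
```

The equivalence follows from the linearity of convolution: a centered 3×3 kernel padded with zeros to 7×7 reads the same input positions with the same weights, so the two kernels can be summed before the convolution rather than after it.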