PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution

Honghao Chen, Xiangxiang Chu, Yongjian Ren, Xin Zhao, Kaiqi Huang

TL;DR

This paper proposes a human-like peripheral convolution that reduces the parameter count of dense grid convolution by over 90% through parameter sharing, which makes it possible to scale kernels up to extremely large sizes.

Abstract

Recently, some large-kernel ConvNets have struck back with appealing performance and efficiency. However, given the quadratic complexity of convolution, scaling up kernels introduces an enormous number of parameters, and this proliferation of parameters can cause severe optimization problems. Because of these issues, current CNNs settle for scaling up to 51x51 in the form of stripe convolution (i.e., 51x5 + 5x51) and start to saturate as the kernel size continues to grow. In this paper, we address these issues and explore whether kernels can be scaled up further for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that reduces the parameter count of dense grid convolution by over 90% through parameter sharing, and we manage to scale the kernel size up to extremely large values. Our peripheral convolution behaves much like human vision, reducing the complexity of convolution from O(K^2) to O(log K) without hurting performance. Built on this, we propose the Parameter-efficient Large Kernel Network (PeLK). PeLK outperforms modern vision Transformer and ConvNet architectures such as Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks, including ImageNet classification, semantic segmentation on ADE20K and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements.
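To make the parameter figures in the abstract concrete, the following minimal sketch counts the spatial weights in a single 2-D kernel slice for the dense and stripe forms mentioned above. The function names are hypothetical and the count ignores channels and biases; it is plain arithmetic on the kernel shapes quoted in the abstract, not code from the paper.

```python
def dense_params(k: int) -> int:
    """Spatial weights in one dense k x k kernel slice."""
    return k * k


def stripe_params(k: int, short: int = 5) -> int:
    """Spatial weights in a stripe pair: (k x short) + (short x k)."""
    return 2 * k * short


for k in (51, 101):
    print(f"K={k}: dense={dense_params(k)}, stripe={stripe_params(k)}")

# "Over 90% reduction" means fewer than 10% of the dense weights remain.
# For K=101 that is an upper bound on the surviving weights, not the
# exact count used by the paper.
budget_101 = dense_params(101) // 10
print(f"K=101: fewer than {budget_101 + 1} weights remain after a >90% cut")
```

A dense 51x51 slice holds 2601 weights against 510 for the 51x5 + 5x51 stripe pair, and a dense 101x101 slice holds 10201, which is the growth that parameter sharing is meant to contain.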

Paper Structure

This paper contains 25 sections, 8 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: (a) Illustration of parameter sharing. Using a 3$\times$3 convolution to parameterize a 5$\times$5 convolution, the positions with the same color share the same parameter. The corresponding sharing grid is $[2,1,2]$. (b) Illustration of peripheral convolution. Our sharing grid contains two designs: i) focus and blur mechanism; ii) exponentially-increasing sharing grid.
  • Figure 2: Comparison under different kernel sizes. We plot the mIoU gains on ADE20K and the multiplicative growth in convolutional parameters. Dense grid convolution consistently outperforms stripe convolution, but its parameter count grows rapidly.
  • Figure 3: Illustration of kernel-wise positional embedding. The position embedding enables the kernel to distinguish specific positions within a sharing region, which compensates for the detail-capturing ability that large kernels lose through sharing.
  • Figure 4: Effective receptive field (ERF) comparison. Our PeLK has larger ERFs than SLaK and RepLK, spreading over a wider area.
  • Figure 5: Analysis of FLOPs. (a) FLOPs proportion of head & backbone. (b) FLOPs proportion of the backbone's components. The head is UperNet and the backbone is PeLK-T. FLOPs are computed for an input size of (2048, 512).
  • ...and 2 more figures
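
The parameter sharing described in the Figure 1 caption can be sketched in a few lines of Python. The helper `expand_shared_kernel` is a hypothetical name: it expands a small kernel into a larger one by repeating each row and column according to a sharing grid, reproducing the $[2,1,2]$ case from Figure 1(a). The second helper, `doubling_grid`, is purely illustrative: it assumes a base-2 growth to show what an exponentially-increasing grid might look like, and it is not the grid defined in the paper.

```python
def expand_shared_kernel(small, grid):
    """Expand an n x n kernel to sum(grid) x sum(grid) by parameter sharing.

    grid[i] is how many adjacent positions share row/column i of `small`.
    """
    n = len(grid)
    if len(small) != n or any(len(row) != n for row in small):
        raise ValueError("small kernel must be len(grid) x len(grid)")
    # Map each position of the large kernel to its source index in `small`.
    index = [i for i, span in enumerate(grid) for _ in range(span)]
    return [[small[r][c] for c in index] for r in index]


def doubling_grid(levels):
    """Hypothetical exponential grid (assumed base 2), e.g. [4, 2, 1, 2, 4]."""
    side = [2 ** i for i in range(1, levels)]
    return side[::-1] + [1] + side


small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
large = expand_shared_kernel(small, [2, 1, 2])
for row in large:
    print(row)

grid = doubling_grid(3)
print(grid, "covers", sum(grid), "positions with", len(grid), "entries per axis")
```

In the $[2,1,2]$ case, 9 distinct weights parameterize all 25 positions of the 5$\times$5 kernel. With the assumed doubling grid, the number of grid entries per axis grows roughly logarithmically with the covered width, which is the intuition behind exponentially-increasing sharing.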