ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect

Dachong Li, Li Li, Zhuangzhuang Chen, Jianqiang Li

TL;DR

Shiftwise Convolution is a pure-CNN approach that replaces large-kernel convolutions with small $3\times3$ kernels by decoupling long-range dependencies into granular feature extraction and multi-path fusion. Using a shift-based, multi-edge architecture together with re-parameterization and pruning, the method reports state-of-the-art results on ImageNet, COCO, ADE20K, and nuScenes, often surpassing recent large-kernel CNNs and transformer-based models. Key contributions include showing how to replace large kernels with $3\times3$ convolutions, introducing a plug-and-play SW module, and revealing data-driven sparsity patterns and effective receptive fields (ERFs). The work argues that appropriate granularity and diverse connectivity enable CNNs to match or exceed large-kernel attention, with practical impact for efficient, scalable vision models.

Abstract

Large kernels make standard convolutional neural networks (CNNs) great again over transformer architectures in various vision tasks. Nonetheless, recent studies meticulously designed around increasing kernel size have shown diminishing returns or stagnation in performance. Thus, the hidden factors of large kernel convolution that affect model performance remain unexplored. In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features by multiple pathways. To this end, we leverage the multi-path long-distance sparse dependency relationship to enhance feature utilization via the proposed Shiftwise (SW) convolution operator with a pure CNN architecture. In a wide range of vision tasks such as classification, segmentation, and detection, SW surpasses state-of-the-art transformers and CNN architectures, including SLaK and UniRepLKNet. More importantly, our experiments demonstrate that $3 \times 3$ convolutions can replace large convolutions in existing large kernel CNNs to achieve comparable effects, which may inspire follow-up works. Code and all models are available at https://github.com/lidc54/shift-wiseConv.
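
The claim that small kernels can stand in for a large one rests on a simple algebraic fact: a long kernel can be cut into short chunks, each chunk applied as its own small convolution, and the resulting feature maps shifted and summed. The sketch below illustrates that fact in one dimension with a single channel, using only the Python standard library. It is not the authors' implementation: the function names (`correlate_valid`, `large_kernel_same`, `shiftwise_same`) are hypothetical, the small kernels here are sliced from a given large kernel rather than learned, and batch normalization, re-parameterization, and pruning are all omitted.

```python
import math
import random


def correlate_valid(x, w):
    """'Valid' 1D cross-correlation: out[i] = sum_t w[t] * x[i + t]."""
    n_out = len(x) - len(w) + 1
    return [sum(w[t] * x[i + t] for t in range(len(w))) for i in range(n_out)]


def large_kernel_same(x, w):
    """Reference: zero-padded ('same') correlation with an odd-length kernel."""
    assert len(w) % 2 == 1, "this sketch assumes an odd kernel length"
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return correlate_valid(xp, w)


def shiftwise_same(x, w, small=3):
    """Same result as large_kernel_same, using ceil(M / small) small kernels.

    Illustrative assumption: chunk j of the large kernel covers taps
    [j * small, (j + 1) * small), so its feature map is shifted by j * small.
    """
    assert len(w) % 2 == 1, "this sketch assumes an odd kernel length"
    m = len(w)
    n_small = math.ceil(m / small)
    extra = n_small * small - m            # zero taps so the kernel splits evenly
    w_padded = list(w) + [0.0] * extra
    pad = m // 2
    xp = [0.0] * pad + list(x) + [0.0] * (pad + extra)
    out = [0.0] * len(x)
    for j in range(n_small):
        chunk = w_padded[j * small:(j + 1) * small]
        y = correlate_valid(xp, chunk)     # one small-kernel feature map
        start = j * small                  # the shift for this chunk
        for i in range(len(x)):
            out[i] += y[i + start]         # shift, then accumulate
    return out


if __name__ == "__main__":
    random.seed(0)
    signal = [random.uniform(-1.0, 1.0) for _ in range(64)]
    kernel = [random.uniform(-1.0, 1.0) for _ in range(51)]
    ref = large_kernel_same(signal, kernel)
    sw = shiftwise_same(signal, kernel)
    print("max abs difference:", max(abs(a - b) for a, b in zip(ref, sw)))
```

The two results agree up to floating-point rounding, because the summation order differs between the two paths. In the sketch the equivalence is exact by construction; the paper's point, as the abstract states it, is that learned small kernels combined this way achieve effects comparable to a large kernel.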

Paper Structure (18 sections, 9 figures, 9 tables)

This paper contains 18 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The left panel depicts the retina's cellular structure, which is composed of photoreceptor cells and ganglia, with photoreceptors sending visual signals to ganglia via multiple pathways. The right panel presents our Shiftwise (SW) Convolution, consisting of standard convolution and a connection-centric module. The approach utilizes group convolution and reparameterization (Rep) to extract basic information, which is then processed by a shift algorithm to mimic large kernel convolution.
  • Figure 2: (a) SLaK's large kernel convolution architecture employs a decomposable separable convolution approach using $M \times N$ and $N \times M$ strip convolutions. (b) One-to-many separable convolution with proper feature movement can be equivalent to SLaK's strip convolution. (c) Overview of the proposed shift operation. The addition operation preserves the network's structural integrity even after coarse-grained pruning.
  • Figure 3: (a) Fig. \ref{fig:Replacement} shows that the similar structures of the convolutional branches lead to minimal variation across the data manifold. This implies that the branches can be consolidated. The shift operation introduces more divergence in the data manifold, necessitating the repositioning of the Batch Normalization (BN) layer to follow the shift operation for optimal performance. (b) Reducing network width with a ghost-like approach to counterbalance SLaK's expansion.
  • Figure 4: (a) When the equivalent large kernel convolution size is $51 \times 3$, the coverage regions of the feature maps, along with their utilization proportions, are determined by the areas that can be propagated downward through the shift operation. (b) The utilization ratio of feature maps varies as the number of downward propagation paths increases. $M$ denotes the longer side of the equivalent large kernel convolution, while $w/$ and $w/o$ indicate whether shuffled ordering is applied during the multi-path propagation process. (c) The network structure with increased downward propagation paths.
  • Figure 5: (a) The variation of sparsity with increasing network depth. Colors represent different stages. (b-e) The group convolution of each layer has $\left \lceil \frac{M}{3} \right \rceil$ output channels. The panels show the proportion and distribution of the number of removed channel indexes relative to the total number of groups $nC$. Across the four stages, $n$ is [1, 2, 4, 8].
  • ...and 4 more figures
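
The channel count $\left \lceil \frac{M}{3} \right \rceil$ in the Figure 5 caption can be checked with simple arithmetic. This reading is an interpretation of the captions, not a statement taken from the paper: it assumes each $3 \times 3$ kernel covers three consecutive positions along the longer side $M$ of the equivalent strip kernel. For the $51 \times 3$ case mentioned in the Figure 4 caption, this gives

$$
\left \lceil \frac{M}{3} \right \rceil = \left \lceil \frac{51}{3} \right \rceil = 17, \qquad 17 \times 3 = 51 = M .
$$

Under the same assumption, a value of $M$ that is not a multiple of 3 rounds up: $M = 13$ would need $\left \lceil \frac{13}{3} \right \rceil = 5$ small kernels, covering $15$ positions, two more than required.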