ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
Dachong Li, Li Li, Zhuangzhuang Chen, Jianqiang Li
TL;DR
Shiftwise Convolution presents a pure-CNN approach that replaces large-kernel convolutions with small $3\times3$ kernels by decomposing what large kernels provide, namely long-range dependencies, into feature extraction at an appropriate granularity and multi-path feature fusion. Using a shift-based, multi-edge architecture together with re-parameterization and pruning, the method achieves state-of-the-art results on ImageNet, COCO, ADE20K, and nuScenes, often surpassing recent large-kernel CNNs and transformer-based models. Key contributions include showing how $3\times3$ convolutions can replace large kernels, introducing a plug-and-play SW module, and revealing data-driven sparsity patterns and effective receptive fields (ERFs). The work highlights that appropriate granularity and diverse connectivity enable CNNs to match or exceed large-kernel and attention-based models, with practical impact for efficient, scalable vision models.
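The TL;DR credits part of the method's efficiency to re-parameterization. As a generic illustration only (the 3x3 + 1x1 + identity pattern known from RepVGG, not necessarily the scheme used in this paper), the NumPy sketch below shows why parallel linear convolution branches used during training can be folded into a single kernel for inference: convolution is linear in the kernel, so summing branch outputs equals convolving once with the summed, zero-padded kernels. The helpers `corr2d_same` and `merge_branches` are hypothetical names; the sketch assumes a single channel with no bias or batch norm, and does not model pruning.

```python
import numpy as np

def corr2d_same(x, k):
    """Zero-padded 'same' 2D cross-correlation of one channel x with an odd-sized kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + H, j:j + W]
    return out

def merge_branches(k3, k1, identity=True):
    """Hypothetical helper: fold parallel 3x3, 1x1 and identity branches into one 3x3 kernel."""
    merged = k3.astype(float)      # astype returns a copy, so k3 is left untouched
    merged[1, 1] += k1[0, 0]       # a 1x1 kernel is a 3x3 kernel with only its center set
    if identity:
        merged[1, 1] += 1.0        # the identity branch is a 1x1 kernel with weight 1
    return merged

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3, k1 = rng.standard_normal((3, 3)), rng.standard_normal((1, 1))
train_out = corr2d_same(x, k3) + corr2d_same(x, k1) + x   # three branches at training time
infer_out = corr2d_same(x, merge_branches(k3, k1))        # one kernel at inference time
print(np.allclose(train_out, infer_out))  # True
```

The merge is exact because every branch is linear and uses the same centered alignment; adding a nonlinearity inside any branch would break the equivalence.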
Abstract
Large kernels have made standard convolutional neural networks (CNNs) great again, letting them rival transformer architectures in various vision tasks. Nonetheless, recent studies meticulously designed around increasing kernel size have shown diminishing returns or stagnation in performance. This suggests that kernel size alone does not explain the gains, and the hidden factors of large-kernel convolution that affect model performance remain unexplored. In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features through multiple pathways. To this end, we leverage multi-path, long-distance sparse dependencies to enhance feature utilization via the proposed Shiftwise (SW) convolution operator within a pure CNN architecture. Across a wide range of vision tasks, including classification, segmentation, and detection, SW surpasses state-of-the-art transformer and CNN architectures, including SLaK and UniRepLKNet. More importantly, our experiments demonstrate that $3 \times 3$ convolutions can replace the large convolutions in existing large-kernel CNNs with comparable results, which may inspire follow-up work. Code and all models are available at https://github.com/lidc54/shift-wiseConv.
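The claim that $3\times3$ convolutions can stand in for large kernels has a simple algebraic starting point, sketched below in NumPy. Assuming a single channel and a $k\times k$ kernel with $k$ an odd multiple of 3, the kernel tiles into $(k/3)^2$ blocks of $3\times3$; convolving each block with a window of the zero-padded input shifted to that block's position, then summing, reproduces the large convolution exactly. This is not the authors' SW operator, which per the TL;DR adds multi-path fusion, re-parameterization, and pruning on top; the function names here are hypothetical, and the linked repository holds the actual implementation.

```python
import numpy as np

def corr2d_valid(x, k):
    """'Valid' 2D cross-correlation (what CNN conv layers compute), single channel."""
    kh, kw = k.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((H, W))
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * x[i:i + H, j:j + W]
    return out

def large_kernel_conv(x, K):
    """Reference: k x k convolution with zero 'same' padding."""
    return corr2d_valid(np.pad(x, K.shape[0] // 2), K)

def large_kernel_via_shifted_3x3(x, K):
    """Same result using only 3x3 kernels applied to shifted windows of the input."""
    k = K.shape[0]
    assert K.shape == (k, k) and k % 3 == 0 and k % 2 == 1, "k must be an odd multiple of 3"
    H, W = x.shape
    xp = np.pad(x, k // 2)          # pad once by the large kernel's radius
    out = np.zeros((H, W))
    for a in range(0, k, 3):
        for b in range(0, k, 3):
            # Window offset by (a, b): the shift that aligns this 3x3 tile
            # with its position inside the large kernel.
            window = xp[a:a + H + 2, b:b + W + 2]
            out += corr2d_valid(window, K[a:a + 3, b:b + 3])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 20))
K = rng.standard_normal((9, 9))
print(np.allclose(large_kernel_conv(x, K), large_kernel_via_shifted_3x3(x, K)))  # True
```

For a $9\times9$ kernel this uses $(9/3)^2 = 9$ small convolutions, so the equivalence by itself saves no computation. Per the abstract, the benefit comes from exploiting sparse, multi-path long-distance dependencies, which this sketch does not model.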
