Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness
Honghao Chen, Yurong Zhang, Xiaokun Feng, Xiangxiang Chu, Kaiqi Huang
TL;DR
The paper asks whether convolutional networks (CNNs) with extremely large kernels can match the robustness of Vision Transformers (ViTs). It adopts RepLKNet as the primary large-kernel model and evaluates it on six robustness benchmarks covering natural adversarial examples, corruptions, out-of-domain data, perturbations, semantic shifts, and background dependency, comparing it against ViTs and typical CNNs. Through nine quantitative and qualitative experiments, the study identifies occlusion invariance, kernel attention patterns, and frequency-domain behavior as key factors behind this robustness, and shows that large kernels plus hybrid local-global attention yield stable feature representations and resilience to adversaries and perturbations. The findings suggest that pure CNN architectures with extremely large kernels can rival ViTs in robustness, with practical implications for deploying reliable vision systems and for reviving CNN-based designs in robustness-critical applications. Overall, the work offers empirical evidence and mechanistic insight into the robustness of large-kernel CNNs and highlights their potential to complement or even surpass ViTs in real-world scenarios.
Abstract
Robustness is a vital consideration when deploying deep learning models in the wild. Numerous studies have examined the robustness of vision transformers (ViTs), which have been the mainstream backbone choice for vision tasks since the dawn of the 2020s. Recently, some large kernel convnets have made a comeback with impressive performance and efficiency. However, it remains unclear whether large kernel networks are robust and, if so, what their robustness can be attributed to. In this paper, we first conduct a comprehensive evaluation of the robustness of large kernel convnets, and of how they differ from typical small kernel counterparts and ViTs, on six diverse robustness benchmark datasets. Then, to analyze the factors underlying their strong robustness, we design experiments from both quantitative and qualitative perspectives that reveal intriguing properties of large kernel convnets that are completely different from those of typical convnets. Our experiments demonstrate for the first time that pure CNNs can achieve exceptional robustness comparable or even superior to that of ViTs. Our analysis of occlusion invariance, kernel attention patterns, and frequency characteristics provides novel insights into the source of this robustness.
