Revealing the Dark Secrets of Extremely Large Kernel ConvNets on Robustness

Honghao Chen, Yurong Zhang, Xiaokun Feng, Xiangxiang Chu, Kaiqi Huang

TL;DR

The paper asks whether convolutional networks (CNNs) with extremely large kernels can match the robustness of Vision Transformers (ViTs). It adopts RepLKNet as the primary large-kernel model and evaluates it on six robustness benchmarks covering natural adversarial examples, corruptions, out-of-domain inputs, perturbations, semantic shifts, and background dependency, comparing it against ViTs and typical CNNs. Through nine quantitative and qualitative experiments, the study identifies occlusion invariance, kernel attention patterns, and frequency-domain behavior as key factors driving robustness, and argues that large kernels combined with hybrid local-global attention yield stable feature representations that hold up under adversarial examples and perturbations. The findings suggest that pure CNN architectures with extremely large kernels can rival, and in some cases surpass, ViTs in robustness, which makes CNN-based designs a practical option again for robustness-critical vision systems.

Abstract

Robustness is a vital aspect to consider when deploying deep learning models in the wild. Numerous studies have examined the robustness of vision transformers (ViTs), which have been the mainstream backbone choice for vision tasks since the dawn of the 2020s. Recently, large kernel convnets have made a comeback with impressive performance and efficiency. However, it remains unclear whether large kernel networks are robust and, if so, what their robustness can be attributed to. In this paper, we first conduct a comprehensive evaluation of large kernel convnets' robustness and of how they differ from typical small kernel counterparts and ViTs on six diverse robustness benchmark datasets. Then, to analyze the factors behind their strong robustness, we design experiments from both quantitative and qualitative perspectives that reveal intriguing properties of large kernel convnets that are completely different from those of typical convnets. Our experiments demonstrate for the first time that pure CNNs can achieve exceptional robustness comparable or even superior to that of ViTs. Our analysis of occlusion invariance, kernel attention patterns, and frequency characteristics provides novel insights into the source of this robustness.

Paper Structure (20 sections, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Model configurations. We depict the model size and corresponding ImageNet-1k top-1 accuracy. We choose models with similar accuracy and parameter counts (except for ResNet-50, which serves as a baseline). All the reported variants were pre-trained on ImageNet-21k and then fine-tuned on ImageNet-1k.
  • Figure 2: Comparison on ImageNet-A, ImageNet-C, and ImageNet-O. For ImageNet-A, we report top-1 accuracy; for ImageNet-C, we report mean top-1 accuracy over all 19 corruptions; for ImageNet-O, we report the area under the precision-recall curve (AUPR). Note that higher is better for all metrics. RepLKNet-31B performs on par with or even better than ViT-L on ImageNet-A and ImageNet-O, demonstrating the strong robustness of large kernel convnets.
  • Figure 3: Accuracy comparison on ImageNet-R. We report top-1 accuracy for all the model variants. RepLKNet-31B still outperforms ViT-B/L by large margins.
  • Figure 4: Occlusion robustness comparison under different settings. We use random drop, salient drop, and non-salient drop to evaluate the corresponding occlusion robustness. We report the top-1 accuracy drop under different information loss ratios, ranging from 10% to 90%. RepLKNet models are more robust in extreme occlusion scenarios and, more importantly, they remarkably outperform ViT under salient occlusion.
  • Figure 5: Illustration of patch drop. We depict an example image under the different occlusion types: random, salient, and non-salient. The pixel values in the occluded (black) areas are set to zero.
  • ...and 5 more figures
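
The patch drop described in the captions of Figures 4 and 5 can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the patch grid, the function names `choose_patches` and `drop_patches`, and the per-patch `saliency` scores are all assumptions made here, since this section does not say how patches are sized or how saliency is computed. The sketch only shows the three selection modes and the zeroing of occluded pixels.

```python
import random


def choose_patches(num_patches, ratio, mode="random", saliency=None, seed=0):
    """Return the sorted indices of the patches to occlude.

    ratio    -- fraction of patches to drop (the "information loss ratio")
    mode     -- "random", "salient" or "non_salient"
    saliency -- hypothetical per-patch scores, needed for the last two modes
    """
    if mode not in ("random", "salient", "non_salient"):
        raise ValueError(f"unknown mode: {mode}")
    k = round(num_patches * ratio)
    if mode == "random":
        rng = random.Random(seed)
        return sorted(rng.sample(range(num_patches), k))
    if saliency is None:
        raise ValueError("salient / non_salient drop needs saliency scores")
    # Salient drop removes the highest-scoring patches,
    # non-salient drop removes the lowest-scoring ones.
    order = sorted(range(num_patches), key=lambda i: saliency[i],
                   reverse=(mode == "salient"))
    return sorted(order[:k])


def drop_patches(image, patch, dropped):
    """Return a copy of a 2D image (list of rows) with the given patches zeroed."""
    cols = len(image[0]) // patch  # number of patches per row
    out = [row[:] for row in image]
    for idx in dropped:
        top, left = (idx // cols) * patch, (idx % cols) * patch
        for y in range(top, top + patch):
            for x in range(left, left + patch):
                out[y][x] = 0  # occluded pixels are set to zero
    return out


# Toy example: an 8x8 image split into a 4x4 grid of 2x2 patches.
image = [[1] * 8 for _ in range(8)]
dropped = choose_patches(16, 0.5, mode="random")
occluded = drop_patches(image, 2, dropped)
```

Sweeping `ratio` from 0.1 to 0.9 and measuring accuracy on the occluded images would give curves of the kind reported in Figure 4, assuming a model and an evaluation loop are available.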