CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective

Shunsuke Yasuki, Masato Taki

TL;DR

This work challenges the prevailing view that a large effective receptive field (ERF) is the primary driver of strong downstream performance in large-kernel CNNs. By evaluating ConvNeXt, RepLKNet, and SLaK on weakly supervised object localization (WSOL) with the classic CAM approach, the authors show that improved feature maps and architectural properties underpin localization quality far more than ERF size does. They demonstrate that modern backbones mitigate the known limitations of CAM by activating the whole object rather than a local part, and so achieve strong WSOL scores with plain CAM and simple data augmentation. A key finding is that a PC1-based localization approach can surpass state-of-the-art CNN-based WSOL methods, highlighting the potential of feature-map-driven improvements for localization tasks. The results advocate focusing on architectural design and feature-map quality to advance WSOL and related localization applications.

Abstract

Recently, convolutional neural networks (CNNs) with large kernels have attracted much attention in the computer vision field, following the success of Vision Transformers. Large kernel CNNs have been reported to perform well in downstream vision tasks as well as in classification. The high performance of large kernel CNNs in downstream tasks has been attributed to the large effective receptive field (ERF) produced by large kernels, but this view has not been fully tested. We therefore revisit the performance of large kernel CNNs in downstream tasks, focusing on the weakly supervised object localization (WSOL) task. WSOL, a difficult downstream task that is not fully supervised, provides a new angle from which to explore the capabilities of large kernel CNNs. Our study compares the modern large kernel CNNs ConvNeXt, RepLKNet, and SLaK to test the validity of the naive expectation that ERF size is important for improving downstream task performance. Our analysis of the factors contributing to high performance provides a different perspective, in which the main factor is feature map improvement. Furthermore, we find that modern CNNs are robust to the long-discussed problem in WSOL that class activation mapping (CAM) activates only local regions of objects. CAM is the most classic WSOL method, but because of this problem, it is often used only as a baseline for comparison. However, experiments on the CUB-200-2011 dataset show that simply combining a large kernel CNN, CAM, and simple data augmentation methods can achieve performance (90.99% MaxBoxAcc) comparable to the latest WSOL method, which is CNN-based and requires special training or complex post-processing. The code is available at https://github.com/snskysk/CAM-Back-Again.
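
The abstract reports its headline result as MaxBoxAcc without defining the metric. The following is a minimal sketch of how such a score is commonly computed, assuming the usual definition: for each CAM threshold, count the images whose predicted box has IoU of at least `delta` with the ground-truth box, then take the best accuracy over thresholds. The function names and toy data are hypothetical and not taken from the authors' repository, and the box here is simply the tightest box around all above-threshold pixels, whereas standard evaluation code typically uses the largest connected component.

```python
def tight_box(cam, threshold):
    """Tightest inclusive (x0, y0, x1, y1) box around pixels >= threshold, or None."""
    coords = [(x, y) for y, row in enumerate(cam)
              for x, value in enumerate(row) if value >= threshold]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def box_area(box):
    """Area in pixels of an inclusive (x0, y0, x1, y1) box."""
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def iou(a, b):
    """Intersection over union of two inclusive pixel boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0]) + 1
    height = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(0, width) * max(0, height)
    return inter / (box_area(a) + box_area(b) - inter)


def max_box_acc(cams, gt_boxes, thresholds, delta=0.5):
    """Best box accuracy over CAM thresholds (simplified, assumed definition)."""
    best = 0.0
    for threshold in thresholds:
        hits = 0
        for cam, gt in zip(cams, gt_boxes):
            box = tight_box(cam, threshold)
            if box is not None and iou(box, gt) >= delta:
                hits += 1
        best = max(best, hits / len(cams))
    return best


# Toy CAMs, assumed to be min-max normalized to [0, 1].
cam_a = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.8], [0.1, 0.7, 1.0]]
cam_b = [[1.0, 0.6, 0.0], [0.5, 0.4, 0.0], [0.0, 0.0, 0.0]]
gt_a, gt_b = (1, 1, 2, 2), (0, 0, 1, 1)

score = max_box_acc([cam_a, cam_b], [gt_a, gt_b], thresholds=[0.3, 0.5, 0.9])
```

Because the metric takes the maximum over thresholds, it rewards a CAM that covers the whole object at some threshold, which is why globally activating backbones can score well with plain CAM.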

Paper Structure (28 sections, 2 equations, 20 figures, 4 tables)


Figures (20)

  • Figure 1: WSOL scores (MaxBoxAcc) for CNNs on the CUB-200-2011 dataset. Marker size indicates kernel size. All models pretrained on ImageNet1K are fine-tuned to classify the CUB-200-2011 dataset. For ConvNeXt, RepLKNet, and SLaK, scores are measured for various models that differ in pre-training data, pixel size used for fine-tuning, and number of model parameters.
  • Figure 2: Illustration of the problem that CAMs generated from CNN classifiers tend to activate only the locally discriminative parts of objects. The illustration for $F_i$ shows the mismatch between activation area and weight magnitude: feature maps with smaller activation regions tend to have larger weights and therefore larger contributions to the CAM. The illustration for $F_j$, on the other hand, shows the local activation region problem, which occurs because feature maps with negative weights activate non-discriminative regions within the object region. See \ref{sec:app_cam_pbm} for more information.
  • Figure 3: Examples of CAM generation results for the CUB-200-2011 dataset by ConvNeXt, RepLKNet, and SLaK. Among the latest CNNs, ConvNeXt and RepLKNet tend to globally activate the entire object.
  • Figure 4: Each image group represents $F_{pos}$ and $F_{neg}$ obtained from one input image for each of ConvNeXt, RepLKNet and SLaK. The red-framed heatmap represents the CAM generated from $F_{pos}$ only, the blue-framed heatmap represents the CAM generated from $F_{neg}$ only, and the center heatmap represents the normal CAM.
  • Figure 5: Relationship between the activation areas and weights of feature maps. The activation area is calculated by binarizing the feature map with a threshold of 10 and taking the percentage of pixels that exceed the threshold. Results are shown for the top-scoring WSOL models of ConvNeXt, RepLKNet, and SLaK fine-tuned on the CUB-200-2011 dataset. See \ref{sec:app_fmap_ana} for results on the non-best-scoring models.
  • ...and 15 more figures
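
The captions for Figures 2, 4, and 5 describe three quantities: the CAM itself, the partial CAMs built from $F_{pos}$ and $F_{neg}$, and the activation area of a feature map. The sketch below shows one way these could be computed, assuming the standard CAM formulation (a weighted sum of last-layer feature maps using the classifier weights of the target class) and assuming that $F_{pos}$ and $F_{neg}$ denote the feature maps with positive and negative weights. All function names and the toy values are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
def cam(feature_maps, weights):
    """CAM[y][x] = sum over k of weights[k] * feature_maps[k][y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * fm[y][x] for w, fm in zip(weights, feature_maps))
         for x in range(width)]
        for y in range(height)
    ]


def split_cam(feature_maps, weights):
    """Partial CAMs from positively and negatively weighted maps only.

    Assumed reading of F_pos / F_neg; the two partial CAMs sum to the full CAM.
    """
    pos_weights = [w if w > 0 else 0.0 for w in weights]
    neg_weights = [w if w < 0 else 0.0 for w in weights]
    return cam(feature_maps, pos_weights), cam(feature_maps, neg_weights)


def activation_area(feature_map, threshold=10):
    """Percentage of pixels whose value exceeds the threshold (Figure 5 caption)."""
    pixels = [value for row in feature_map for value in row]
    return 100.0 * sum(value > threshold for value in pixels) / len(pixels)


# Toy example: a small, strongly weighted map and a broad, negatively weighted map.
f_small = [[12, 0], [0, 0]]
f_broad = [[0, 11], [15, 20]]
maps, class_weights = [f_small, f_broad], [2.0, -0.5]

full_cam = cam(maps, class_weights)
pos_cam, neg_cam = split_cam(maps, class_weights)
areas = [activation_area(f) for f in maps]
```

In this toy case the map with the smaller activation area carries the larger weight, so it dominates the full CAM, while the broad map with a negative weight suppresses the rest of the object. This mirrors the two failure modes named in the Figure 2 caption.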