Wavelet Convolutions for Large Receptive Fields

Shahaf E. Finder, Roy Amoyal, Eran Treister, Oren Freifeld

TL;DR

This paper introduces WTConv, a drop-in wavelet-transform–based layer that enables very large receptive fields for CNNs with trainable parameters growing only logarithmically with the target size. By performing convolutions in the wavelet domain on multi-frequency components and cascading levels through the Haar WT, WTConv achieves broad, multi-frequency receptive-field coverage while maintaining parameter efficiency and locality. Empirical results on ImageNet-1K, ADE20K, and COCO demonstrate improvements in classification accuracy, segmentation and detection, along with enhanced robustness to corruptions and a greater shape bias. The method offers a practical alternative to global self-attention, providing a scalable path to large receptive fields in convolutional architectures, with code available at the authors’ repository.

Abstract

In recent years, there have been attempts to increase the kernel size of Convolutional Neural Nets (CNNs) to mimic the global receptive field of Vision Transformers' (ViTs) self-attention blocks. That approach, however, quickly hit an upper bound and saturated way before achieving a global receptive field. In this work, we demonstrate that by leveraging the Wavelet Transform (WT), it is, in fact, possible to obtain very large receptive fields without suffering from over-parameterization, e.g., for a $k \times k$ receptive field, the number of trainable parameters in the proposed method grows only logarithmically with $k$. The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, results in an effective multi-frequency response, and scales gracefully with the size of the receptive field. We demonstrate the effectiveness of the WTConv layer within ConvNeXt and MobileNetV2 architectures for image classification, as well as backbones for downstream tasks, and show it yields additional properties such as robustness to image corruption and an increased response to shapes over textures. Our code is available at https://github.com/BGU-CS-VIL/WTConv.

Paper Structure (30 sections, 10 equations, 8 figures, 14 tables, 1 algorithm)

Figures (8)

  • Figure 1: The Effective Receptive Fields [Luo:NIPS:2016:erf] of ConvNeXt-T [Liu:CVPR:2022:convnext] with different depth-wise convolutions. Evidently, the proposed WTConv achieves the largest field despite using fewer trainable parameters. This improves the convolution's ability to capture low frequencies and thus increases (i.e., improves) its shape bias, among other advantages.
  • Figure 2: Performing convolution in the wavelet domain results in a larger receptive field. In this example, a $3\times3$ convolution is performed on the low-frequency band of the second-level wavelet domain $X^{(2)}_{LL}$, resulting in a 9-parameter convolution that responds to lower frequencies of a $12\times12$ receptive field in the input $X$.
  • Figure 3: An example of the WTConv operation on a single channel taken from the third inverted residual block of MobileNetV2 (see the analysis subsection) using a 2-level wavelet decomposition and $3\times3$ kernel sizes for the convolutions.
  • Figure 4: Shape bias comparison of ConvNeXt-T/S/B and WTConvNeXt-T/S/B over 16 categories. The vertical line is the average across categories.
  • Figure 5: Motion blur corruption example. Each row represents an increased corruption severity. Left to right: the corrupted input, ConvNeXt-S backbone detection, WTConvNeXt-S backbone detection. Note that WTConvNeXt detects the traffic light even at the worst corruption level.
  • ...and 3 more figures
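
The receptive-field arithmetic in the Figure 2 caption can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes an orthonormal Haar low-pass filter and an all-ones $3\times3$ kernel, and the helper names (`haar_ll`, `ll_conv_response`, `receptive_field`) are hypothetical. It perturbs each input pixel in turn and records which ones change a single $3\times3$ response computed on the second-level low-frequency band $X^{(2)}_{LL}$.

```python
def haar_ll(x):
    """One level of the 2D Haar low-pass (LL) band, orthonormal scaling (assumed)."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 2.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def ll_conv_response(x, levels, k, row, col):
    """Apply `levels` Haar LL steps, then sum a k-by-k window (all-ones kernel)
    whose top-left corner sits at (row, col) in the resulting band."""
    band = x
    for _ in range(levels):
        band = haar_ll(band)
    return sum(band[row + i][col + j] for i in range(k) for j in range(k))


def receptive_field(size, levels, k, row, col):
    """Return the set of input pixels whose perturbation changes the response."""
    zero = [[0.0] * size for _ in range(size)]
    base = ll_conv_response(zero, levels, k, row, col)
    touched = set()
    for r in range(size):
        for c in range(size):
            x = [line[:] for line in zero]  # fresh copy of the zero image
            x[r][c] = 1.0                   # unit impulse at (r, c)
            if ll_conv_response(x, levels, k, row, col) != base:
                touched.add((r, c))
    return touched


# A 3x3 kernel on the level-2 LL band of a 16x16 input.
pixels = receptive_field(size=16, levels=2, k=3, row=0, col=0)
rows = {r for r, _ in pixels}
cols = {c for _, c in pixels}
rf_height = max(rows) - min(rows) + 1
rf_width = max(cols) - min(cols) + 1
print(rf_height, rf_width)  # 12 12
```

Each Haar level halves the resolution, so one coefficient at level 2 summarizes a $4\times4$ block of the input, and a 9-parameter $3\times3$ window over such coefficients spans $3 \cdot 2^2 = 12$ input pixels per side, matching the caption.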