From CNNs to Shift-Invariant Twin Models Based on Complex Wavelets

Hubert Leterme, Kévin Polisano, Valérie Perrier, Karteek Alahari

TL;DR

The paper addresses the problem of weak shift invariance in CNNs caused by subsampling. It introduces a method that replaces the first-layer real-valued convolutions and max pooling with complex-valued convolutions and a modulus nonlinearity, selecting Gabor-like first-layer filters through a DT-CWPT-based mathematical twin (WCNN/$\mathbb{C}$WCNN). Key contributions include theoretical support for translation stability of the complex modulus on oriented band-pass filters, a practical twin-network framework with constrained Gabor channels, and extensive ImageNet and CIFAR-10 experiments showing improved accuracy and shift robustness with favorable compute/memory characteristics. The approach provides a non-destructive, efficient route to enhance shift invariance that can be integrated into existing CNNs and potentially extended to vision transformers via fixed embedding layers.

Abstract

We propose a novel method to increase shift invariance and prediction accuracy in convolutional neural networks. Specifically, we replace the first-layer combination "real-valued convolutions + max pooling" (RMax) by "complex-valued convolutions + modulus" (CMod), which is stable to translations, or shifts. To justify our approach, we claim that CMod and RMax produce comparable outputs when the convolution kernel is band-pass and oriented (Gabor-like filter). In this context, CMod can therefore be considered as a stable alternative to RMax. To enforce this property, we constrain the convolution kernels to adopt such a Gabor-like structure. The corresponding architecture is called mathematical twin, because it employs a well-defined mathematical operator to mimic the behavior of the original, freely-trained model. Our approach achieves superior accuracy on ImageNet and CIFAR-10 classification tasks, compared to prior methods based on low-pass filtering. Arguably, our approach's emphasis on retaining high-frequency details contributes to a better balance between shift invariance and information preservation, resulting in improved performance. Furthermore, it has a lower computational cost and memory footprint than concurrent work, making it a promising solution for practical implementation.
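The contrast between RMax and CMod described in the abstract can be illustrated with a minimal 1D sketch. This is not the authors' implementation: the helper names (`gabor_kernel`, `rmax`, `cmod`, `shift_sensitivity`), the kernel parameters, and the stride and pooling sizes are all assumptions chosen for illustration. The complex kernel is a Gaussian-windowed complex exponential, whose real part plays the role of the real-valued Gabor-like filter.

```python
import numpy as np

def gabor_kernel(size=31, sigma=4.0, freq=np.pi / 4):
    """Complex 1D Gabor-like kernel (illustrative parameters, not from the paper)."""
    n = np.arange(size) - size // 2
    envelope = np.exp(-0.5 * (n / sigma) ** 2)
    return envelope * np.exp(1j * freq * n)

def rmax(x, w, stride=2, pool=2):
    """Real-valued convolution, subsampling, then non-overlapping max pooling."""
    y = np.convolve(x, w.real, mode="same")[::stride]
    y = y[: len(y) // pool * pool].reshape(-1, pool)
    return y.max(axis=1)

def cmod(x, w, stride=4):
    """Complex-valued convolution, modulus, then subsampling."""
    return np.abs(np.convolve(x, w, mode="same"))[::stride]

def shift_sensitivity(op, x, w, shift=1):
    """Relative change of the operator output when the input is shifted."""
    a = op(x, w)
    b = op(np.roll(x, shift), w)
    return np.linalg.norm(a - b) / np.linalg.norm(a)

# Toy input: white noise (an assumption; any signal with energy
# in the filter's pass band would do).
rng = np.random.default_rng(0)
x = rng.standard_normal(512)
w = gabor_kernel()

print("RMax sensitivity to a 1-sample shift:", shift_sensitivity(rmax, x, w))
print("CMod sensitivity to a 1-sample shift:", shift_sensitivity(cmod, x, w))
```

Both operators reduce the signal length by the same overall factor here, so the comparison isolates the nonlinearity. The modulus tracks the slowly varying envelope of the band-pass response, whereas max pooling samples an oscillating real signal, which is why the former is expected to change less under a small shift.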

Paper Structure (47 sections, 2 theorems, 54 equations, 9 figures, 4 tables)

This paper contains 47 sections, 2 theorems, 54 equations, 9 figures, 4 tables.

Key Result

Proposition 1

We assume that the Fourier transform of $\widetilde{\mathrm{W}}_l$ is supported in a region of size $\kappa \times \kappa$ which does not contain the origin (Gabor-like filter). If, moreover, $\kappa \leq \frac{2\pi}{m}$, then

Figures (9)

  • Figure 1: Convolution kernels $\mathbf V \in \mathcal{S}^{64 \times 3}$ for the models based on AlexNet and ResNet-34, after training with ImageNet. Each image represents a 3D filter $(\mathrm{V}_{lk})_{k \in \left\{1 \mathinner {\ldotp \ldotp} 3\right\}}$, for any output channel $l \in \left\{1 \mathinner {\ldotp \ldotp} 64\right\}$. For our DT-$\mathbb{C}$WPT-based twin architecture (\ref{subfig:convkernels_awyi}, \ref{subfig:convkernels_rwyi}), the $L_{\mathop{\mathrm{free}}\nolimits} := 32$ or $40$ first kernels are freely-trained, whereas the remaining $L_{\mathop{\mathrm{gabor}}\nolimits} := 32$ or $24$ kernels are constrained to be monochrome, band-pass and oriented. Left: representation in the spatial domain; right: corresponding power spectra.
  • Figure 2: AlexNet-based models: mean KL divergence between the outputs of shifted images. Legend: $^\dagger$blur pooling; $^\ast$$\mathbb{C}$Mod-based approach (ours).
  • Figure 3: Classification accuracy (ten-crops) vs consistency, measuring the stability of predictions to small input shifts, for AlexNet-based models (the lower the better for both axes). For each of the three architectures, we increased the blurring filter size from $1$ (i.e., no blur pooling) to $7$. The blue diamonds (no blur pooling) and red stars (blur pooling with filters of size $3$) correspond to the models for which evaluation metrics have been reported in \ref{table:results_imagenet} (models trained after $90$ epochs).
  • Figure 4: (a), (b): Real and imaginary parts of a Gabor-like convolution kernel $\mathrm{W}_{lk} := \mathrm{V}_{lk} + i\mathcal{H}(\mathrm{V}_{lk})$, forming a 2D Hilbert transform pair. (c), (d): Power spectra (energy of the Fourier transform) of $\mathrm{V}_{lk}$ and $\mathrm{W}_{lk}$, respectively.
  • Figure 5: First layers of AlexNet and its variants, corresponding to a convolution layer followed by ReLU and max pooling \ref{eq:rmaxmodel}. The models are framed according to the same colors and line styles as in \ref{fig:valcurves_shifts} (main paper). The green modules are the ones containing trainable parameters; the orange and purple modules represent static linear and nonlinear operators, respectively. The numbers between each module represent the depth (number of channels), height and width of each output. \ref{subfig:models_alexnet}: freely-trained models. Top: standard AlexNet. Bottom: Zhang's "blurpooled" AlexNet. \ref{subfig:models_wavealexnet}: mathematical twins (WAlexNet) reproducing the behavior of standard (top) and blurpooled (bottom) AlexNet. The left side of each diagram corresponds to the $L_{\mathop{\mathrm{free}}\nolimits} := 32$ freely-trained output channels, whereas the right side displays the $L_{\mathop{\mathrm{gabor}}\nolimits} := 32$ remaining channels, where freely-trained convolutions have been replaced by a wavelet block (WBlock) as described in \ref{sec:appendix_wcnn_genarch}. \ref{subfig:models_cwavealexnet}: $\mathbb{C}$Mod-based WAlexNet, where WBlock has been replaced by $\mathbb{C}$WBlock, and max pooling by a modulus. The bias and ReLU are placed after the modulus, following \ref{eq:cmodmodel}. In the bottom models, we compare Zhang's antialiasing approach (\ref{subfig:models_wavealexnet}) with ours (\ref{subfig:models_cwavealexnet}) in the Gabor channels.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2