On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks

Hubert Leterme; Kévin Polisano; Valérie Perrier; Karteek Alahari

TL;DR

This work analyzes how max pooling in CNNs interacts with first-layer Gabor-like filters to influence shift invariance. By introducing a complex modulus operator $U^{\mathrm{mod}}$ and a real max-pooling operator $U^{\mathrm{max}}$, the authors develop a probabilistic framework and derive bounds that quantify when $U^{\mathrm{max}}$ approximates $U^{\mathrm{mod}}$ under bandwidth and grid-resolution constraints. The theory is extended to multichannel convolutions and validated through a case study based on the deterministic DT-$\mathbb{C}$WPT (dual-tree complex wavelet packet transform), showing that modulus-based representations are nearly translation invariant and can serve as a stable proxy for real max pooling in practice. The results identify the range of filter frequencies over which $\mathbb{R}$Max and $\mathbb{C}$Mod outputs align closely, informing architecture design that preserves high-frequency information while achieving translation stability in early CNN layers.

Abstract

This paper focuses on improving the mathematical interpretability of convolutional neural networks (CNNs) in the context of image classification. Specifically, we tackle the instability issue arising in their first layer, which tends to learn parameters that closely resemble oriented band-pass filters when trained on datasets like ImageNet. Subsampled convolutions with such Gabor-like filters are prone to aliasing, causing sensitivity to small input shifts. In this context, we establish conditions under which the max pooling operator approximates a complex modulus, which is nearly shift invariant. We then derive a measure of shift invariance for subsampled convolutions followed by max pooling. In particular, we highlight the crucial role played by the filter's frequency and orientation in achieving stability. We experimentally validate our theory by considering a deterministic feature extractor based on the dual-tree complex wavelet packet transform, a particular case of discrete Gabor-like decomposition.
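The contrast between the two operators can be illustrated with a minimal 1D sketch. It is not the authors' implementation: the operator conventions (complex convolution subsampled by $2m$ followed by a modulus for $\mathbb{C}$Mod; real-part convolution subsampled by $m$ followed by max pooling over $2q+1$ neighbours with stride 2 for $\mathbb{R}$Max), the Gaussian window, the circular boundary handling, and all function names are assumptions made for illustration. The input is a pure cosine at the filter frequency, chosen so that the max pooling grid hits an unfavourable phase.

```python
import numpy as np

N = 128                       # signal length (circular boundary, assumed)
OMEGA = 2 * np.pi * 16 / N    # signal and filter frequency: pi / 4 rad/sample

def gabor_kernel(n, omega, sigma=4.0, half_width=16):
    """Gaussian window times a complex exponential, stored with wrap-around."""
    t = np.arange(-half_width, half_width + 1)
    kernel = np.zeros(n, dtype=complex)
    kernel[t % n] = np.exp(-0.5 * (t / sigma) ** 2) * np.exp(1j * omega * t)
    return kernel

def circ_conv(x, kernel):
    """Circular convolution computed in the Fourier domain."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel))

def cmod(x, kernel, m):
    """Hypothetical CMod: complex convolution, subsampling by 2m, modulus."""
    return np.abs(circ_conv(x, kernel)[::2 * m])

def rmax(x, kernel, m, q=1):
    """Hypothetical RMax: real-part convolution, subsampling by m, then max
    pooling over 2q+1 neighbours with stride 2 (circular padding)."""
    z = circ_conv(x, kernel.real).real[::m]
    pooled = np.maximum.reduce([np.roll(z, p) for p in range(-q, q + 1)])
    return pooled[::2]

def relative_change(op, x, shift=1, **kwargs):
    """Relative l2 change of op's output when the input is shifted."""
    a, b = op(x, **kwargs), op(np.roll(x, shift), **kwargs)
    return np.linalg.norm(b - a) / np.linalg.norm(a)

x = np.cos(OMEGA * np.arange(N))
kernel = gabor_kernel(N, OMEGA)

cmod_change = relative_change(cmod, x, kernel=kernel, m=2)
rmax_change = relative_change(rmax, x, kernel=kernel, m=2)
print(f"CMod relative change under a 1-sample shift: {cmod_change:.2e}")
print(f"RMax relative change under a 1-sample shift: {rmax_change:.2e}")
```

In this configuration the modulus output barely moves under a one-sample shift, whereas the max pooling output changes substantially, because the pooled samples land on different phases of the oscillating real part. Other frequencies, for which the grid samples the oscillation more evenly, would be expected to narrow the gap.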

Paper Structure (56 sections, 25 theorems, 269 equations, 8 figures, 1 table)

Introduction
Motivations and Main Contributions
Related Work
Wavelet Scattering Networks
Invariance Studies in CNNs
Paper Outline
Shift Invariance of $\mathbb{C}$Mod Outputs
Notations
Continuous Framework
Discrete Framework
Intuition
Continuous Framework
Adaptation to Discrete 2D Sequences
Shift Invariance in the Discrete Framework
From $\mathbb{C}$Mod to $\mathbb{R}$Max
...and 41 more sections

Key Result

Lemma 1

Given $\varepsilon > 0$ and $\boldsymbol{\nu} \in \mathbb{R}^2$, let $\varPsi \in \mathcal{V}\bigl(\boldsymbol{\nu},\, \varepsilon\bigr)$ denote a complex-valued filter as defined in \eqref{eq:gaborfilt_continuous}. Next, for any real-valued function $F \in L^2_{\mathbb{R}}(\mathbb{R}^2)$, we consider [...] Then $F_0$ is low-frequency. Specifically, [...]

Figures (8)

Figure 1: Spatial (left) and Fourier (right) representations of convolution kernels in the first layer of AlexNet, after training with ImageNet ILSVRC 2012-2017 \cite{Russakovsky2015}. Each kernel connects the $3$ RGB input channels to one of the $64$ output channels.
Figure 2: (a), (b): Real and imaginary parts of a Gabor-like filter $\mathrm{W}$ as defined in \eqref{eq:hilberttransform}. (c), (d): Magnitude spectra (modulus of the Fourier transform) of $\mathrm{V}$ and $\mathrm{W}$, respectively.
Figure 3: Search for the maximum value of $\boldsymbol h \mapsto G_\mathrm{X}(\boldsymbol{x},\, \boldsymbol h)$ over a discrete grid of size $3 \times 3$, i.e., $q = 1$. This figure displays $3$ examples with different frequencies $\boldsymbol{\nu} := \boldsymbol{\theta} / s$ and phases $H_\mathrm{X}(\boldsymbol{x})$. Ideally, the result is close to the true maximum (left), but in some pathological cases all points in the grid fall into pits (middle and right).
Figure 4: Top: 2D representation of $\boldsymbol h \mapsto G_\mathrm{X}(\boldsymbol{x}_{\boldsymbol{n}},\, \boldsymbol h)$ \eqref{eq:cos_discrete}, for two different values of $\boldsymbol{\theta} \in \mathbb{R}^2$, $q = 1$ and arbitrary values of $m \in \mathbb{N} \setminus \{0\}$ and $s \in \mathbb{R} \setminus \{0\}$. Assuming the plots are centered around $\boldsymbol h = \boldsymbol 0$, each point marks a location $\boldsymbol h_{\boldsymbol{p}}$ in the max pooling grid, for $\boldsymbol{p} \in \left\{-q,\, \dots,\, q\right\}^2$. The desirable situation occurs when one of these locations falls near a ridge (bright areas), in which case the outputs produced by $\mathbb{R}$Max and $\mathbb{C}$Mod are similar; see \eqref{eq:discrete_left0}. Each number $i \in \left\{0,\, \dots,\, 8\right\}$ represents the rank of $Z_{\boldsymbol{p}} \in \mathbb{S}^1$ \eqref{eq:projunitcircle}, when these values are sorted in ascending order of their arguments \eqref{eq:phaseshift}. If location $\boldsymbol h_{\boldsymbol{p}}$ is ranked $i$, then $Z_{\boldsymbol{p}} = Z_{i}^{(q)}$. Bottom: polar representations of $g_{\mathop{\mathrm{max}}\nolimits}: \mathbb{S}^1 \to \left[-1,\, 1\right]$ \eqref{eq:maxCosUnitcircle}, corresponding to the same settings. The closer the curve is to the outer ring, the more likely some points $\boldsymbol h_{\boldsymbol{p}}$ will fall near a ridge of $G_\mathrm{X}$. (a) Case where the values $Z_{\boldsymbol{p}}$ are roughly evenly distributed on $\mathbb{S}^1$. (b) Case where these values are concentrated in a small portion of the unit circle. The most extreme case occurs when $Z_{\boldsymbol{p}} = 1$ for all $\boldsymbol{p}$. Figure~\ref{fig:discretegrid} (middle and right) depicts two such situations.
Figure 5: $\gamma(m\boldsymbol{\theta})^2$ as a function of the kernel characteristic frequency $\boldsymbol{\theta} \in \left[-\pi,\, \pi\right]^2$. According to \ref{th:scdmoment_normdiff_cmodrmax}, this quantity provides an approximate bound for the expected quadratic error between $\mathbb{R}$Max and $\mathbb{C}$Mod outputs. The subsampling factor $m$ has been set to $2$ as in ResNet (left), and $4$ as in AlexNet (right). The bright regions correspond to frequencies for which the two outputs are expected to be similar. However, in the dark regions, pathological cases such as those illustrated in Figure~\ref{fig:discretegrid} are more likely to occur.
...and 3 more figures
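
The grid-search picture described in the captions of Figures 3 to 5 can be mimicked with a small numerical sketch. It assumes a simplified local model that is not taken from the paper: the filtered signal is treated as a unit-amplitude oscillation, so that the value seen at grid location $\boldsymbol{p} \in \{-q, \dots, q\}^2$ is $\cos(\varphi + m \langle \boldsymbol{\theta}, \boldsymbol{p} \rangle)$ for some phase $\varphi$. The function names and the mean squared gap computed below are illustrative only; the gap is a rough proxy and should not be read as the paper's $\gamma(m\boldsymbol{\theta})^2$.

```python
import numpy as np

def g_max(phases, theta, m=2, q=1):
    """Largest cosine value seen on the (2q+1) x (2q+1) pooling grid,
    under the assumed model cos(phase + m * <theta, p>)."""
    offsets = np.arange(-q, q + 1)
    p1, p2 = np.meshgrid(offsets, offsets, indexing="ij")
    grid_phase = m * (theta[0] * p1 + theta[1] * p2)
    values = np.cos(phases[:, None, None] + grid_phase[None, :, :])
    return values.max(axis=(1, 2))

def mean_squared_gap(theta, m=2, q=1, n_phases=360):
    """Average of (1 - g_max)^2 over uniformly spaced phases.
    Illustrative proxy only, not the paper's gamma."""
    phases = np.linspace(0.0, 2.0 * np.pi, n_phases, endpoint=False)
    return float(np.mean((1.0 - g_max(phases, theta, m, q)) ** 2))

# All grid points share the same phase: the grid cannot escape a pit.
gap_degenerate = mean_squared_gap((0.0, 0.0))
# With m = 2, theta = (pi, 0) gives phase steps of 2*pi: same degeneracy.
gap_aliased = mean_squared_gap((np.pi, 0.0))
# Phase steps of 2*pi/3 spread the grid points evenly on the unit circle.
gap_spread = mean_squared_gap((np.pi / 3, 0.0))

print(gap_degenerate, gap_aliased, gap_spread)
```

Under this toy model, frequencies whose grid phases collapse onto a single point of the unit circle give the largest gap between the pooled maximum and the ideal value of 1, while frequencies that spread the grid phases around the circle keep the gap small, which matches the qualitative reading of the bright and dark regions in Figure 5.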

Theorems & Definitions (77)

Remark 1
Remark 2
Lemma 1
proof
Proposition 1
proof
Lemma 2
proof
Proposition 2
proof
...and 67 more
