Which Frequencies do CNNs Need? Emergent Bottleneck Structure in Feature Learning

Yuxiao Wen, Arthur Jacot

TL;DR

The paper addresses why CNNs naturally learn to operate through a confined bottleneck, arguing that deep CNNs tend to compress inputs into a representation supported on a small set of Fourier frequencies before reconstituting outputs. It introduces the Convolutional Bottleneck Rank $\mathrm{Rank}_{\text{CBN}}$ and the representation costs $R^{(0)}$ and $R^{(1)}$, showing that in the large-depth limit $R(f;\Omega,L) \approx L R^{(0)}(f;\Omega)$ with a finite-depth correction $R^{(1)}(f;\Omega)$ that encodes regularity via the Jacobian. The authors prove upper and lower bounds linking these costs to per-frequency singular values and demonstrate that almost-minimal-norm CNNs exhibit bottlenecks in both weights and activations, supporting the practical use of down-sampling. They extend the theory to CNNs with up- and down-sampling, provide numerical experiments (e.g., MNIST, autoencoders, Newtonian mechanics) that yield interpretable latent frequencies, and discuss limitations and directions for broader applicability. Overall, the work offers a principled explanation for down-sampling and a frequency-based interpretation of learned CNN representations with implications for efficiency and interpretability.

Abstract

We describe the emergence of a Convolution Bottleneck (CBN) structure in CNNs, where the network uses its first few layers to transform the input representation into a representation that is supported only along a few frequencies and channels, before using the last few layers to map back to the outputs. We define the CBN rank, which describes the number and type of frequencies that are kept inside the bottleneck, and partially prove that the parameter norm required to represent a function $f$ scales as depth times the CBN rank of $f$. We also show that the parameter norm depends at next order on the regularity of $f$. We show that any network with almost optimal parameter norm will exhibit a CBN structure in both the weights and - under the assumption that the network is stable under large learning rate - the activations, which motivates the common practice of down-sampling; and we verify that the CBN results still hold with down-sampling. Finally we use the CBN structure to interpret the functions learned by CNNs on a number of tasks.
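To make "supported only along a few frequencies" concrete on the weight side, here is a minimal sketch of how per-frequency singular values of a single convolution layer could be computed. It is not the authors' code. It assumes circular (periodic) padding, stride 1, and a square $n \times n$ input, in which case the layer is block-diagonal in the Fourier basis with one $c_\text{out} \times c_\text{in}$ matrix per frequency. The names `per_frequency_singular_values` and `count_kept_frequencies` are hypothetical helpers introduced for illustration only.

```python
import numpy as np

def per_frequency_singular_values(kernel, n):
    """Singular values of a circular convolution, grouped by 2D frequency.

    kernel: array of shape (c_out, c_in, k, k) with k <= n (assumed layout).
    n: spatial size of the (square) input, with periodic boundary conditions.
    Returns an array of shape (n, n, min(c_out, c_in)).
    """
    # Zero-pad each k x k filter to n x n and take the 2D DFT over space.
    k_hat = np.fft.fft2(kernel, s=(n, n), axes=(2, 3))   # (c_out, c_in, n, n)
    # Put the frequency axes first so each frequency owns a c_out x c_in matrix.
    k_hat = np.transpose(k_hat, (2, 3, 0, 1))            # (n, n, c_out, c_in)
    # np.linalg.svd operates on the last two axes of a stacked array.
    return np.linalg.svd(k_hat, compute_uv=False)

def count_kept_frequencies(sv, tol=1e-6):
    """Number of frequencies whose largest singular value exceeds tol."""
    return int(np.sum(sv.max(axis=-1) > tol))

# A single-channel box filter covering the whole image keeps only the
# constant frequency (0, 0): it averages the input.
n = 8
box = np.full((1, 1, n, n), 1.0 / (n * n))
sv_box = per_frequency_singular_values(box, n)
kept = count_kept_frequencies(sv_box)
```

In this toy case `kept` counts the frequencies that survive the layer; a bottleneck in the weights would show up as a small count relative to the $n^2$ available frequencies.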

Paper Structure

This paper contains 21 sections, 18 theorems, 87 equations, 4 figures.

Key Result

Theorem 3.1

For any translationally equivariant function $f$ with finite CBN rank, there is a constant $c$ that depends only on the target function $f$ s.t.

Figures (4)

  • Figure 1: We train a CNN ($L=11,c_\ell=60,\lambda=0.005,\beta=0.5$) on MNIST. The inputs are $28\times28$, and scaled down by 2 on the 2nd and 4th layers, with global average pooling and a fully connected layer at the end. We see that for classification, six constant frequencies are kept.
  • Figure 2: We train an autoencoder ($L=12,c_\ell=50,\lambda=0.04,\beta=1.0$) on the $0$-digits of MNIST downscaled to the size $13\times 13$. (a) The singular values of $MW_\ell$ for every layer $\ell$, colored by their frequency $\omega$. (b) Along each of the singular values in the $5$-th layer, we plot the effect of multiplying the hidden representation along the singular vector by 2 or 0.5 (for non-constant frequencies we also consider multiplication by the complex numbers $i$ and $-i$). We see how each singular value corresponds to a (nonlinear) direction of variation of the zeros. For non-constant frequencies the argument encodes the $x$ and $y$ position of the digit.
  • Figure 3: CNN ($L=10,c_\ell=60,\lambda=0.0005,\beta=0.25$) trained on images that are made up of random low-frequency shapes multiplied with a high-frequency ($\omega=(5,5)$) pattern. In the bottleneck the network keeps track of the shapes in low frequencies ($\| \omega \|_1 \leq 2$) and the pattern in one $\omega=(5,5)$ frequency. Note that the original images only have signal in high frequencies around $(5,5)$.
  • Figure 4: CNN ($L=9,c_\ell=60,\lambda=0.0001,\beta=0.25$) learns to predict the trajectory of a ball under gravity: the inputs are 4 frames of a ball represented as a dot on a black background, and the outputs are the next four frames. The position appears to be encoded by the phase of the first pair, while the velocity is encoded in the difference between the phases of the two pairs, as confirmed in (b) along the $x$-axis.
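
The captions of Figures 2 and 4 describe position being encoded in the argument (phase) of a non-constant frequency. The sketch below illustrates only the underlying Fourier shift theorem on a toy input, not the network's learned representation: for a single bright dot on an $n \times n$ periodic grid, the phase of the lowest non-zero frequency along each axis is a linear function of the dot's position. The helpers `dot_image`, `position_from_phase`, and `displacement` are hypothetical names chosen for this example.

```python
import numpy as np

def dot_image(n, x, y):
    """n x n image that is zero except for a unit dot at column x, row y."""
    img = np.zeros((n, n))
    img[y % n, x % n] = 1.0
    return img

def position_from_phase(img):
    """Recover the dot position from the phases of two DFT coefficients.

    For a dot at (x, y), the DFT coefficient at frequency 1 along the x axis
    is exp(-2*pi*i*x/n), so its argument determines x modulo n (same for y).
    """
    n = img.shape[0]
    f = np.fft.fft2(img)
    x = (-np.angle(f[0, 1]) * n / (2 * np.pi)) % n   # axis 1 is the x axis
    y = (-np.angle(f[1, 0]) * n / (2 * np.pi)) % n   # axis 0 is the y axis
    return x, y

def displacement(img_a, img_b):
    """Per-axis shift between two frames, read off as a phase difference."""
    n = img_a.shape[0]
    xa, ya = position_from_phase(img_a)
    xb, yb = position_from_phase(img_b)
    return (xb - xa) % n, (yb - ya) % n

frame_0 = dot_image(13, 3, 5)
frame_1 = dot_image(13, 4, 7)      # the dot moved by (+1, +2)
x0, y0 = position_from_phase(frame_0)
dx, dy = displacement(frame_0, frame_1)
```

Here the phase difference between two frames plays the role of a velocity, which mirrors how Figure 4 describes the difference between the phases of the two frequency pairs.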

Theorems & Definitions (38)

  • Remark 2.1
  • Theorem 3.1
  • proof
  • Theorem 3.2
  • Definition 3.3
  • Proposition 3.4
  • Theorem 4.1
  • Theorem 4.2
  • Definition 5.1
  • Remark 5.2
  • ...and 28 more