The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization

Tongtong Liang, Esha Singh, Rahul Parhi, Alexander Cloninger, Yu-Xiang Wang

TL;DR

The paper proves that, provided the receptive field size $m$ remains small relative to the ambient dimension $d$, networks with locality and weight sharing generalize on spherical data at a rate of $n^{-\frac{1}{6} +O(m/d)}$, a regime where fully connected networks provably fail.

Abstract

We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent. Prior work has established that for fully connected networks, the strength of this regularization is governed solely by the global input geometry; consequently, it is insufficient to prevent overfitting on difficult distributions such as the high-dimensional sphere. In this paper, we show that locality and weight sharing fundamentally change this picture. Specifically, we prove that provided the receptive field size $m$ remains small relative to the ambient dimension $d$, these networks generalize on spherical data with a rate of $n^{-\frac{1}{6} +O(m/d)}$, a regime where fully connected networks provably fail. This theoretical result confirms that weight sharing couples the learned filters to the low-dimensional patch manifold, thereby bypassing the high dimensionality of the ambient space. We further corroborate our theory by analyzing the patch geometry of natural images, showing that standard convolutional designs induce patch distributions that are highly amenable to this stability mechanism, thus providing a systematic explanation for the superior generalization of convolutional networks over fully connected baselines.
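As a minimal illustration of the stability threshold behind the edge-of-stability phenomenon (a toy one-dimensional quadratic, not the paper's network setting), gradient descent with step size $\eta$ on $f(x)=\tfrac{1}{2}\lambda x^2$ contracts when $\lambda < 2/\eta$ and diverges when $\lambda > 2/\eta$. The helper name `gd_on_quadratic`, the step count, and the curvature values are assumptions chosen for illustration; $\eta=0.2$ is the step size quoted in the Figure 3 caption.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x   # f'(x) = curvature * x
        x = x - lr * grad      # each step multiplies x by (1 - lr * curvature)
    return abs(x)

lr = 0.2                       # step size; stability threshold is 2 / lr
threshold = 2.0 / lr

# Hypothetical curvatures just below and just above the threshold.
below = gd_on_quadratic(curvature=0.9 * threshold, lr=lr)
above = gd_on_quadratic(curvature=1.1 * threshold, lr=lr)

print(f"threshold 2/lr = {threshold:.1f}")
print(f"|x| after 50 steps, curvature below threshold: {below:.3e}")
print(f"|x| after 50 steps, curvature above threshold: {above:.3e}")
```

The iterate shrinks toward zero in the first case and blows up in the second, which is why a stably trained model can only sit at solutions whose sharpness stays below roughly $2/\eta$.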

Paper Structure

This paper contains 25 sections, 18 theorems, 166 equations, 8 figures, 1 table.

Key Result

Theorem 4.1

Fix $\mathcal{D}=\{({\bm{x}}_i,y_i)\}_{i=1}^n$ and local receptive fields $\mathcal{S}$. For any ${\bm{\theta}}\in{\bm{\Theta}}^{\mathcal{S}}_K$ in the model eq:model_CNN, Specifically, for any ${\bm{\theta}}\in {\bm{\Theta}}_{\mathrm{BEoS}}^\mathcal{S}(\eta,\mathcal{D})$, we have

Figures (8)

  • Figure 1: In the overparameterized regime, model architectures and input data distribution jointly determine the implicit regularization of gradient descent. The local patch representations of the Convolutional NNs provably prevent the curse-of-dimensionality in normalized / whitened distributions that break Feedforward NNs.
  • Figure 2: Generalization-gap scaling in synthetic experiments. (Left) $\widehat{\mathrm{Gen}}(f_{\hat{\theta}},\mathcal{D})$ versus the sample size $n$ on a log-log scale. The fitted slope summarizes the empirical rate: if $\mathrm{GenGap}\lesssim n^{-c}$, then $\log(\mathrm{GenGap})\le -c\log n + b$, so a more negative slope indicates faster decay (better generalization). In our experiments, the FCN slope is nearly flat (slope $=-0.03$ at $d=10$), whereas LCN-WS exhibits increasingly negative slopes as $d$ grows (slope $=-0.34$ at $d=100$, $-0.69$ at $d=200$, and $-0.86$ at $d=400$), indicating faster decay with $n$. (Right) $\widehat{\mathrm{Gen}}(f_{\hat{\theta}},\mathcal{D})$ versus the ambient dimension $d$ with $n$ fixed, illustrating that for LCN-WS (with patch size $m\ll d$) the generalization gap remains stable and can even decrease as $d$ increases.
  • Figure 3: Stable interpolation may still happen. An LCN-WS trained with $\eta=0.2$ interpolates noisy labels while remaining stable, with $\lambda_{\max} \approx \frac{2}{\eta}=10$. (Right) The concentration of neurons with low activation rates indicates that the stable interpolating network exploits the property exhibited by $g_{\mathcal{D},\mathcal{S}}$.
  • Figure 4: Patch geometry vs. image geometry on CIFAR-10. The patch point cloud has significantly lower intrinsic dimension. (Left) PCA explained-variance curves for the patch cloud (dimension $m=27$, but 3 directions account for 90% of the variance) and the ambient image cloud (dimension $d=3072$, where more than 100 directions are needed to reach 90% of the variance). (Right) Half-space concentration curves $\Psi(T)$; a larger area indicates more deep points and fewer opportunities for near-isolating ReLU hyperplanes.
  • Figure 5: Real-data validation on CIFAR-10. We compare FCN and LCN-WS on a nonparametric regression task with noisy labels, using inputs sampled from CIFAR-10. Here LCN-WS denotes a CNN with a $3\times 3$ kernel (stride $1$, no padding). (Left) FCN memorizes noise (train loss $\ll \sigma^2$), whereas LCN-WS plateaus near the noise floor. (Right) LCN-WS achieves decreasing excess risk while FCN fails to learn. This behavior is consistent with Figure 4: the patch point cloud appears more structured than the full image cloud, which can induce stronger implicit regularization.
  • ...and 3 more figures
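
The slope reading described in the Figure 2 caption can be sketched as an ordinary least-squares fit of $\log(\mathrm{GenGap})$ against $\log n$. This is only an illustrative sketch: the data below is synthetic, generated from an assumed noiseless power law $3\,n^{-0.5}$ rather than taken from the paper, and `fit_loglog_slope` is a hypothetical helper name.

```python
import math

def fit_loglog_slope(ns, gaps):
    """Least-squares slope and intercept of log(gap) versus log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Synthetic generalization gaps following gap = 3 * n^(-0.5) (assumed, no noise).
ns = [100, 200, 400, 800, 1600]
gaps = [3.0 * n ** -0.5 for n in ns]

slope, intercept = fit_loglog_slope(ns, gaps)
print(f"fitted slope = {slope:.3f}, intercept = {intercept:.3f}")
```

On noiseless power-law data the fitted slope recovers the exponent $-c$ and the intercept recovers $\log$ of the constant; on measured gaps the same fit gives the empirical rate, with a more negative slope indicating faster decay in $n$.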

Theorems & Definitions (38)

  • Definition 3.1: Below Edge of Stability (BEoS) [qiao2024stable]
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3: Stable interpolation with width $\le n$
  • Remark 4.4
  • Definition 1.1
  • Definition 1.2
  • Remark 1.3: "Arbitrary width" $\neq$ "infinite width"
  • Definition 1.4: Covering Number and Entropy
  • Proposition 1.5 [parhi2023near]
  • ...and 28 more