Theoretical Analysis of Inductive Biases in Deep Convolutional Networks

Zihao Wang; Lei Wu

TL;DR

This work gives a theoretical account of the inductive biases of CNNs. It proves that deep CNNs can universally approximate continuous functions with depth $O(\log d)$, thanks to a synergy between multichanneling and downsampling, and that downsampling is essential for this efficiency. It further shows that CNNs can efficiently learn long-range sparse functions with near-optimal sample complexity $\widetilde{O}(\log^2 d)$, aided by Barron regularity and adaptive feature construction. By disentangling weight sharing and locality, the authors establish provable separations: CNNs outperform LCNs by exploiting weight sharing (lower sample complexity), and LCNs outperform FCNs by leveraging locality (lower parameter and sample complexity). The results rest on a group-equivariance framework and a Fano-based, random-estimator minimax analysis, which together clarify how architectural biases and learning dynamics interact to shape learnability and generalization. The findings give theoretical justification for the empirical superiority of CNNs over FCNs in vision-like tasks and offer a principled lens for designing architectures that balance depth, downsampling, and connectivity patterns.

Abstract

In this paper, we provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs). We start by examining the universality of CNNs, i.e., the ability to approximate any continuous function. We prove that a depth of $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve this universality, where $d$ is the input dimension. Additionally, we establish that learning sparse functions with CNNs requires only $\widetilde{\mathcal{O}}(\log^2 d)$ samples, indicating that deep CNNs can efficiently capture long-range sparse correlations. These results are made possible through a novel combination of multichanneling and downsampling when increasing the network depth. We also delve into the distinct roles of weight sharing and locality in CNNs. To this end, we compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task, where LCNs can be viewed as CNNs without weight sharing. On the one hand, we prove that LCNs require $\Omega(d)$ samples while CNNs need only $\widetilde{\mathcal{O}}(\log^2 d)$ samples, highlighting the critical role of weight sharing. On the other hand, we prove that FCNs require $\Omega(d^2)$ samples, whereas LCNs need only $\widetilde{\mathcal{O}}(d)$ samples, underscoring the importance of locality. These provable separations quantify the difference between the two biases, and the major observation behind our proof is that weight sharing and locality break different symmetries in the learning process.
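
To get a feel for how far apart the three rates in the abstract are, the sketch below tabulates their leading-order terms for a given input dimension $d$. It is only an illustration: the function name `sample_scalings` is hypothetical, constants and hidden logarithmic factors are dropped, and the base-2 logarithm is an assumption (the asymptotic notation does not fix a base). The CNN entry reflects an upper bound, while the LCN and FCN entries reflect lower bounds.

```python
import math


def sample_scalings(d):
    """Leading-order sample-complexity terms from the abstract.

    Assumptions (not from the paper): constants and hidden log factors
    are ignored, and log is taken base 2.
    """
    if d < 2:
        raise ValueError("d must be at least 2")
    return {
        "CNN ~ log^2 d": math.log2(d) ** 2,  # upper bound, up to log factors
        "LCN ~ d": float(d),                 # lower bound Omega(d)
        "FCN ~ d^2": float(d) ** 2,          # lower bound Omega(d^2)
    }


if __name__ == "__main__":
    for d in (64, 1024, 16384):
        print(d, sample_scalings(d))
```

These numbers are not sample counts; they only show how quickly the gaps between the three architectures widen as $d$ grows.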

Paper Structure (52 sections, 49 theorems, 245 equations, 2 figures)

Introduction
Our Results
Disentangling the weight sharing and locality.
Notations
Preliminaries
Network Architectures
FCNs.
Universal Approximation
Efficient Learning of Sparse Functions
Disentangle the Inductive Biases of Weight Sharing and Locality
Learning Algorithm, Group Equivariance, and Lower Bounds
Sample Complexity.
CNNs vs. LCNs
LCNs vs. FCNs
Conclusion
...and 37 more sections

Key Result

Theorem 3.1

Consider CNNs with all activation functions to be $\operatorname{ReLU}$. Suppose $L=\log_2(4d)$ and $C_l=2^{l+1}$ for $l\in [L-1]$ to be fixed and allow the number of channels of the last layer $C_L$ to increase. Then, the CNNs are universal: for any $\epsilon>0$, any compact set $\Omega\subset \mat
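
The depth and channel schedule named in the theorem, $L=\log_2(4d)$ and $C_l=2^{l+1}$, can be tabulated for a concrete $d$. The following is a minimal sketch under two assumptions that are not stated in the excerpt: $d$ is a power of two, so that $L$ is an integer, and $[L-1]$ denotes $\{1,\dots,L-1\}$. The helper name `theorem_3_1_schedule` is hypothetical, and the last-layer width $C_L$ is deliberately left out because the theorem lets it grow.

```python
def theorem_3_1_schedule(d):
    """Return (L, {l: C_l}) for L = log2(4d) and C_l = 2**(l + 1).

    Assumptions (not from the paper excerpt): d is a power of two and
    [L-1] means the index set {1, ..., L-1}. C_L is omitted because the
    theorem allows it to increase freely.
    """
    if d < 1 or d & (d - 1):
        raise ValueError("this sketch assumes d is a power of two")
    depth = (4 * d).bit_length() - 1  # exact log2(4d) for powers of two
    channels = {l: 2 ** (l + 1) for l in range(1, depth)}
    return depth, channels


if __name__ == "__main__":
    depth, channels = theorem_3_1_schedule(8)
    print("L =", depth)
    for l, c in channels.items():
        print(f"layer {l}: C_l = {c}")
```

Under these assumptions the widest fixed layer has $C_{L-1}=2^{L}=4d$ channels, so the fixed part of the architecture grows linearly in width while the depth grows only logarithmically in $d$.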

Figures (2)

Figure 1: An illustration of how CNNs select coordinates adaptively. In this case $d=8$ and $L=3$. The nonzero coordinate is $i=4$, for which $a_0=1$, $a_1=1$, $a_2=0$. The values on the edges represent the weights, which are set according to the proof of Lemma 4.2.
Figure 2: CNNs can learn sparse functions efficiently. Both short-range (left) and long-range (right) sparse target functions are considered in this experiment. Training is stopped when the training loss drops below $10^{-5}$.
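
The coordinate selection described in the Figure 1 caption can be reproduced with a small linear network. The sketch below is one construction consistent with the caption, not necessarily the one used in the proof of the lemma it cites: it assumes single-channel layers with kernel size 2 and stride 2, no bias, and $d$ a power of two. The digits $a_k$ are read as the binary expansion of $i-1$, which matches the caption's example ($i=4$ gives $a_0=1, a_1=1, a_2=0$). All function names are hypothetical.

```python
def binary_digits(index, depth):
    """Digits a_0, ..., a_{depth-1} with index = sum_k a_k * 2**k."""
    return [(index >> k) & 1 for k in range(depth)]


def conv_stride2(x, kernel):
    """Single-channel convolution with kernel size 2, stride 2, no bias."""
    w0, w1 = kernel
    return [w0 * x[2 * j] + w1 * x[2 * j + 1] for j in range(len(x) // 2)]


def select_coordinate(x, i):
    """Return x_i (1-indexed) using log2(d) shared-weight linear layers.

    Assumption: at layer k the shared filter is (1, 0) if a_k = 0 and
    (0, 1) if a_k = 1, so each layer keeps the left or right element of
    every pair and halves the signal length.
    """
    d = len(x)
    depth = d.bit_length() - 1
    if d != 1 << depth:
        raise ValueError("this sketch assumes len(x) is a power of two")
    if not 1 <= i <= d:
        raise ValueError("i must lie in 1..d")
    h = list(x)
    for a in binary_digits(i - 1, depth):
        kernel = (0.0, 1.0) if a else (1.0, 0.0)
        h = conv_stride2(h, kernel)
    return h[0]


if __name__ == "__main__":
    x = [10, 11, 12, 13, 14, 15, 16, 17]  # d = 8, so L = 3 layers
    print(binary_digits(4 - 1, 3))
    print(select_coordinate(x, 4))
```

Each layer uses two weights regardless of $d$, so any single coordinate, however far from the start of the input, is reachable with $2\log_2 d$ parameters in this sketch. That is the sense in which depth combined with downsampling captures long-range dependence cheaply.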

Theorems & Definitions (82)

Theorem 3.1: Universality
Proposition 3.2
Proposition 3.3
Definition 4.1: Sparse function
Lemma 4.2: Adaptive coordinate selection for Linear CNNs
proof
Remark 4.3
Lemma 4.4
Definition 4.5: Barron space
Theorem 4.6
...and 72 more
