Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs

Aakash Lahoti; Stefani Karp; Ezra Winston; Aarti Singh; Yuanzhi Li

TL;DR

This work studies how locality and weight sharing affect sample efficiency in image-like tasks. It introduces the Dynamic Signal Distribution (DSD) task, which models an image as $k$ patches of dimension $d$, with the label-determining signal placed in one patch whose position is unknown. For orthogonally equivariant algorithms such as gradient descent, the authors prove that CNNs need $\tilde{O}(k+d)$ samples, whereas LCNs need $\Omega(kd)$ and FCNs need $\Omega(k^2 d)$; an LCN upper bound of $\tilde{O}(k(k+d))$ completes the separation from FCNs. The lower bounds come from a version of Fano's theorem adapted to randomized algorithms, and the upper bounds from equivariant gradient-based methods. The results quantify the statistical advantages of weight sharing and locality on a task designed to mirror real vision tasks, and the authors name multiple signals and deeper CNN architectures as future work.
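To get a feel for the size of the gaps, the leading-order terms of the stated bounds can be evaluated for concrete $(k, d)$. The sketch below is only illustrative: it drops all constants and logarithmic factors, and the function name and dictionary keys are hypothetical, not taken from the paper.

```python
def dsd_bound_scalings(k: int, d: int) -> dict:
    """Leading-order terms of the sample-complexity bounds quoted in the summary.

    Constants and log factors are ignored, so the values are only
    meaningful relative to one another.
    """
    return {
        "cnn_upper": k + d,        # CNN upper bound, O~(k + d)
        "lcn_lower": k * d,        # LCN lower bound, Omega(k d)
        "lcn_upper": k * (k + d),  # LCN upper bound, O~(k (k + d))
        "fcn_lower": k * k * d,    # FCN lower bound, Omega(k^2 d)
    }


if __name__ == "__main__":
    # Example: 16 patches of dimension 64.
    for name, value in dsd_bound_scalings(16, 64).items():
        print(f"{name}: {value}")
```

For $k = 16$ and $d = 64$ the terms are 80, 1024, 1280, and 16384 respectively, so each architectural bias removes roughly an order of magnitude at this scale, up to the ignored factors.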

Abstract

Vision tasks are characterized by the properties of locality and translation invariance. The superior performance of convolutional neural networks (CNNs) on these tasks is widely attributed to the inductive bias of locality and weight sharing baked into their architecture. Existing attempts to quantify the statistical benefits of these biases in CNNs over locally connected convolutional neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of the following categories: either they disregard the optimizer and only provide uniform convergence upper bounds with no separating lower bounds, or they consider simplistic tasks that do not truly mirror the locality and translation invariance as found in real-world vision tasks. To address these deficiencies, we introduce the Dynamic Signal Distribution (DSD) classification task that models an image as consisting of $k$ patches, each of dimension $d$, and the label is determined by a $d$-sparse signal vector that can freely appear in any one of the $k$ patches. On this task, for any orthogonally equivariant algorithm like gradient descent, we prove that CNNs require $\tilde{O}(k+d)$ samples, whereas LCNs require $\Omega(kd)$ samples, establishing the statistical advantages of weight sharing in translation-invariant tasks. Furthermore, LCNs need $\tilde{O}(k(k+d))$ samples, compared to $\Omega(k^2 d)$ samples for FCNs, showcasing the benefits of locality in local tasks. Additionally, we develop information-theoretic tools for analyzing randomized algorithms, which may be of interest for statistical research.
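The DSD task described above can be made concrete with a small sampler. This is a minimal sketch under assumptions the abstract does not specify: the label is a uniform sign, the signal is multiplied by the label before being placed in a uniformly chosen patch, and every coordinate receives independent Gaussian noise. The names `sample_dsd`, `signal`, and `noise_std` are hypothetical.

```python
import random


def sample_dsd(signal, k, noise_std=0.1, rng=random):
    """Draw one (image, label, patch_index) triple from a DSD-like distribution.

    signal    : list of d floats, the class-determining pattern (assumed fixed)
    k         : number of patches in the image
    noise_std : std of the additive Gaussian noise (assumed noise model)
    rng       : the `random` module or a `random.Random` instance
    """
    d = len(signal)
    y = rng.choice([-1, 1])   # label: uniform sign (assumption)
    t = rng.randrange(k)      # patch that carries the signal
    # Start from pure noise in every patch ...
    x = [[rng.gauss(0.0, noise_std) for _ in range(d)] for _ in range(k)]
    # ... then add the label-signed signal to patch t only.
    x[t] = [noise + y * s for noise, s in zip(x[t], signal)]
    return x, y, t


if __name__ == "__main__":
    image, label, patch = sample_dsd([1.0, -2.0, 0.5], k=4,
                                     rng=random.Random(0))
    print(label, patch, image[patch])
```

Flattened to a single vector of length $kd$, the noiseless part of the image is nonzero in at most $d$ coordinates, which is one way to read the "$d$-sparse signal vector" in the abstract. A learner sees only `x` and `y`; the patch index `t` is returned here purely for inspection.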


Paper Structure (24 sections, 14 theorems, 166 equations, 4 figures, 1 algorithm)

Introduction
Other Related Works
Notation
Our Setting
Dynamic Signal Distribution (DSD)
Neural Network Architectures
Mathematical background
Technical Definitions
Equivariant Algorithms
Minimax Framework
FCNs vs LCNs Separation Results
LCNs vs CNNs Separation Results
Conclusion And Future Work
Restated Gilbert Varshamov Bound
Proof of Theorem 5.1 (Fano's Theorem for Randomized Algorithms)
...and 9 more sections

Key Result

Lemma 5.1

(Section 4.1 zhiyuan) If $\bar{\theta}_n$ is a ${\mathcal{U}}$-equivariant algorithm, then $\forall {\bm{x}} \in {\mathcal{X}}, {\mathbf{U}} \in {\mathcal{U}}$, where the randomness is over initialization.

Figures (4)

Figure 1: Images from the Cats Dataset [cats]. The cat, which is the class-determining signal, appears at different positions across images, illustrating the translation property amid background noise.
Figure 1: Test error incurred by CNNs, LCNs and FCNs for various values of $(k,d)$
Figure 2: Sample complexity for CNNs (left) and LCNs (right) across various values of $k$
Figure 3: Sample complexity for CNNs (left) and LCNs (right) across various values of $d$

Theorems & Definitions (38)

Definition 1: Loss Function
Definition 2: Risk
Definition 3: Algorithm
Definition 4: Iterative (Randomized) Algorithm
Definition 5: Sample Complexity
Definition 6: ${\mathcal{U}}$-equivariant algorithm
Lemma 5.1
Definition 7: Minimax Risk
Theorem 5.1: Fano's Theorem for Randomized Algorithms
Remark 1
...and 28 more
