On the Sample Complexity of Two-Layer Networks: Lipschitz vs. Element-Wise Lipschitz Activation

Amit Daniely; Elad Granot

TL;DR

The paper analyzes the sample complexity of norm-bounded two-layer networks with Lipschitz activations. When the activation is element-wise, the sample complexity depends only logarithmically on the width, and this bound is tight up to logarithmic factors. The element-wise property is essential: there exist non-element-wise Lipschitz activations whose sample complexity is linear in the width, for widths up to exponential in the input dimension. The upper bound, which controls the Frobenius distance of the hidden-layer weights from their initialization, is proved with the Approximate Description Length (ADL) framework and a new chaining-enhanced approach; the new techniques and tools developed for ADL are offered for future analyses. Overall, the work clarifies when width is not a bottleneck and shows that element-wise activations are what make this favorable scaling possible in two-layer networks.

Abstract

We investigate the sample complexity of bounded two-layer neural networks using different activation functions. In particular, we consider the class $$ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, \sigma\circ W\textbf{x} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^d, W \in \mathbb{R}^{\mathcal{T}\times d}, \textbf{v} \in \mathbb{R}^{\mathcal{T}}\right\} $$ where the spectral norms of $W$ and $\textbf{v}$ are bounded by $O(1)$, the Frobenius distance of $W$ from its initialization is bounded by $R > 0$, and $\sigma$ is a Lipschitz activation function. We prove that if $\sigma$ is element-wise, then the sample complexity of $\mathcal{H}$ depends only logarithmically on the width, and that this bound is tight up to logarithmic factors. We further show that the element-wise property of $\sigma$ is essential for a bound that is logarithmic in the width, in the sense that there exist non-element-wise activation functions whose sample complexity is linear in the width, for widths that can be up to exponential in the input dimension. For the upper bound, we use the recent approach to norm-based bounds named Approximate Description Length (ADL), introduced in arXiv:1910.05697. We further develop new techniques and tools for this approach that will hopefully inspire future work.
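As a rough illustration of the class described in the abstract, the following NumPy sketch evaluates a two-layer network and checks the three norm constraints. It is not code from the paper. The function names, the constant `c` standing in for the $O(1)$ bound, and the choice to give the bias the hidden-layer dimension (so that the sum inside the activation type-checks) are all assumptions made for this example. The unit-ball projection is included only as a familiar Lipschitz map that is not element-wise; it is not claimed to be the activation used in the paper's lower bound.

```python
import numpy as np


def two_layer(x, W, b, v, sigma):
    """Evaluate h(x) = <v, sigma(W x + b)> for a single input vector x."""
    return float(v @ sigma(W @ x + b))


def in_class(W, v, W0, R, c=1.0):
    """Check the norm constraints from the abstract (c is a hypothetical O(1) constant).

    - spectral norm of W at most c
    - Euclidean norm of v at most c
    - Frobenius distance of W from its initialization W0 at most R
    """
    spectral = np.linalg.norm(W, 2)           # largest singular value of W
    frob_dist = np.linalg.norm(W - W0, "fro")
    return bool(spectral <= c and np.linalg.norm(v) <= c and frob_dist <= R)


def relu(z):
    """Element-wise 1-Lipschitz activation: output i depends only on input i."""
    return np.maximum(z, 0.0)


def unit_ball_projection(z):
    """1-Lipschitz but NOT element-wise: every output depends on the whole vector."""
    n = np.linalg.norm(z)
    return z if n <= 1.0 else z / n
```

Passing `relu` or `unit_ball_projection` as `sigma` to `two_layer` switches between the two regimes the abstract contrasts. The first coordinate of `unit_ball_projection` changes when only the second input coordinate changes, which is exactly what an element-wise activation rules out.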

Paper Structure (10 sections, 13 theorems, 24 equations)

Introduction
Preliminaries
Notations
The Two-Layer Model
Approximate Description Length
Strong Shattering
Results and Contributions
Proof of Theorem \ref{thm:main_upper}
Proof of Theorem \ref{thm:main_lower}
Discussion and Open Questions

Key Result

Theorem 1

Fix a class $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$ with ADL $n(m)$ and a label space $\mathcal{Y}$. Fix an $L$-Lipschitz and $B$-bounded loss function $\ell:\mathbb{R}\times \mathcal{Y}\to [0,\infty)$. Then, for any distribution $\mathcal{D}$ over $\mathcal{X}\times \mathcal{Y}$, where $\ell_\mathcal{D}(h)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\ell(h(x),y)$ and $\ell_S(h)=\frac{1}{m}\ldots$

Theorems & Definitions (15)

Theorem 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Corollary 7
Definition 8
Theorem 9
Theorem 10
...and 5 more
