Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks

Wenquan Ma; Yang Sui; Jiaye Teng; Bohan Wang; Jing Xu; Jingqin Yang

TL;DR

We derive generalization bounds for SGD on homogeneous neural networks and prove that this regime permits a slower stepsize decay of order $\Omega(1/\sqrt{t})$ under mild assumptions.

Abstract

Algorithmic stability is among the most potent techniques in generalization analysis. However, its derivation usually requires a stepsize $\eta_t = \mathcal{O}(1/t)$ under non-convex training regimes, where $t$ denotes iterations. This rigid decay of the stepsize potentially impedes optimization and may not align with practical scenarios. In this paper, we derive the generalization bounds under the homogeneous neural network regimes, proving that this regime enables slower stepsize decay of order $\Omega(1/\sqrt{t})$ under mild assumptions. We further extend the theoretical results from several aspects, e.g., non-Lipschitz regimes. This finding is broadly applicable, as homogeneous neural networks encompass fully-connected and convolutional neural networks with ReLU and LeakyReLU activations.
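To make the contrast between the two stepsize regimes concrete, the following minimal sketch compares a $1/t$ schedule with a $1/\sqrt{t}$ schedule. The function names, the initial stepsize `eta0`, and the horizon of 10,000 steps are illustrative assumptions, not values from the paper. The cumulative stepsize is used here only as a rough proxy for how far the iterates can move: it grows like $\log T$ under $1/t$ decay but like $2\sqrt{T}$ under $1/\sqrt{t}$ decay.

```python
import math


def stepsize_inverse_t(t, eta0=0.1):
    """Hypothetical schedule eta_t = eta0 / t (the O(1/t) regime)."""
    return eta0 / t


def stepsize_inverse_sqrt_t(t, eta0=0.1):
    """Hypothetical schedule eta_t = eta0 / sqrt(t) (the slower decay)."""
    return eta0 / math.sqrt(t)


def cumulative_stepsize(schedule, num_steps):
    """Sum of the stepsizes over iterations t = 1, ..., num_steps."""
    return sum(schedule(t) for t in range(1, num_steps + 1))


# Illustrative horizon; not taken from the paper's experiments.
NUM_STEPS = 10_000
total_inverse_t = cumulative_stepsize(stepsize_inverse_t, NUM_STEPS)
total_inverse_sqrt_t = cumulative_stepsize(stepsize_inverse_sqrt_t, NUM_STEPS)

print(f"sum of eta0/t       over {NUM_STEPS} steps: {total_inverse_t:.3f}")
print(f"sum of eta0/sqrt(t) over {NUM_STEPS} steps: {total_inverse_sqrt_t:.3f}")
```

Under these assumptions the $1/\sqrt{t}$ schedule accumulates a far larger total stepsize, which is one informal way to read the abstract's remark that a rigid $\mathcal{O}(1/t)$ decay may impede optimization.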

Paper Structure (31 sections, 22 theorems, 111 equations, 4 figures, 2 tables)

Introduction
Our Results in More Detail.
Related Works
Homogeneity
Generalization Under Homogeneity
Assumptions
Generalization Under Homogeneity
Generalization Under Three-Homogeneity
Additional Discussions on Optimization Performance
Generalization Beyond Lipschitz
Stability and Generalization
Generalization Beyond Lipschitz
Conclusion
Additional Discussions
Additional Related Work
...and 16 more sections

Key Result

Proposition 1

Consider two types of layers: For convenience, we omit the bias when the context is clear. The $h$-layer neural network $\Phi_H({\boldsymbol{w}}; {X})$ with $h-H$ normalized layers $\mathcal{N}_{\boldsymbol{w}}$ and $H$ unnormalized layers $\mathcal{U}_{\boldsymbol{w}}$ is $H$-homogeneous, where

Figures (4)

Figure 1: The training and test accuracy curves for SGD with various stepsize schedulers on CIFAR-10 using ResNet-18. Models trained with stepsize $\Theta(1/\sqrt{t})$ (dotted line) converge faster and perform better than those trained with stepsize $\Theta(1/t)$ (solid line).
Figure 2: (left) Training accuracy of SGD with different schedulers over a three-layer ReLU network. The initial learning rate is selected via grid search over $\{10^{-5}, 10^{-4},\cdots, 10^1\}$. SGD with stepsize $\Theta(1/t\Vert {\boldsymbol{w}}_t \Vert )$ achieves training accuracy similar to that of SGD with stepsize $\Theta(1/\sqrt{t})$, and both outperform SGD with stepsize $\Theta(1/t)$; (right) When $T_a = o(n/\tilde{\gamma}^{1/m_1})$, the training and test loss curves do not decrease for $t \in (T_a, o(n/\tilde{\gamma}^{1/m_1}))$.
Figure 3: An instance of construction for $H=3$. Following the rules in Proposition 1, normalized layers (left) accommodate arbitrary activations and connections with $0$-homogeneity, while the subsequent unnormalized layers (right) compose a $3$-layer MLP to achieve the desired three-homogeneity.
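The $H$-homogeneity in Proposition 1, illustrated for $H=3$ in Figure 3, can be checked numerically on a small example. The sketch below assumes the standard definition of homogeneity, $\Phi(c\boldsymbol{w}; X) = c^{H}\,\Phi(\boldsymbol{w}; X)$ for $c > 0$, which this summary does not spell out. It uses a bias-free three-layer ReLU MLP as a stand-in for the unnormalized layers only; the normalized layers are omitted, and the layer sizes, seed, and helper names are hypothetical.

```python
import random


def relu(vector):
    """Elementwise ReLU."""
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mlp(weights, x):
    """Bias-free ReLU MLP; no activation after the last layer."""
    hidden = x
    for matrix in weights[:-1]:
        hidden = relu(matvec(matrix, hidden))
    return matvec(weights[-1], hidden)


def scale(weights, c):
    """Multiply every weight in every layer by the scalar c."""
    return [[[c * w for w in row] for row in matrix] for matrix in weights]


def random_weights(sizes, rng):
    """One weight matrix per consecutive pair of layer sizes."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(sizes[i])]
         for _ in range(sizes[i + 1])]
        for i in range(len(sizes) - 1)
    ]


def is_homogeneous(weights, x, c, degree, tol=1e-9):
    """Check mlp(c * w; x) == c**degree * mlp(w; x) up to a tolerance."""
    base = mlp(weights, x)
    scaled = mlp(scale(weights, c), x)
    return all(
        abs(s - (c ** degree) * b) <= tol * max(1.0, abs(b) * c ** degree)
        for s, b in zip(scaled, base)
    )


rng = random.Random(0)                      # fixed seed, arbitrary choice
weights = random_weights([4, 5, 5, 1], rng)  # three layers, so H = 3
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
H = len(weights)

print("3-homogeneous:", is_homogeneous(weights, x, c=2.0, degree=H))
```

The check relies on ReLU being positively homogeneous of degree one, so each of the three weight matrices contributes one factor of $c$. Adding biases would break the identity, which is consistent with the proposition's remark that biases are omitted for convenience.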

Theorems & Definitions (40)

Definition 1
Proposition 1: Homogeneity Construction
Lemma 1
Proposition 2
Theorem 1: Generalization Under Homogeneity
Corollary 1
Corollary 2
Theorem 2
Example 3: Compatibility between Optimization and Generalization
Definition 2: Conditional On-Average Stability
...and 30 more
