Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks

Wenquan Ma, Yang Sui, Jiaye Teng, Bohan Wang, Jing Xu, Jingqin Yang

TL;DR

Generalization bounds are derived for the homogeneous neural network regime, proving that this regime enables slower stepsize decay of order $\Omega(1/\sqrt{t})$ under mild assumptions.

Abstract

Algorithmic stability is among the most potent techniques in generalization analysis. However, its derivation usually requires a stepsize $\eta_t = \mathcal{O}(1/t)$ under non-convex training regimes, where $t$ denotes the iteration index. This rigid decay of the stepsize potentially impedes optimization and may not align with practical scenarios. In this paper, we derive generalization bounds for the homogeneous neural network regime, proving that this regime enables slower stepsize decay of order $\Omega(1/\sqrt{t})$ under mild assumptions. We further extend the theoretical results in several directions, e.g., to non-Lipschitz regimes. This finding is broadly applicable, as homogeneous neural networks encompass fully-connected and convolutional neural networks with ReLU and LeakyReLU activations.
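
To make the contrast between the two decay rates concrete, the following minimal sketch compares a $1/t$ schedule with a $1/\sqrt{t}$ schedule. The function names, the initial stepsize `eta0 = 0.1`, the horizon `T`, and the use of the cumulative stepsize as a rough proxy for how far the iterates can move are all assumptions made for illustration; none of them come from the paper.

```python
import math


def stepsize_inverse_t(t, eta0=0.1):
    """Hypothetical schedule eta_t = eta0 / t (the O(1/t) decay)."""
    return eta0 / t


def stepsize_inverse_sqrt_t(t, eta0=0.1):
    """Hypothetical schedule eta_t = eta0 / sqrt(t) (the slower decay)."""
    return eta0 / math.sqrt(t)


def cumulative_stepsize(schedule, T):
    """Sum of the stepsizes over iterations t = 1, ..., T."""
    return sum(schedule(t) for t in range(1, T + 1))


T = 10_000  # illustrative horizon, not a value from the paper
sum_inv_t = cumulative_stepsize(stepsize_inverse_t, T)
sum_inv_sqrt = cumulative_stepsize(stepsize_inverse_sqrt_t, T)

# The 1/t sum grows like log(T); the 1/sqrt(t) sum grows like sqrt(T).
print(f"sum of eta0/t       over {T} steps: {sum_inv_t:.3f}")
print(f"sum of eta0/sqrt(t) over {T} steps: {sum_inv_sqrt:.3f}")
```

Both schedules start from the same stepsize at $t = 1$, yet the cumulative stepsize of the $1/t$ schedule grows only logarithmically in `T`, which is one informal way to read the abstract's remark that rigid decay can impede optimization.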

Paper Structure (31 sections, 22 theorems, 111 equations, 4 figures, 2 tables)

Key Result

Proposition 1

Consider two types of layers: For convenience, we omit the bias when the context is clear. The $h$-layer neural network $\Phi_H({\boldsymbol{w}}; {X})$ with $h-H$ normalized layers $\mathcal{N}_{\boldsymbol{w}}$ and $H$ unnormalized layers $\mathcal{U}_{\boldsymbol{w}}$ is $H$-homogeneous, where
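
As a numerical illustration of the $H$-homogeneity named in this proposition, the sketch below checks the scaling identity $\Phi(c\,{\boldsymbol{w}}; X) = c^{H}\,\Phi({\boldsymbol{w}}; X)$ for $c > 0$ on a small bias-free ReLU network. It assumes the standard definition of homogeneity in the parameters. The layer sizes, the choice of a Frobenius-norm weight normalization as the "normalized" layer, and all function names are hypothetical; the paper's exact layer definitions are not reproduced here.

```python
import numpy as np


def normalized_layer(W, x):
    """Hypothetical 0-homogeneous layer: W is divided by its Frobenius norm,
    so rescaling W by any c > 0 leaves the output unchanged."""
    return np.maximum((W / np.linalg.norm(W)) @ x, 0.0)


def network(W_norm, unnormalized, x):
    """One normalized layer followed by a bias-free ReLU MLP
    (ReLU after every unnormalized layer except the last)."""
    out = normalized_layer(W_norm, x)
    for W in unnormalized[:-1]:
        out = np.maximum(W @ out, 0.0)
    return unnormalized[-1] @ out


rng = np.random.default_rng(0)
W_norm = rng.standard_normal((8, 5))          # normalized layer weights
unnormalized = [
    rng.standard_normal((8, 8)),
    rng.standard_normal((8, 8)),
    rng.standard_normal((1, 8)),
]                                             # H = 3 unnormalized layers
x = rng.standard_normal(5)

H = len(unnormalized)
c = 2.5  # any positive scaling factor

# Scale every parameter by c and compare against c**H times the original output.
lhs = network(c * W_norm, [c * W for W in unnormalized], x)
rhs = c**H * network(W_norm, unnormalized, x)
print(np.allclose(lhs, rhs))
```

The identity holds because ReLU commutes with positive scaling, so each unnormalized layer contributes one factor of `c`, while the normalized layer contributes none.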

Figures (4)

  • Figure 1: The training and test accuracy curves for SGD with various stepsize schedulers on CIFAR-10 using ResNet 18. Models trained with stepsize $\Theta(1/\sqrt{t})$ (dotted line) converge faster and perform better than those trained with stepsize $\Theta(1/t)$ (solid line).
  • Figure 2: (left) Training accuracy of SGD with different schedulers over a three-layer ReLU network. The initial learning rate is selected via grid search over $\{10^{-5}, 10^{-4},\cdots, 10^1\}$. SGD with stepsize $\Theta(1/t\Vert {\boldsymbol{w}}_t \Vert )$ achieves similar training accuracy to SGD with stepsize $\Theta(1/\sqrt{t})$, both outperforming SGD with stepsize $\Theta(1/t)$; (right) When $T_a = o(n/\tilde{\gamma}^{1/m_1})$, the training and test loss curves do not decrease for $t \in (T_a, o(n/\tilde{\gamma}^{1/m_1}))$.
  • Figure 3: An instance of the construction for $H=3$. Following the rules in Proposition 1, normalized layers (left) accommodate arbitrary activations and connections with $0$-homogeneity, while the subsequent unnormalized layers (right) compose a $3$-layer MLP to achieve the desired $3$-homogeneity.

Theorems & Definitions (40)

  • Definition 1
  • Proposition 1: Homogeneity Construction
  • Lemma 1
  • Proposition 2
  • Theorem 1: Generalization Under Homogeneity
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Example 3: Compatibility between Optimization and Generalization
  • Definition 2: Conditional On-Average Stability
  • ...and 30 more