Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

Yunwen Lei, Yufeng Xie

Abstract

Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization despite having more parameters than training examples. A promising direction for explaining benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm of the weights themselves. However, existing initialization-dependent complexity analyses cannot fully exploit the power of initialization: the associated bounds depend on the spectral norm of the initialization matrix, which can scale as the square root of the width, rendering the bounds ineffective for overparameterized models. In this paper, we develop the first \emph{fully} initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy only a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization and are derived by introducing a new peeling technique to handle the challenge posed by the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
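
To make the central quantity concrete, the following is a minimal sketch under the assumption of a standard one-hidden-layer architecture with $m$ hidden units, input dimension $d$, and $c$ outputs; the paper's exact definition (Definition 2) may differ in details such as normalization:
$$
f_{\mathbf{W},\mathbf{V}}(\mathbf{x}) = \mathbf{V}\,\sigma(\mathbf{W}\mathbf{x}),\qquad
\|(\mathbf{W},\mathbf{V})\|_{\mathrm{path}} = \sum_{i=1}^{c}\sum_{j=1}^{m}\sum_{k=1}^{d}\big|\mathbf{V}_{i,j}\big|\,\big|\mathbf{W}_{j,k}\big|,
$$
where $\mathbf{W}\in\mathbb{R}^{m\times d}$ and $\mathbf{V}\in\mathbb{R}^{c\times m}$. One natural form of the initialization-dependent quantity, used here purely for illustration, evaluates the same expression at the distance from initialization, i.e., at $(\mathbf{W}-\mathbf{W}^{(0)},\mathbf{V}-\mathbf{V}^{(0)})$ rather than at $(\mathbf{W},\mathbf{V})$.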

Paper Structure

This paper contains 15 sections, 11 theorems, 94 equations, 3 figures, and 1 table.

Key Result

Theorem 1

Let $\mathcal{G}$ be defined in Eq. \eqref{gcal} and let $\gamma(\cdot)$ be $G_\gamma$-Lipschitz. Then, we have ... and ..., where
$$c_m=2\sqrt{2}\Big(1+\frac{1}{2\log(2mc)}\Big)\log^{\frac{1}{2}}\Big(2mc\Big\lceil\log_2\big(2R_WR_V(cm)^{\frac{1}{2}}\big/\sup_{\mathbf{W},\mathbf{V}}\kappa(\mathbf{W},\mathbf{V})\big)\Big\rceil\Big).$$
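
Reading the width dependence off the displayed constant (a sketch; $R_W$, $R_V$, $c$, and $\kappa$ are as defined in the paper), the ceiling term only enters through an iterated logarithm and the prefactor is $O(1)$, so
$$c_m = O\Big(\sqrt{\log(mc) + \log\log\big(R_W R_V\sqrt{cm}\,\big/\,{\textstyle\sup_{\mathbf{W},\mathbf{V}}}\kappa(\mathbf{W},\mathbf{V})\big)}\Big),$$
which is the logarithmic (in fact $\sqrt{\log}$) dependence on the width $m$ referred to in the abstract.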

Figures (3)

  • Figure 1: Behavior of $\|\mathbf{W}^{(0)}\|_\sigma$ and the path norm with respect to the number of hidden units $m$ on the MNIST dataset (a minimal computation sketch follows this list). The gray area represents the value range over different random trials.
  • Figure 2: Dominant terms in the generalization bounds summarized in Table \ref{tab:comp-gen} on the MNIST dataset. The gray area represents the value range over different random trials.
  • Figure 3: Comparison of existing generalization bounds, including all terms and constants, on the MNIST and ijcnn1 datasets. The labels correspond to the index numbers in Table \ref{tab:comp-gen}. The gray area represents the value range over different random trials. SPN denotes generalization bounds based on the standard path norm, i.e., Eq. \eqref{spn}, and PN denotes our generalization bounds based on the path norm, i.e., Eq. \eqref{constant-bound}.
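
The two quantities plotted in Figure 1 can be computed directly from the weight matrices. The snippet below is a minimal sketch, assuming a one-hidden-layer architecture, He-style Gaussian initialization, and the illustrative path-norm form given earlier; the widths, scales, and "trained" weights are placeholders rather than the paper's experimental setup.

    import numpy as np

    def spectral_norm(A):
        # ||A||_sigma: the largest singular value of A.
        return np.linalg.norm(A, 2)

    def path_norm(W, V):
        # Illustrative path norm: sum over input->hidden->output paths of the
        # product of absolute weights, i.e. sum_{i,j,k} |V[i,j]| * |W[j,k]|.
        return float(np.sum(np.abs(V) @ np.abs(W)))

    d, c = 784, 10                        # MNIST-like input dimension / class count
    rng = np.random.default_rng(0)

    for m in (100, 1000, 10000):          # hidden widths (hypothetical grid)
        W0 = rng.normal(0.0, np.sqrt(2.0 / d), size=(m, d))  # He-style init (assumption)
        V0 = rng.normal(0.0, np.sqrt(2.0 / m), size=(c, m))
        # Stand-ins for trained weights: a small perturbation of the initialization,
        # just so the distance from initialization is non-zero.
        W = W0 + 0.01 * rng.normal(size=W0.shape)
        V = V0 + 0.01 * rng.normal(size=V0.shape)
        print(m,
              spectral_norm(W0),                 # grows roughly like sqrt(m/d) under this init
              path_norm(W - W0, V - V0))         # path norm of the distance from initialization

Under the paper's argument, the second quantity is the one that should enter the bound, since the spectral norm of the initialization grows with the width while the distance from initialization can stay much smaller.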

Theorems & Definitions (28)

  • Remark 1: Initialization-dependency
  • Definition 1: Vector-valued Rademacher complexity (Maurer, 2016)
  • Definition 2: Path-norm
  • Remark 2: Standard path-norm
  • Theorem 1: Complexity Bounds
  • Remark 3: Comparison
  • Remark 4: Idea and Novelty
  • Theorem 2: Lower Bounds
  • Remark 5
  • Definition 3: Lipschitzness
  • ...and 18 more