A Generalization Bound for Nearly-Linear Networks

Eugene Golikov

TL;DR

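This paper presents novel generalization bounds that become non-vacuous for networks that are close to being linear. The bounds are a-priori, meaning they can be evaluated without performing the actual training; to the best of the author's knowledge, they are the first non-vacuous generalization bounds for neural nets with this property.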

Abstract

We consider nonlinear networks as perturbations of linear ones. Based on this approach, we present novel generalization bounds that become non-vacuous for networks that are close to being linear. The main advantage over the previous works which propose non-vacuous generalization bounds is that our bounds are a-priori: performing the actual training is not required for evaluating the bounds. To the best of our knowledge, they are the first non-vacuous generalization bounds for neural nets possessing this property.

Paper Structure

This paper contains 58 sections, 5 theorems, 117 equations, 9 figures.

Key Result

Theorem 4.2

Fix $\beta, \gamma > 0$, $t \geq 0$, $\delta \in (0,1)$, $\epsilon \in [0,1]$, and $\kappa \in \{1,2\}$. Let $p$ be the floating-point arithmetic precision (32 by default). Under the setting of sec:setup and ass:loss_proj, for any weight initialization satisfying $\|W^\epsilon_l(0)\| \leq \beta$ ... where $\rho = \tfrac{\|W^\epsilon_1(0)\|_F}{\|W^\epsilon_1(0)\|}$ is the square ...

Figures (9)

  • Figure 1: We consider 7x7 binary MNIST, $L=2$, $\kappa=2$, $\epsilon=0.001$, and vary $\beta$. The bound of \ref{thm:general_bound_binary} converges as $\beta$ vanishes and increases as $\beta$ grows. The bound stays non-vacuous for a small enough $\beta$ and a properly chosen $\gamma$. We consider $\gamma = \beta^2 / q$ for $q \in \{1, 10, 100\}$.
  • Figure 2: We consider 7x7 binary MNIST, $L=2$, $\beta=0.001$, $\epsilon=0.001$, and compare the two values of $\kappa$ in \ref{thm:general_bound_binary}. The bound for $\kappa = 2$ is much stronger than that for $\kappa = 1$.
  • Figure 3: We consider 7x7 binary MNIST, $L=2$, $\beta=0.001$, $\epsilon=0.01$, $\kappa=2$, and vary the stable rank at initialization $\rho$ and the floating-point precision $p$. Initializing the input layer with a rank-one matrix considerably improves the bound. It also improves the convergence speed.
  • Figure 4: We consider 7x7 binary MNIST, $L=2$, $\beta=0.001$, $\epsilon=0.01$, $\kappa=2$, and compare different components of the bound. The left panel shows the full bound, the central panel omits the generalization gap bound for the proxy model $\Upsilon_\kappa$, and the rightmost panel omits the deviation term $\tfrac{\Delta_{\kappa,\beta}\epsilon^\kappa}{\gamma}$. Both terms are of the same order, so both have to be reduced in order to reduce the overall bound.
  • Figure 5: We consider binary MNIST, $L=2$, $\beta=0.001$, $\epsilon=0.001$, $\kappa=2$, $p=16$, rank-one input layer initialization, and vary the image dimensions. In this "gentle" scenario, the bound stays non-vacuous for 14x14 MNIST and only slightly exceeds the random-guess risk for the full-sized 28x28 MNIST.
  • ...and 4 more figures
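
The ratio $\rho = \|W\|_F / \|W\|$ that appears in Theorem 4.2 and is varied in Figure 3 can be illustrated numerically. The following is a minimal sketch, not the paper's code: it compares $\rho$ for a Gaussian matrix and for a rank-one matrix $u v^\top$. The layer shape (16 hidden units, 49 inputs for 7x7 images), the helper names, and the use of power iteration to estimate the spectral norm are all assumptions made for illustration. For any matrix $\rho \geq 1$, with equality exactly for rank-one matrices; $\rho^2$ is the quantity commonly called the stable rank.

```python
import math
import random


def frobenius_norm(W):
    """Square root of the sum of squared entries of W."""
    return math.sqrt(sum(x * x for row in W for x in row))


def spectral_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        # u = W v
        u = [sum(w * x for w, x in zip(row, v)) for row in W]
        # v = W^T u, then normalize
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    # For unit-norm v, ||W v|| approximates the top singular value from below.
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))


def norm_ratio(W):
    """rho = ||W||_F / ||W||, the ratio defined in the theorem statement."""
    return frobenius_norm(W) / spectral_norm(W)


# Hypothetical layer shape: 16 hidden units, 49 = 7x7 input pixels.
rng = random.Random(1)
n_hidden, n_inputs = 16, 49

# Dense Gaussian initialization.
W_gauss = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_hidden)]

# Rank-one initialization W = u v^T.
u_vec = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
v_vec = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
W_rank_one = [[ui * vj for vj in v_vec] for ui in u_vec]

rho_gauss = norm_ratio(W_gauss)
rho_rank_one = norm_ratio(W_rank_one)
print("rho (Gaussian init):", rho_gauss)
print("rho (rank-one init):", rho_rank_one)
```

Because $\rho$ is invariant under rescaling of $W$, either matrix can afterwards be rescaled so that its spectral norm meets the constraint $\|W^\epsilon_l(0)\| \leq \beta$ without changing the ratio.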

Theorems & Definitions (8)

  • Theorem 4.2
  • Lemma 6.1
  • Lemma 6.2
  • proof
  • Lemma 6.3
  • Lemma 6.4
  • Conjecture C.1
  • proof: Proof of \ref{conj:gradient_alignment} for $\epsilon = 0$