Exploring Generalization in Deep Learning

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro

TL;DR

The paper tackles why deep networks generalize despite massive parameter counts by evaluating norm-based, margin-based, Lipschitz, and sharpness measures, and linking sharpness to PAC-Bayes theory. It proposes scale-aware margins and path-norm-inspired capacity bounds, and derives a depth-linear sharpness bound within a PAC-Bayes framework, under explicit conditions. Empirically, joint measures combining expected sharpness with norms explain generalization trends (e.g., true vs. random labels, network scaling) better than sharpness alone, though no single metric captures all observed phenomena. It highlights optimization-induced implicit regularization as a key factor and outlines future work to connect learning dynamics with capacity control.

Abstract

With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

Paper Structure

This paper contains 16 sections, 4 theorems, 35 equations, 8 figures.

Key Result

Theorem 1

Let $\boldsymbol{\nu}_i$ be a random $h_i \times h_{i-1}$ matrix with each entry distributed according to $\mathcal{N}(0,\sigma_i^2)$. Then, under the conditions $C1, C2, C3$, with probability $\geq 1-\delta$, where $\gamma_i = \frac{\sigma_i \sqrt{h_i} \sqrt{h_{i-1}}}{\mu^2 \|W_i\|_F}$ and $C_{\delta}=2\sqrt{\ln(dh/\delta)}$.
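The two quantities defined at the end of Theorem 1, $\gamma_i$ and $C_{\delta}$, can be evaluated directly from a weight matrix. The following is a minimal sketch in plain Python, not the authors' code. It assumes that $d$ is the network depth and $h$ the number of hidden units per layer (the statement above does not define them), that $\mu$ is supplied by the caller, and that a weight matrix is stored as a list of rows. All function names are illustrative.

```python
import math

def frobenius_norm(W):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))

def gamma_i(W, sigma, mu):
    """gamma_i = sigma_i * sqrt(h_i) * sqrt(h_{i-1}) / (mu^2 * ||W_i||_F).

    W has shape h_i x h_{i-1}; sigma is the per-entry standard deviation
    of the Gaussian perturbation; mu is taken as given (assumption).
    """
    h_out, h_in = len(W), len(W[0])
    return sigma * math.sqrt(h_out) * math.sqrt(h_in) / (mu ** 2 * frobenius_norm(W))

def c_delta(d, h, delta):
    """C_delta = 2 * sqrt(ln(d * h / delta)).

    Assumption: d is the depth and h the layer width.
    """
    return 2.0 * math.sqrt(math.log(d * h / delta))

# Toy 2 x 2 layer with Frobenius norm 5 (values chosen for illustration only).
W = [[3.0, 0.0], [0.0, 4.0]]
g = gamma_i(W, sigma=0.5, mu=1.0)
c = c_delta(d=10, h=1000, delta=0.05)
```

Because $\gamma_i$ divides by $\|W_i\|_F$, rescaling a layer's weights and its perturbation scale $\sigma_i$ by the same factor leaves $\gamma_i$ unchanged.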

Figures (8)

  • Figure 1: Comparing different complexity measures on a VGG network trained on subsets of the CIFAR10 dataset with true (blue line) or random (red line) labels. We plot norm divided by margin to avoid scaling issues (see Section \ref{sec:summary}); for each complexity measure, we drop the terms that depend only on depth or number of hidden units, e.g. for the $\ell_2$-path norm we plot $\gamma_{\text{margin}}^{-2}\sum_{j \in \prod_{k=0}^d[h_k]}\prod_{i=1}^d W_i^2[j_i,j_{i-1}]$. We also set the margin over the training set $S$ to be the 5th percentile of the margins of the data points in $S$, i.e. $\text{Prc}_5\left\{f_\mathbf{w}(x_i)[y_i] - \max_{y\neq y_i} f_\mathbf{w}(x_i)[y] \mid (x_i,y_i)\in S\right\}$. In all experiments, the training error of the learned network is zero. The plots indicate that these measures can explain the generalization, as the complexity of the model learned with random labels is always higher than that of the model learned with true labels. Furthermore, the gap between the complexity of models learned with true and random labels increases as we increase the size of the training set.
  • Figure 2: Sharpness and PAC-Bayes measures on a VGG network trained on subsets of the CIFAR10 dataset with true or random labels. In the left panel, we plot max sharpness, which we calculate as suggested by Keskar et al. (2016), where the perturbation for parameter $w_i$ has magnitude $5 \cdot 10^{-4}(\left\lvert{w_i}\right\rvert+1)$. The middle and right plots show the relationship between expected sharpness and KL divergence in the PAC-Bayes analysis for true and random labels, respectively. For the PAC-Bayes plots, each point corresponds to a choice of the variable $\alpha$, where the standard deviation of the perturbation for parameter $i$ is $\alpha(10\left\lvert{w_i}\right\rvert+1)$. The $KL$ corresponding to each $\alpha$ is simply a weighted $\ell_2$ norm, where the weight for each parameter is the inverse of the standard deviation of the perturbation.
  • Figure 3: Experiments on global minima with poor generalization. For each experiment, a VGG network is trained on the union of a subset of the CIFAR10 dataset of size 10000 containing samples with true labels and another subset of the CIFAR10 dataset of varying size containing random labels. The learned networks are all global minima for the objective function on the subset with true labels. The left plot shows the training and test errors as a function of the size of the set with random labels. The middle plot shows how different measures change with the size of the set with random labels. The right plot shows the relationship between expected sharpness and KL in PAC-Bayes for each of the experiments. Measures are calculated as explained in Figures 1 and 2.
  • Figure 4: The generalization of a two-layer perceptron trained on the MNIST dataset with a varying number of hidden units. The left plot shows the training and test errors. The test error decreases as the size increases. The middle plot shows different measures for each of the trained networks. The right plot shows the relationship between expected sharpness and KL in PAC-Bayes for each of the experiments. Measures are calculated as explained in Figures 1 and 2.
  • Figure 5: Verifying the conditions of Theorem 1 on a 10-layer perceptron with 1000 hidden units in each layer, i.e. more than 10,000,000 parameters, on MNIST. We have numerically checked that all values are within the displayed range. Left: $C1$: condition number of the network, i.e. $\frac{1}{\mu}$. Middle: $C2$: the ratio of activations that flip based on the magnitude of the perturbation. Right: $C3$: the ratio of the norm of incoming weights to each hidden unit with respect to the average of the same quantity over hidden units in the layer.
  • ...and 3 more figures
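
The quantity plotted for the $\ell_2$-path norm in Figure 1 can be sketched in a few lines. The sum over all paths of products of squared weights equals a forward pass of an all-ones vector through the element-wise squared weight matrices, so it never needs to enumerate paths. The code below is an illustrative sketch in plain Python, not the authors' implementation: the function names are hypothetical, bias terms and convolutional structure are ignored, and the percentile uses linear interpolation between order statistics, which is an assumed convention the caption does not specify.

```python
def margins(scores, labels):
    """Per-example margin: true-class score minus the best other-class score."""
    out = []
    for s, y in zip(scores, labels):
        best_other = max(v for k, v in enumerate(s) if k != y)
        out.append(s[y] - best_other)
    return out

def percentile(values, q):
    """q-th percentile, linearly interpolated (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def squared_path_sum(weights):
    """Sum over all input-to-output paths of the product of squared weights.

    weights[i] is a list of rows with shape h_{i+1} x h_i. Propagating an
    all-ones vector through the squared matrices accumulates every path.
    """
    v = [1.0] * len(weights[0][0])
    for W in weights:
        v = [sum(w * w * x for w, x in zip(row, v)) for row in W]
    return sum(v)

def l2_path_over_margin(weights, scores, labels, q=5.0):
    """Squared-path sum divided by the squared q-th percentile margin."""
    gamma = percentile(margins(scores, labels), q)
    return squared_path_sum(weights) / gamma ** 2

# Toy network: 2 inputs -> 2 hidden units -> 1 output (illustrative values).
toy_weights = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0]]]
toy_value = l2_path_over_margin(toy_weights, scores=[[3.0, 1.0]], labels=[0])
```

Dividing by the squared margin is what makes the measure insensitive to rescaling the output: multiplying the last layer by a constant scales both the path sum and the squared margin by the square of that constant.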

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • proof
  • proof
  • Lemma 3
  • proof