Implicit Regularization in Deep Learning

Behnam Neyshabur

TL;DR

The thesis argues that deep learning generalization arises largely from implicit regularization inherent in optimization rather than explicit architectural capacity. It develops a unified theory linking norm-based capacity control, margin-based PAC-Bayes bounds, and robustness concepts to explain generalization in overparameterized networks. A central contribution is the Path-Norm framework and Path-SGD, which respect invariances of neural networks and yield improved generalization in both feedforward and recurrent models. It further introduces data-dependent path normalization, bridging Batch Normalization and Path-SGD, and demonstrates through extensive theory and experiments that optimization geometry critically shapes learning outcomes. The work provides practical algorithms and theoretical insights that illuminate why large neural networks can generalize well and how to design training procedures that further enhance this generalization.

Abstract

In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method is playing a key role in generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can implicitly regularize complexity measures. We empirically investigate the ability of these measures to explain different observed phenomena in deep learning. We further study the invariances in neural networks, suggest complexity measures and optimization algorithms that have similar invariances to those in neural networks and evaluate them on a number of learning tasks.
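To make the invariance idea concrete, the following is a minimal sketch rather than the thesis's implementation. It assumes the usual definition of the $\ell_2$ path-norm (the square root of the sum, over all input-to-output paths, of the product of squared weights along the path) and a two-layer ReLU network. The helper names `relu_net`, `path_norm`, `l2_norm`, and `rescale_unit` are hypothetical. The sketch rescales one hidden unit (incoming weights multiplied by `c`, outgoing weights divided by `c`) and compares the network output, the path-norm, and the plain Euclidean norm of the weights before and after.

```python
import math

def relu_net(W1, W2, x):
    """Two-layer ReLU network: W1 is hidden x input, W2 is output x hidden."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def path_norm(W1, W2):
    """l2 path-norm (assumed definition): sqrt of the sum over all
    input -> hidden -> output paths of the squared product of weights."""
    total = 0.0
    for out_row in W2:
        for j, w_out in enumerate(out_row):
            total += sum((w_out * w_in) ** 2 for w_in in W1[j])
    return math.sqrt(total)

def l2_norm(W1, W2):
    """Euclidean norm of all weights, the quantity weight decay penalizes."""
    return math.sqrt(sum(w * w for W in (W1, W2) for row in W for w in row))

def rescale_unit(W1, W2, j, c):
    """Scale incoming weights of hidden unit j by c > 0 and outgoing by 1/c."""
    W1s = [[w * c for w in row] if i == j else list(row)
           for i, row in enumerate(W1)]
    W2s = [[w / c if i == j else w for i, w in enumerate(row)] for row in W2]
    return W1s, W2s

# Illustrative weights and input (arbitrary values, not from the thesis).
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
W2 = [[2.0, -1.0, 0.5]]
x = [0.3, -0.7]

V1, V2 = rescale_unit(W1, W2, j=0, c=10.0)

print("output before/after:   ", relu_net(W1, W2, x), relu_net(V1, V2, x))
print("path-norm before/after:", path_norm(W1, W2), path_norm(V1, V2))
print("l2 norm before/after:  ", l2_norm(W1, W2), l2_norm(V1, V2))
```

Under these assumptions the rescaled network computes the same function and has the same path-norm, while the Euclidean norm of the weights changes, which is the kind of mismatch between a complexity measure and the network's invariances that the abstract refers to.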

Paper Structure

This paper contains 110 sections, 37 theorems, 193 equations, 18 figures, 5 tables.

Key Result

Lemma 1

Let $f_\mathbf{w}(\mathbf{x}):\mathcal{X}\rightarrow \mathbb{R}^{k}$ be any predictor (not necessarily a neural network) with parameters $\mathbf{w}$, and let $P$ be any distribution on the parameters that is independent of the training data. For any $\gamma>0$, consider any set $\mathcal{S}_{\mathbf{w}}$ … Let $\mathbf{u}$ be a random variable such that $\mathbb{P}\left[\mathbf{u}\in \mathcal{S}_{\mathbf{w}}\right]$ …

Figures (18)

  • Figure 1: The training error and the test error based on different stopping criteria when 2-layer NNs with different numbers of hidden units are trained on MNIST and CIFAR-10. Images in both datasets are downsampled to 100 pixels. The size of the training set is 50000 for MNIST and 40000 for CIFAR-10. Early stopping is based on the error on a validation set (separate from the training and test sets) of size 10000. Training was done using stochastic gradient descent with momentum and mini-batches of size 100. The network was initialized with weights drawn randomly from a Gaussian distribution. The initial step size and momentum were set to 0.1 and 0.5 respectively. After each epoch, we used the update rule $\mu^{(t+1)}=0.99\mu^{(t)}$ for the step size and $m^{(t+1)}=\min\{0.9,m^{(t)}+0.02\}$ for the momentum.
  • Figure 2: The training error and the test error based on different stopping criteria when 2-layer NNs with different numbers of hidden units are trained on small subsets of MNIST and CIFAR-10. Images in both datasets are downsampled to 100 pixels. The sizes of the training and validation sets are 2000 for both MNIST and CIFAR-10, and early stopping is based on the error on the validation set. The top plots are the errors for the original datasets with and without explicit regularization. The best weight decay parameter is chosen based on the validation error. The middle plots are on the censored dataset, constructed by switching all the labels to agree with the predictions of a trained network with a small number $H_0$ of hidden units ($H_0=4$ on MNIST and $H_0=16$ on CIFAR-10) on the entire dataset (train+test+validation). The bottom plots are also for the censored data, except that we add 5 percent noise to the labels by randomly changing 5 percent of them. The optimization method is the same as in Figure 1. The results in this figure are the average error over 5 random repetitions.
  • Figure 3: Verifying the conditions of Theorem \ref{thm:relu} on a 10-layer perceptron with 1000 hidden units in each layer, i.e. more than 10,000,000 parameters, on MNIST. We have numerically checked that all values are within the displayed range. Left: $C1$: condition number of the network, i.e. $\frac{1}{\mu}$. Middle: $C2$: the ratio of activations that flip based on the magnitude of perturbation. Right: $C3$: the ratio of the norm of incoming weights to each hidden unit to the average of the same quantity over the hidden units in the layer.
  • Figure 4: Condition $C1$: condition number $\frac{1}{\mu}$ of the network and its decomposition into two cases, for random initialization and learned weights. Top: random initialization. Bottom: learned weights. Left: distribution of all combinations of $a\leq c\leq b-1$. Middle: when $a<c<b-1$. Right: when $c=a$ or $c=b-1$.
  • Figure 5: Ratio of activations that flip based on the magnitude of perturbation. Left: random initialization. Middle: learned weights. Right: learned weights (zoomed in).
  • ...and 13 more figures
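
The per-epoch schedule described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of the two update rules only, not the authors' training code. The function name `schedule` and its parameter names are hypothetical; the default values (initial step size 0.1, initial momentum 0.5, decay 0.99, momentum increment 0.02, momentum cap 0.9) are taken from the caption.

```python
def schedule(num_epochs, step0=0.1, mom0=0.5,
             decay=0.99, mom_inc=0.02, mom_cap=0.9):
    """Return a list of (step size, momentum) pairs, one per epoch.

    After each epoch the step size is multiplied by `decay` and the
    momentum is increased by `mom_inc`, capped at `mom_cap`.
    """
    step, mom = step0, mom0
    values = []
    for _ in range(num_epochs):
        values.append((step, mom))
        step = decay * step                # mu^(t+1) = 0.99 * mu^(t)
        mom = min(mom_cap, mom + mom_inc)  # m^(t+1) = min{0.9, m^(t) + 0.02}
    return values

# Print the values used in the first few epochs.
for epoch, (step, mom) in enumerate(schedule(5)):
    print(epoch, step, mom)
```

With these defaults the momentum rises linearly and saturates at 0.9 after roughly 20 epochs, while the step size keeps decaying geometrically.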

Theorems & Definitions (65)

  • Lemma 1
  • proof
  • Claim 2
  • proof
  • Theorem 3
  • Corollary 4
  • Theorem 5
  • proof
  • Theorem 6
  • Theorem 7
  • ...and 55 more