Geometry of Optimization and Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, Nathan Srebro

TL;DR

The paper asks why deep networks generalize well even when their size alone would suggest overfitting, and argues that the geometry of optimization acts as an implicit regularizer. It introduces Path-SGD, a rescaling-invariant optimizer tied to the path-norm $\phi_p(w) = \|\pi(w)\|_p$, to exploit this implicit regularization in practice. Through theory and experiments on standard benchmarks, it shows that the choice of optimization geometry can affect generalization, and that wider networks can generalize without explicit regularization. The work highlights the potential of designing optimizers with problem-specific invariances and suggests extending these ideas to larger architectures and convolutional networks.

Abstract

We argue that the optimization plays a crucial role in generalization of deep learning models through implicit regularization. We do this by demonstrating that generalization ability is not controlled by network size but rather by some other implicit control. We then demonstrate how changing the empirical optimization procedure can improve generalization, even if actual optimization quality is not affected. We do so by studying the geometry of the parameter space of deep networks, and devising an optimization algorithm attuned to this geometry.

Paper Structure

This paper contains 9 sections, 2 theorems, 12 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Lemma 4.1

$\phi_p(w) = \min_{\tilde{w} \sim w} (\mu_{p,\infty}(\tilde{w}))^d$
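
The quantities in Lemma 4.1 can be illustrated with a small sketch. It assumes a layered feed-forward network with a single output unit, stored as a list of weight matrices, and it assumes that $\mu_{p,\infty}(w)$ denotes the largest $\ell_p$ norm of the incoming weights over all units (this page does not define it). The function names `path_norm`, `max_unit_norm`, and `rescale_unit` are hypothetical helpers for this illustration, not code from the paper. The sketch checks two things: the path-norm is unchanged when one hidden unit is rescaled, and it does not exceed $\mu_{p,\infty}(w)^d$, which is consistent with the minimum in the lemma.

```python
def path_norm(layers, p=2.0):
    """l_p norm of the vector of path products, via dynamic programming.

    layers[k][j][i] is the weight from unit i of layer k to unit j of layer k+1.
    """
    n_in = len(layers[0][0])
    # acc[i] = sum over all input-to-unit-i paths of prod |w_e|^p
    acc = [1.0] * n_in
    for W in layers:
        acc = [sum(abs(w) ** p * a for w, a in zip(row, acc)) for row in W]
    return sum(acc) ** (1.0 / p)


def max_unit_norm(layers, p=2.0):
    """Largest l_p norm of incoming weights over all units (assumed mu_{p,inf})."""
    return max(
        sum(abs(w) ** p for w in row) ** (1.0 / p)
        for W in layers
        for row in W
    )


def rescale_unit(layers, k, j, c):
    """Copy of the network where hidden unit j (output of layer k) has its
    incoming weights multiplied by c > 0 and its outgoing weights divided by c."""
    new = [[list(row) for row in W] for W in layers]
    new[k][j] = [w * c for w in new[k][j]]
    for row in new[k + 1]:
        row[j] /= c
    return new


# Toy network with depth d = 2: two inputs, two hidden units, one output.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
net = [W1, W2]
depth = len(net)

phi = path_norm(net, p=2.0)
mu = max_unit_norm(net, p=2.0)
phi_rescaled = path_norm(rescale_unit(net, 0, 0, 10.0), p=2.0)

print("path-norm:", phi)
print("path-norm after rescaling:", phi_rescaled)
print("mu^d upper bound:", mu ** depth)
```

The dynamic program avoids enumerating paths explicitly: each layer multiplies the running per-unit sums by the elementwise $|w|^p$ weights, so the cost is linear in the number of weights.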

Figures (3)

  • Figure 1: The training error and the test error based on different stopping criteria when 2-layer NNs with different numbers of hidden units are trained on MNIST and CIFAR-10. Images in both datasets are downsampled to 100 pixels. The size of the training set is 50000 for MNIST and 40000 for CIFAR-10. Early stopping is based on the error on a validation set (separate from the training and test sets) of size 10000. The training was done using stochastic gradient descent with momentum and mini-batches of size 100. The network was initialized with weights generated randomly from a Gaussian distribution. The initial step size and momentum were set to 0.1 and 0.5 respectively. After each epoch, we used the update rule $\mu^{(t+1)}=0.99\mu^{(t)}$ for the step size and $m^{(t+1)}=\min\{0.9,m^{(t)}+0.02\}$ for the momentum.
  • Figure 2: (a): Evolution of the cross-entropy error function when training a feed-forward network on MNIST with two hidden layers, each containing 4000 hidden units. The unbalanced initialization (blue curve) is generated by applying a sequence of rescaling functions to the balanced initialization (red curve). (b): Updates for a simple case where the input is $x=1$, thresholds are set to zero (constant), the step size is 1, and the gradient with respect to the output is $\delta = -1$. (c): Updated network for the case where the input is $x=(1,1)$, thresholds are set to zero (constant), the step size is 1, and the gradient with respect to the output is $\delta=(-1,-1)$.
  • Figure 3: Learning curves using different optimization methods for 4 datasets without dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors, where the values are reported at different epochs during the course of optimization. We tried both balanced and unbalanced initializations. In the balanced initialization, incoming weights to each unit $v$ are initialized to i.i.d. samples from a Gaussian distribution with standard deviation $1/\sqrt{\text{fan-in}(v)}$. In the unbalanced setting, we first initialized the weights to be the same as the balanced weights. We then picked 2000 hidden units randomly with replacement. For each unit, we multiplied its incoming edge and divided its outgoing edge by $10c$, where $c$ was chosen randomly from a log-normal distribution. Although we proved that Path-SGD updates are the same for balanced and unbalanced initializations, to verify that despite numerical issues they are indeed identical, we trained Path-SGD with both balanced and unbalanced initializations. Since the curves were exactly the same, we only show a single curve. Best viewed in color.
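
The balanced and unbalanced initializations described in the Figure 3 caption can be sketched at toy scale as follows. This is an illustration under stated assumptions, not the authors' code: the network is a small ReLU network without thresholds, the scale factor is read literally from the caption as $10c$, the log-normal parameters (0, 1) are assumed because the caption does not give them, and "incoming edge" and "outgoing edge" are taken to mean all incoming and all outgoing weights of the chosen unit. The helper names are hypothetical. The sketch checks that the unbalanced network computes the same function as the balanced one, which is why the two initializations differ only in how an optimizer behaves on them.

```python
import math
import random


def relu_forward(layers, x):
    """Forward pass with ReLU on hidden layers and a linear output layer."""
    h = list(x)
    for k, W in enumerate(layers):
        h = [sum(w * a for w, a in zip(row, h)) for row in W]
        if k < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h


def balanced_init(sizes, rng):
    """Gaussian weights with standard deviation 1/sqrt(fan-in) for each unit."""
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(fan_in)) for _ in range(fan_in)]
         for _ in range(fan_out)]
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]


def unbalance(layers, n_picks, rng):
    """Pick hidden units with replacement; scale incoming weights by s and
    outgoing weights by 1/s, with s = 10 * c and c log-normal (assumed params)."""
    new = [[list(row) for row in W] for W in layers]
    hidden = [(k, j) for k in range(len(new) - 1) for j in range(len(new[k]))]
    for _ in range(n_picks):
        k, j = rng.choice(hidden)
        s = 10.0 * rng.lognormvariate(0.0, 1.0)
        new[k][j] = [w * s for w in new[k][j]]
        for row in new[k + 1]:
            row[j] /= s
    return new


rng = random.Random(0)
sizes = [4, 5, 3, 2]  # toy sizes; the experiments in the caption are far larger
balanced = balanced_init(sizes, rng)
unbalanced = unbalance(balanced, n_picks=5, rng=rng)

x = [0.3, -1.2, 0.7, 2.0]
out_balanced = relu_forward(balanced, x)
out_unbalanced = relu_forward(unbalanced, x)
print(out_balanced)
print(out_unbalanced)
```

The outputs agree up to floating-point rounding because ReLU is positively homogeneous: scaling a unit's pre-activation by $s > 0$ scales its activation by $s$, and dividing the outgoing weights by $s$ cancels it.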

Theorems & Definitions (2)

  • Lemma 4.1: neyshabur2015norm
  • Theorem 5.1