Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro

TL;DR

This paper argues that the standard Euclidean geometry used in SGD is ill-suited to ReLU-based deep networks because of their rescaling invariances. It introduces Path-SGD, an approximate steepest-descent method with respect to a path-regularizer φ_p(w) that is invariant to rescaling; φ_p(w) can be computed efficiently via dynamic programming, and the resulting updates are per-edge, with a coordinate-wise rule that preserves rescaling equivalence. Empirical results on four benchmark datasets show Path-SGD outperforming SGD and AdaGrad, especially in unbalanced networks, and suggest that Path-SGD offers implicit regularization benefits that improve generalization. The work presents a proof of concept for alternative optimization geometries in deep learning and encourages exploration of other invariant formulations and larger-scale applications.

Abstract

We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
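
The rescaling invariance that motivates the paper can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes a bias-free feed-forward ReLU network stored as a list of NumPy weight matrices of shape (outputs, inputs), and the helper names `forward` and `rescale_unit` are hypothetical. Scaling the incoming weights of one hidden unit by c > 0 and its outgoing weights by 1/c leaves the network output unchanged, while the squared Euclidean norm of the weights (the quantity tied to the geometry of plain SGD) changes.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Bias-free ReLU network (assumed architecture); the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

def rescale_unit(weights, layer, unit, c):
    """Scale the incoming weights of one hidden unit by c > 0 and its
    outgoing weights by 1/c. Returns new matrices; the input is untouched."""
    out = [W.copy() for W in weights]
    out[layer][unit, :] *= c        # row = incoming weights of the unit
    out[layer + 1][:, unit] /= c    # column = outgoing weights of the unit
    return out

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((5, 3)),   # input (3) -> hidden (5)
    rng.standard_normal((4, 5)),   # hidden (5) -> hidden (4)
    rng.standard_normal((2, 4)),   # hidden (4) -> output (2)
]
x = rng.standard_normal(3)

rescaled = rescale_unit(weights, layer=0, unit=2, c=10.0)

# Same function, different point in weight space.
same_output = np.allclose(forward(weights, x), forward(rescaled, x))
l2_before = sum(np.sum(W ** 2) for W in weights)
l2_after = sum(np.sum(W ** 2) for W in rescaled)

print("outputs match:", same_output)
print("squared L2 norm before/after:", l2_before, l2_after)
```

The invariance relies on ReLU being positively homogeneous, relu(c z) = c relu(z) for c > 0, so the sketch does not carry over to a negative scaling factor.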

Paper Structure

This paper contains 8 sections, 2 theorems, 12 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Lemma 3.1

$\phi_p(w) = \min_{\tilde{w} \sim w} (\mu_{p,\infty}(\tilde{w}))^d$
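
The summary states that the path-regularizer can be computed by dynamic programming and is invariant to rescaling. The sketch below illustrates both points under a stated assumption: $\phi_p(w)$ is taken to be the $\ell_p$ norm of the vector of path products, that is, the sum over all input-to-output paths of the product of $|w_e|^p$ along the path, raised to the power $1/p$. That definition is not spelled out in this summary, and the function names `path_norm` and `path_norm_bruteforce` are hypothetical. For a layered network the sum over paths reduces to one forward pass over the matrices $|W|^p$ with an all-ones input.

```python
import numpy as np
from itertools import product

def path_norm(weights, p=2):
    """Assumed phi_p(w): one forward pass over |W|**p with an all-ones input.
    Each matrix has shape (outputs, inputs)."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (np.abs(W) ** p) @ v   # v[j] = sum over paths ending at unit j
    return v.sum() ** (1.0 / p)

def path_norm_bruteforce(weights, p=2):
    """Enumerate every input-to-output path explicitly.
    Exponential in depth; included only to check the dynamic program."""
    sizes = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    total = 0.0
    for path in product(*[range(n) for n in sizes]):
        prod = 1.0
        for k, W in enumerate(weights):
            prod *= abs(W[path[k + 1], path[k]]) ** p
        total += prod
    return total ** (1.0 / p)

rng = np.random.default_rng(1)
weights = [
    rng.standard_normal((4, 3)),
    rng.standard_normal((3, 4)),
    rng.standard_normal((2, 3)),
]

# Rescale hidden unit 1 of the first hidden layer: incoming * 7, outgoing / 7.
rescaled = [W.copy() for W in weights]
rescaled[0][1, :] *= 7.0
rescaled[1][:, 1] /= 7.0

dp_value = path_norm(weights)
brute_value = path_norm_bruteforce(weights)
rescaled_value = path_norm(rescaled)

print("dynamic program :", dp_value)
print("brute force     :", brute_value)
print("after rescaling :", rescaled_value)
```

Every path through the rescaled unit picks up one factor of 7 and one factor of 1/7, so each path product, and therefore the assumed $\phi_p$, is unchanged; this matches the right-hand side of Lemma 3.1, which minimizes over rescaling-equivalent weights.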

Figures (5)

  • Figure 1: (a): Evolution of the cross-entropy error function when training a feed-forward network on MNIST with two hidden layers, each containing 4000 hidden units. The unbalanced initialization (blue curve) is generated by applying a sequence of rescaling functions on the balanced initializations (red curve). (b): Updates for a simple case where the input is $x=1$, thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to output is $\delta = -1$. (c): Updated network for the case where the input is $x=(1,1)$, thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to output is $\delta=(-1,-1)$.
  • Figure 2: Learning curves using different optimization methods for 4 datasets without dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors, where the values are reported on different epochs during the course of optimization. Best viewed in color.
  • Figure 3: Learning curves using different optimization methods for 4 datasets with dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors. Best viewed in color.
  • Figure 4: Learning curves over a larger number of epochs using different optimization methods for 4 datasets without dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors, where the values are reported on different epochs during the course of optimization. Best viewed in color.
  • Figure 5: Learning curves over a larger number of epochs using different optimization methods for 4 datasets with dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors. Best viewed in color.

Theorems & Definitions (3)

  • Lemma 3.1: neyshabur15
  • Theorem 4.1
  • proof