Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro
TL;DR
This paper argues that the standard Euclidean geometry underlying SGD is ill-suited to ReLU-based deep networks, because the output of such networks is invariant to certain rescalings of the weights while the Euclidean geometry is not. It introduces Path-SGD, an approximate steepest-descent method with respect to a path-regularizer φ_p(w) that is invariant to these rescalings. φ_p(w) can be computed efficiently via dynamic programming, and the resulting update is a per-edge, coordinate-wise rule that preserves rescaling equivalence. Empirical results on multiple benchmarks show that Path-SGD outperforms SGD and AdaGrad, especially in unbalanced networks, and suggest that it offers implicit regularization benefits that improve generalization. The work is a proof of concept for alternative optimization geometries in deep learning and encourages exploration of other invariant formulations and larger-scale applications.
Abstract
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.
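The rescaling invariance and the path-wise regularizer can be illustrated with a small sketch. It assumes a layered, bias-free ReLU network with a linear output layer, and it takes the path-regularizer to be φ_p(w) = (sum over input-to-output paths of the product of |w_e|^p along the path)^(1/p). The helper names (`relu_forward`, `path_norm`, `rescale_unit`, `l2_norm`) are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows, shape out x in) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu_forward(weights, x):
    """Forward pass: ReLU on hidden layers, linear output layer, no biases."""
    h = x
    for W in weights[:-1]:
        h = [max(z, 0.0) for z in matvec(W, h)]
    return matvec(weights[-1], h)


def path_norm(weights, p=2):
    """Path-regularizer computed by one forward sweep (dynamic programming).

    acc[v] holds the sum over all paths from any input to unit v of the
    product of |w|^p along the path, so no path is enumerated explicitly.
    """
    acc = [1.0] * len(weights[0][0])  # one entry per input unit
    for W in weights:
        acc = matvec([[abs(w) ** p for w in row] for row in W], acc)
    return sum(acc) ** (1.0 / p)


def rescale_unit(weights, layer, unit, c):
    """Scale a hidden unit's incoming weights by c > 0 and outgoing by 1/c."""
    new = [[row[:] for row in W] for W in weights]
    new[layer][unit] = [w * c for w in new[layer][unit]]
    for row in new[layer + 1]:
        row[unit] /= c
    return new


def l2_norm(weights):
    """Euclidean norm of all weights, for comparison."""
    return math.sqrt(sum(w * w for W in weights for row in W for w in row))


def random_matrix(n_out, n_in):
    return [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


random.seed(0)
weights = [random_matrix(5, 4), random_matrix(6, 5), random_matrix(3, 6)]
x = [random.gauss(0.0, 1.0) for _ in range(4)]
rescaled = rescale_unit(weights, layer=0, unit=2, c=10.0)

same_output = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for a, b in zip(relu_forward(weights, x), relu_forward(rescaled, x))
)
same_path_norm = math.isclose(path_norm(weights), path_norm(rescaled), rel_tol=1e-9)
print("output unchanged:", same_output)
print("path norm unchanged:", same_path_norm)
print("Euclidean norm before/after:", l2_norm(weights), l2_norm(rescaled))
```

Under these assumptions, the network output and the path norm should agree before and after the rescaling (up to floating-point error), while the Euclidean norm of the weights changes. This is the mismatch that motivates optimizing in a rescaling-invariant geometry.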
