On the energy landscape of deep networks
Pratik Chaudhari, Stefano Soatto
TL;DR
The paper reframes deep-network training through the lens of spin-glass energy landscapes and introduces AnnealSGD, a gradient perturbation scheme whose strength is annealed during training. By linking random sparse deep networks to a $p$-spin Hamiltonian under a spherical constraint, it derives how the number of critical points and local minima scales under an external magnetic field: as the field grows, the landscape passes from exponentially many minima, to polynomially many, to a trivial landscape with a unique minimum. AnnealSGD starts in the relaxed polynomial regime and anneals the perturbation so that training ends in the original exponential regime. The authors report that this strategy accelerates training and improves generalization on fully-connected and convolutional networks, with empirical evidence on MNIST and CIFAR-10. The work also clarifies the role of gradient noise and offers practical guidance for implementing AnnealSGD in modern architectures.
Abstract
We introduce "AnnealSGD", a regularized stochastic gradient descent algorithm motivated by an analysis of the energy landscape of a particular class of deep networks with sparse random weights. The loss function of such networks can be approximated by the Hamiltonian of a spherical spin glass with Gaussian coupling. While different from currently popular architectures such as convolutional ones, spin glasses are amenable to analysis, which provides insights on the topology of the loss function and motivates algorithms to minimize it. Specifically, we show that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima, all the way down to a trivial landscape with a unique minimum. AnnealSGD starts training in the relaxed polynomial regime and gradually tightens the regularization parameter to steer the energy towards the original exponential regime. Even for convolutional neural networks, which are quite unlike sparse random networks, we empirically show that AnnealSGD improves the generalization error of competitive baselines on MNIST and CIFAR-10.
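The annealing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`annealed_field_descent`, `double_well_grad`), the toy double-well loss, the exponential decay schedule, and all hyperparameter values are assumptions chosen for illustration. The sketch adds a fixed random "field" vector `h`, scaled by a single scalar `nu`, to the gradient, and shrinks `nu` toward zero so that late iterations follow the gradient of the original loss. It uses full gradients rather than mini-batches to keep the example short.

```python
import math
import random


def double_well_grad(x):
    """Gradient of the toy loss f(x) = sum_i (x_i^2 - 1)^2 (an assumed example loss)."""
    return [4.0 * xi * (xi * xi - 1.0) for xi in x]


def annealed_field_descent(grad_fn, x0, h, lr=0.05, nu0=1.0, decay=0.01, steps=1000):
    """Gradient descent on f(x) + nu_k * <h, x> with nu_k annealed toward 0.

    h     : fixed perturbation vector (plays the role of the magnetic field)
    nu0   : initial field strength (the single scalar being modulated)
    decay : rate of the assumed exponential schedule nu_k = nu0 * exp(-decay * k)
    """
    x = list(x0)
    for k in range(steps):
        nu = nu0 * math.exp(-decay * k)  # field strength at step k
        g = grad_fn(x)
        # The gradient of the tilted loss is grad f(x) + nu * h.
        x = [xi - lr * (gi + nu * hi) for xi, gi, hi in zip(x, g, h)]
    return x


random.seed(0)
h = [random.gauss(0.0, 1.0) for _ in range(5)]  # Gaussian field, drawn once
x_final = annealed_field_descent(double_well_grad, [0.0] * 5, h)
print(x_final)
```

Starting from the saddle at the origin, the field initially decides which well each coordinate falls into; once `nu` has decayed, the iterate settles near a minimum of the unperturbed loss. In a real network, `grad_fn` would be a mini-batch gradient and the schedule would need tuning.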
