On the energy landscape of deep networks

Pratik Chaudhari, Stefano Soatto

TL;DR

The paper reframes deep-network training through the lens of spin-glass energy landscapes and introduces AnnealSGD, a gradient perturbation scheme whose strength is annealed during training. By linking sparse random deep networks to a $p$-spin Hamiltonian under a spherical constraint, it derives how the number of critical points and local minima scales under an external magnetic field: modulating the field with a single scalar parameter moves the landscape from exponentially many local minima, to polynomially many, to a trivial landscape with a unique minimum. AnnealSGD starts in the relaxed polynomial regime and gradually steers the loss back towards the original exponential regime. The proposed annealing strategy, grounded in this theory, accelerates training and improves generalization on fully-connected and convolutional networks, with empirical evidence on MNIST and CIFAR-10. The work also clarifies the role of gradient noise and provides practical guidance for implementing AnnealSGD in modern architectures.

Abstract

We introduce "AnnealSGD", a regularized stochastic gradient descent algorithm motivated by an analysis of the energy landscape of a particular class of deep networks with sparse random weights. The loss function of such networks can be approximated by the Hamiltonian of a spherical spin glass with Gaussian coupling. While different from currently-popular architectures such as convolutional ones, spin glasses are amenable to analysis, which provides insights on the topology of the loss function and motivates algorithms to minimize it. Specifically, we show that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima, all the way down to a trivial landscape with a unique minimum. AnnealSGD starts training in the relaxed polynomial regime and gradually tightens the regularization parameter to steer the energy towards the original exponential regime. Even for convolutional neural networks, which are quite unlike sparse random networks, we empirically show that AnnealSGD improves the generalization error using competitive baselines on MNIST and CIFAR-10.
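
As an illustration of the annealing idea described in the abstract, the following minimal Python sketch adds a linear perturbation $\gamma_t \, h^\top w$ to a loss and shrinks $\gamma_t$ over time. It is not the authors' implementation: the fixed Gaussian vector `h`, the exponential schedule `gamma0 * decay**t`, the function name `anneal_sgd`, and the toy quadratic loss are all assumptions made for this example.

```python
import numpy as np


def anneal_sgd(grad_fn, w0, lr=0.05, gamma0=1.0, decay=0.99, steps=500, seed=0):
    """Gradient descent on f(w) + gamma_t * <h, w>, with gamma_t annealed to zero.

    grad_fn(w) returns a (possibly stochastic) gradient of the original loss f.
    ASSUMPTIONS of this sketch: h is a Gaussian vector sampled once and kept
    fixed, and gamma_t = gamma0 * decay**t is a hypothetical schedule.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    h = rng.standard_normal(w.shape)      # fixed "magnetic field" direction
    gammas = []
    for t in range(steps):
        gamma = gamma0 * decay ** t       # field strength shrinks every step
        g = grad_fn(w) + gamma * h        # gradient of the perturbed loss
        w -= lr * g
        gammas.append(gamma)
    return w, h, np.array(gammas)


# Toy usage: a quadratic bowl centred at 1 stands in for a real training loss.
w_final, h, gammas = anneal_sgd(lambda w: w - 1.0, w0=np.zeros(5))
```

Early in the run the perturbation dominates and pulls the iterate along `-h`; as `gamma` decays, the update reduces to plain gradient descent on the original loss. In a real network, `grad_fn` would be replaced by a mini-batch gradient and the schedule would be tuned.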

Paper Structure

This paper contains 23 sections, 9 theorems, 59 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

If the true label $Y^t \sim \mathrm{Ber}(q)$ for some $q < 1$, the zero-one loss $\mathbb{E}_X |\widehat{Y} - Y^t|$ has the same distribution, up to an additive constant, as the Hamiltonian $H(\sigma)$ of a spherical $p$-spin glass. Here the couplings $J_{i_1, \ldots, i_p}$ are zero-mean, standard Gaussian random variables, $J \in \mathbb{R}$ is a constant, and $\sigma \in S^{n-1}(\sqrt{n})$.
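
To make the object in Theorem 1 concrete, the sketch below samples a spherical $p$-spin Hamiltonian for $p = 3$ and evaluates it at a random point on the sphere of radius $\sqrt{n}$. It assumes the normalization commonly used in the spin-glass literature, $H(\sigma) = n^{-(p-1)/2} \sum_{i_1, \ldots, i_p} J_{i_1, \ldots, i_p} \sigma_{i_1} \cdots \sigma_{i_p}$, and sets the constant $J$ to 1 with no overall sign; the paper's exact expression may differ in these respects. The function names are hypothetical.

```python
import numpy as np


def sample_couplings(n, seed=0):
    """Draw i.i.d. standard Gaussian couplings J[i, j, k] for p = 3."""
    return np.random.default_rng(seed).standard_normal((n, n, n))


def hamiltonian(J, sigma):
    """Assumed form: H = n^{-(p-1)/2} * sum_{ijk} J[i,j,k] s_i s_j s_k with p = 3."""
    n = sigma.shape[0]
    # For p = 3 the prefactor n^{-(p-1)/2} equals 1 / n.
    return np.einsum("ijk,i,j,k->", J, sigma, sigma, sigma) / n


def random_spin(n, rng):
    """Uniform point on the sphere of radius sqrt(n)."""
    x = rng.standard_normal(n)
    return np.sqrt(n) * x / np.linalg.norm(x)


n = 30
rng = np.random.default_rng(1)
J = sample_couplings(n)
sigma = random_spin(n, rng)
energy_per_spin = hamiltonian(J, sigma) / n   # the normalized quantity H / n
```

With this normalization, $H(\sigma)$ at a fixed $\sigma$ on the sphere is Gaussian with variance $n$, so `energy_per_spin` at a random point is typically of order $1/\sqrt{n}$; the low-energy critical points discussed in the figures below lie far in the tail of that distribution.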

Figures (4)

  • Figure 1: The number of critical points of index $k$ with normalized energy in $(-\infty, u]$ scales as $e^{n \Theta_k(u)}$. The black curve denotes $\Theta_0(u)$, the complexity of local minima. Below $\lim_{n\to \infty} \inf H(\sigma)/n = -E_0$, which is where $\Theta_0(u)$ intersects the $x$-axis, there are no local minima in the logarithmic scaling [auffinger2013random]. Similarly, below the point where $\Theta_k(u)$ intersects the $x$-axis, there are no critical points of index higher than $k$. The Hamiltonian thus shows a layered landscape, with higher-order saddle points to the right and local minima concentrated near $-E_0$ on the far left [subag2015extremal].
  • Figure 2: Two-dimensional t-SNE [van2008visualizing] embedding of $20{,}000$ local minima discovered by gradient descent for the three regimes: exponential (Fig. \ref{fig:p3t0}), polynomial (Fig. \ref{fig:p3t1}) and trivial (Fig. \ref{fig:p3t2}). The background is colored by the kernel density estimate of the value of the normalized Hamiltonian $H/n$. The numerical values of the normalized Hamiltonian in Fig. \ref{fig:p3t0} are the non-asymptotic versions ($n=100$) of the values in Fig. \ref{fig:energy_barriers}.
  • Figure 3: Fig. \ref{fig:mnistfc_test} shows $\mathrm{mnistfc}$ trained with ADAM (blue) vs. AnnealSGD (red). Fig. \ref{fig:mnistfc_grad} shows the minimum absolute value of the back-propagated gradient during training for ADAM (blue) vs. AnnealSGD (red). Fig. \ref{fig:mnistconv} shows the validation error for LeNet trained using ADAM (blue) vs. AnnealSGD (red).
  • Figure 4: Alignment of the weights with the perturbation term, $\left| h^\top \sigma \right|$, for All-CNN-BN trained with AnnealSGD (red) vs. resampled, annealed additive noise (green) (cf. Table \ref{tab:cifar10} for error).

Theorems & Definitions (13)

  • Theorem 1
  • proof
  • Theorem 2 [fyodorov2013high]
  • Theorem 3 [fyodorov2013high]
  • Theorem 4 [fyodorov2013high]
  • Theorem 5
  • Lemma A-1 [auffinger2013random, Lem. 3.2]
  • Lemma A-2
  • proof
  • Remark A-3
  • ...and 3 more