Neural network optimization strategies and the topography of the loss landscape

Jianneng Yu, Alexandre V. Morozov

TL;DR

SGD explores smooth basins of attraction, whereas quasi-Newton optimization can find deeper, more isolated minima that are more spread out in parameter space. This contrast helps explain both the topography of loss landscapes and the fundamental role of landscape exploration strategies in creating robust, transferable neural network models.

Abstract

Neural networks are trained by optimizing multi-dimensional sets of fitting parameters on non-convex loss landscapes. Low-loss regions of the landscapes correspond to the parameter sets that perform well on the training data. A key issue in machine learning is the performance of trained neural networks on previously unseen test data. Here, we investigate neural network training by stochastic gradient descent (SGD), a non-convex global optimization algorithm that relies only on the gradient of the objective function. We contrast SGD solutions with those obtained via a non-stochastic quasi-Newton method, which uses curvature information to determine the step direction and Golden Section Search to choose the step size. We use several computational tools to investigate neural network parameters obtained by these two optimization methods, including kernel Principal Component Analysis and FourierPathFinder, a novel, general-purpose algorithm for finding low-height paths between pairs of points on loss or energy landscapes. We find that the choice of the optimizer profoundly affects the nature of the resulting solutions. SGD solutions tend to be separated by lower barriers than quasi-Newton solutions, even if both sets of solutions are regularized by early stopping to ensure adequate performance on test data. When allowed to fit extensively on the training data, quasi-Newton solutions occupy deeper minima on the loss landscapes that are not reached by SGD. These solutions, however, generalize less well to the test data. Overall, SGD explores smooth basins of attraction, while quasi-Newton optimization is capable of finding deeper, more isolated minima that are more spread out in the parameter space. Our findings help explain both the topography of the loss landscapes and the fundamental role of landscape exploration strategies in creating robust, transferable neural network models.
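The abstract mentions Golden Section Search as the rule for choosing the step size once a search direction is fixed. The following is a minimal sketch of that one-dimensional search, not the authors' implementation. The function name `golden_section_search`, the bracket `[0, 1]`, the tolerance, and the toy quadratic loss are all assumptions made for illustration, and the steepest-descent direction stands in for the quasi-Newton direction used in the paper.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # inverse golden ratio, about 0.618


def golden_section_search(phi, lo, hi, tol=1e-6, max_iter=200):
    """Approximate the minimizer of a unimodal 1D function phi on [lo, hi]."""
    # Two interior probe points placed at golden-ratio positions.
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = phi(c), phi(d)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = phi(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = phi(d)
    return 0.5 * (lo + hi)


# Hypothetical toy loss (not from the paper): f(w) = w0^2 + 10 * w1^2.
def toy_loss(w):
    return w[0] ** 2 + 10.0 * w[1] ** 2


w = (1.0, 1.0)
direction = (-2.0 * w[0], -20.0 * w[1])  # negative gradient as a stand-in direction


def loss_along_direction(alpha):
    """Loss restricted to the ray w + alpha * direction."""
    return toy_loss((w[0] + alpha * direction[0], w[1] + alpha * direction[1]))


step = golden_section_search(loss_along_direction, 0.0, 1.0)
print(f"step size chosen by golden section search: {step:.5f}")
```

Each iteration reuses one of the two previous function evaluations, so the bracket shrinks by a constant factor of about 0.618 per new loss evaluation. The search assumes the loss is unimodal along the ray within the chosen bracket.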

Paper Structure

This paper contains 12 sections, 5 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: Generation and visualization of optimized NN parameter sets. (a) A subset of input training/test data. A single $28 \times 28$ MNIST image \cite{6296535} is used as NN input. (b) Representative NN architecture, with an input layer, two hidden layers, and an output layer. (c) A conceptual sketch of the corresponding NN loss landscape, with two local minima (red dots) located in a shallow valley. The basins of attraction of the two minima are separated by a relatively low barrier.
  • Figure 2: LSTM loss curves. Representative LSTM training (a) and test (b) loss curves ($\langle l(x^\text{train},\omega) \rangle$ and $\langle l(x^\text{test},\omega) \rangle$, respectively) as a function of the number of epochs. In both panels, dashed vertical lines mark the epochs where the loss curves of the same color reach their minima. The weight configurations $\omega^{\text{train}}_{\text{SGD}},\omega^{\text{test}}_{\text{SGD}}$ and $\omega^{\text{train}}_{\text{BFGS}},\omega^{\text{test}}_{\text{BFGS}}$ denote the sets of NN parameters found at these minima (optimized with SGD and L-BFGS-GSS, respectively).
  • Figure 3: Low-loss paths on a 2D landscape. (a) Two-dimensional loss landscape composed of two positive and two negative Gaussian peaks: $f(x, y) = -\sum_{i=1}^{2} \exp[-3|\mathbf{r} - \mathbf{c}_i|^2] + \sum_{j=1}^{2} \exp[-15|\mathbf{r} - \mathbf{d}_j|^2] + C$, where $\mathbf{r} = (x,y)$, $\mathbf{c}_1 = (-0.5, -0.5)$, $\mathbf{c}_2 = (0.5, 0.0)$, $\mathbf{d}_1 = (-0.2, -0.4)$, $\mathbf{d}_2 = (0.0, 0.3)$, and $C = 1.019$. Four representative paths connecting two landscape minima, $\mathbf{w}_1 = (-0.62, -0.54)$ and $\mathbf{w}_2 = (0.49, -0.02)$, are shown: a linear interpolation path (dashed blue line) and three FourierPathFinder optimized paths ($\lambda = 10$, solid blue curve; $\lambda = 100$, solid orange curve; $\lambda = 1000$, solid green curve). Dots indicate function values at discrete time steps $t_m \in [0,1]$ along the path: $f(x(t_m),y(t_m))$, $m = 1 \dots M$ ($M = 100$). (b) Loss values $f(x(t),y(t))$ as a function of the curve parameter $t$ along the four paths in panel (a): the linear interpolation path (dashed blue curve) and three FourierPathFinder paths (solid curves with the colors matching the paths in panel (a)). Path heights $H_i$ (Eq. \ref{H}) are labeled with black dots, with $H_0 = 1.102$ (straight line), $H_1 = 0.646$ (optimized path, $\lambda = 10$), $H_2 = 0.640$ (optimized path, $\lambda = 100$), and $H_3 = 0.652$ (optimized path, $\lambda = 1000$).
  • Figure 4: Distribution of barrier heights along optimized paths on the training landscape. Shown are distributions of the FourierPathFinder path heights (Eq. \ref{H}) for FCP (a), LeNet (b), Autoencoder (c), and LSTM (d). Histograms in each panel show heights of $300$ low-loss paths connecting randomly chosen pairs of optimized parameter vectors in $\{ \omega^{\text{train},i}_{\text{BFGS}} \}_{i=1}^{48}$ (navy blue), $\{ \omega^{\text{test},i}_{\text{BFGS}} \}_{i=1}^{48}$ (light blue), $\{ \omega^{\text{train},i}_{\text{SGD}} \}_{i=1}^{48}$ (light red), and $\{ \omega^{\text{test},i}_{\text{SGD}} \}_{i=1}^{48}$ (gold). The paths are computed on the training landscape, $\langle l(x^\text{train},\omega) \rangle$.
  • Figure 5: Distributions of SGD test and BFGS training weight vector lengths. Shown are the histograms of $L_2$ distances between individual weight vectors $\omega^i$ and the common origin $\bar{\omega}$, $|\omega^i-\bar{\omega}|$. Distributions of the BFGS training ($\{ \omega^{\text{train},i}_{\text{BFGS}} \}_{i=1}^{48}$) and SGD test ($\{ \omega^{\text{test},i}_{\text{SGD}} \}_{i=1}^{48}$) weight vector lengths are plotted in blue and light red, respectively, for FCP (a), LeNet (b), Autoencoder (c), and LSTM (d). Dotted vertical lines indicate the positions of $|\bar{\omega}_{\text{BFGS}} - \bar{\omega}| = |\bar{\omega}_{\text{SGD}} - \bar{\omega}|$.
  • ...and 5 more figures
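
The two-dimensional landscape in the Figure 3 caption is fully specified, so the straight-line baseline path can be reproduced in a few lines. The sketch below evaluates $f$ at $M = 100$ evenly spaced points between $\mathbf{w}_1$ and $\mathbf{w}_2$. It assumes that the path height is the maximum loss along the path, since the defining equation for $H$ is not reproduced in this excerpt. The names `landscape` and `linear_path_losses` are illustrative, and the FourierPathFinder optimization itself is not implemented here.

```python
import math

# Constants taken from the Figure 3 caption.
WELL_CENTERS = [(-0.5, -0.5), (0.5, 0.0)]  # c_1, c_2: negative Gaussians
BUMP_CENTERS = [(-0.2, -0.4), (0.0, 0.3)]  # d_1, d_2: positive Gaussians
OFFSET = 1.019                             # constant C


def landscape(x, y):
    """f(x, y) = -sum exp(-3|r - c_i|^2) + sum exp(-15|r - d_j|^2) + C."""
    wells = sum(math.exp(-3.0 * ((x - cx) ** 2 + (y - cy) ** 2))
                for cx, cy in WELL_CENTERS)
    bumps = sum(math.exp(-15.0 * ((x - dx) ** 2 + (y - dy) ** 2))
                for dx, dy in BUMP_CENTERS)
    return -wells + bumps + OFFSET


def linear_path_losses(start, end, num_points=100):
    """Loss values at evenly spaced points on the segment from start to end."""
    losses = []
    for m in range(num_points):
        t = m / (num_points - 1)  # curve parameter t in [0, 1]
        x = start[0] + t * (end[0] - start[0])
        y = start[1] + t * (end[1] - start[1])
        losses.append(landscape(x, y))
    return losses


w1 = (-0.62, -0.54)
w2 = (0.49, -0.02)
losses = linear_path_losses(w1, w2)

# Assumption: path height is the largest loss value met along the path.
height = max(losses)
print(f"straight-line path height: {height:.3f}")
```

Under this assumption the printed height can be compared with the value $H_0 = 1.102$ that the caption reports for the straight line. The segment passes close to the positive peak at $\mathbf{d}_1$, which is why the optimized paths, which curve around it, achieve lower heights.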