Theory II: Landscape of the Empirical Risk in Deep Learning

Qianli Liao, Tomaso Poggio

TL;DR

This paper investigates how overparameterization shapes the empirical risk landscape in deep convolutional networks. It blends theoretical analysis using polynomial approximations and Bezout's theorem with CIFAR-10 experiments to argue that the landscape comprises many degenerate zero-error minima organized into basin-like regions, rather than a proliferation of problematic local minima. A simple baseline model of basins (and a basin-fractal variant) is proposed to explain training dynamics, perturbation effects, and interpolation outcomes, while highlighting that overparameterization does not harm generalization. The work suggests the loss surface may be simpler than commonly believed, motivating further study across architectures and data regimes.

Abstract

Previous theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least in the case of the most successful Deep Convolutional Neural Networks (DCNNs), practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a degree of "overparametrization". In this work, we characterize, with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove in the regression framework the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on Bezout's theorem, is rigorous when the ReLUs are replaced by a polynomial nonlinearity (which empirically works as well). As described in our Theory III [2] paper, the same minimizers are degenerate and thus very likely to be found by SGD, which will furthermore select with higher probability the most robust zero-minimizer. We further experimentally explored and visualized the landscape of the empirical risk of a DCNN on CIFAR-10 during the entire training process, and especially around the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of a DCNN's empirical loss surface, which might not be as complicated as people commonly believe.
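To make the polynomial-nonlinearity idea concrete, here is a minimal sketch that fits a polynomial to ReLU by least squares and reports the worst-case gap. The interval [-1, 1], the degree, and the least-squares fitting procedure are assumptions chosen for illustration; the text above does not specify how the polynomial nonlinearity was obtained.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Assumed fitting interval and degree (illustrative, not from the paper).
xs = np.linspace(-1.0, 1.0, 401)
degree = 10

relu = np.maximum(xs, 0.0)

# Least-squares polynomial fit; coeffs[i] multiplies x**i.
coeffs = P.polyfit(xs, relu, degree)
poly = P.polyval(xs, coeffs)

# Worst-case deviation from ReLU on the fitting interval.
max_err = float(np.max(np.abs(poly - relu)))
print(f"degree {degree}: max |poly - relu| on [-1, 1] = {max_err:.4f}")
```

A network that uses such a polynomial in place of ReLU computes a polynomial function of its weights, which is the setting in which the Bezout-based argument is stated to be rigorous.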

Paper Structure

This paper contains 26 sections, 2 theorems, 8 equations, 45 figures.

Key Result

Corollary 1

In general, non-zero minima exist with higher dimensionality than the zero-error global minima: their dimensionality is the number of weights $K$ vs. the number of data points $N$. This is true in the linear case and also in the presence of ReLUs.
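The degeneracy of zero-error minimizers can be illustrated in the linear case with a small numerical sketch. It relies only on a standard linear-algebra fact: when $N$ consistent linear equations constrain $K > N$ weights, the zero-error solutions form an affine set of dimension $K - N$. The sizes and random data below are assumptions for illustration, and the sketch does not reproduce the paper's argument for ReLU or polynomial networks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 12  # assumed sizes: N data points, K weights, with K > N

X = rng.standard_normal((N, K))  # one row per data point
y = rng.standard_normal(N)       # regression targets

def empirical_risk(w):
    """Mean squared error of the linear model X @ w on the training data."""
    return float(np.mean((X @ w - y) ** 2))

# One zero-error minimizer: the minimum-norm solution of X w = y.
w0 = np.linalg.pinv(X) @ y

# Basis of the null space of X, taken from the right singular vectors.
_, s, Vt = np.linalg.svd(X)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:].T  # shape (K, K - rank)

# Moving along any null-space direction leaves the predictions unchanged,
# so w1 is a different minimizer with the same (zero) empirical risk.
w1 = w0 + null_basis @ rng.standard_normal(K - rank)

print("flat directions at the minimizer:", K - rank)
print("risk at w0:", empirical_risk(w0))
print("risk at w1:", empirical_risk(w1))
```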

Figures (45)

  • Figure 1: The landscape of the empirical risk of an overparametrized DCNN may be simply a collection of (perhaps slightly rugged) basins. (A) The profile view of a basin. (B) The top-down view of a basin. (C) Example landscape of empirical risk. (D) Example perturbation: a small perturbation does not move the model out of its current basin, so re-training converges back to the bottom of the same basin. If the perturbation is large, re-training converges to another basin. (E) Example interpolation: averaging two models within a basin tends to give an error that is the average of the two models' errors (or less). Averaging two models from different basins tends to give an error that is higher than that of both models. (F) Example optimization trajectories that correspond to Figure \ref{fig:branch_layer_2_all_perturb_0.25}. (G), (H) See Section \ref{sec:intuitive}.
  • Figure 2: One can convert a deep network into a polynomial function by using polynomial nonlinearity. As long as the nonlinearity approximates ReLU well (especially near 0), the "polynomial net" performs similarly to a ReLU net. Our theory applies rigorously to a "polynomial net".
  • Figure 3: We train a 6-layer (with the 1st layer being the input) convolutional network on CIFAR-10 with stochastic gradient descent (batch size = 100). We divide the training process into 12 stages. In each stage, we perform 8 parallel SGD runs with learning rate 0.01 for 10 epochs, resulting in 8 parallel trajectories denoted by different colors. Trajectories 1 to 4 in each stage start from the final model (denoted by $P$) of trajectory 1 of the previous stage. Trajectories 5 to 8 in each stage start from a perturbed version of $P$. The perturbation is performed by adding Gaussian noise to the weights of each layer, with standard deviation equal to 0.01 times that layer's standard deviation. In general, we observe that running any trajectory with SGD again almost always leads to a slightly different convergence path. We plot the MDS results of all the layer 2 weights collected throughout all the training epochs from stage 1 to 12. Each number in the figure represents a model we collected during the above procedures. The points lie in a 2D space generated by the MDS algorithm such that their pairwise distances are optimized to reflect the distances in the original high-dimensional space. The results for stages beyond 5 are quite cluttered, so we applied a separate MDS to stages 5 to 12. We also plot stages 1 and 5 separately as examples. The trajectories of more stages are plotted in the Appendix.
  • Figure 4: Visualizing the exact training loss surface using Batch Gradient Descent (BGD). A DCNN is trained on CIFAR-10 from scratch using BGD. The numbers are training errors. "NaN" corresponds to randomly initialized models (we did not evaluate them and assume they perform at chance). At epochs 0, 10, 50 and 200, we create a branch by perturbing the model, adding Gaussian noise to all layers. The standard deviation of the Gaussian is 0.25*S, where S denotes the standard deviation of the weights in the respective layer. We also interpolate (by averaging) the models between the branches and the main trajectory, epoch by epoch. The interpolated models are evaluated on the entire training set to measure their performance. First, surprisingly, BGD does not get stuck in any local minima, indicating some good properties of the landscape. The test error of solutions found by BGD is somewhat worse than that of solutions found by SGD, but not by much (BGD 40%, SGD 32%). Another interesting observation is that as training proceeds, the same amount of perturbation is less able to lead to a drastically different trajectory. Nevertheless, a perturbation almost always leads to at least a slightly different model. The local neighborhood of the main trajectory seems to be relatively flat and to contain many good solutions, supporting our theoretical predictions. It is also intriguing that the interpolated models have very reasonable performance. The results here are based on weights from layer 2. The results of other layers are similar and are shown in the Appendix.
  • Figure 5: Verifying the flatness of global minima: we layerwise perturb the weights of model $M_{final}$ (which is at a global minimum) by adding Gaussian noise with standard deviation = 0.1 * S, where S is the standard deviation of the weights in that layer. After perturbation, we continue training the model for 200 epochs of gradient descent (i.e., batch size = training set size). The same procedure was performed 4 times, resulting in the 4 curves shown in the figures. The training and test classification errors and losses are shown in A1, A2, B1 and B2. The MDS visualization of the 4 trajectories (denoted by 4 colors) is shown in C1: the 4 trajectories converge to different solutions. The MDS visualization of one trajectory is in C2. In addition, we show the confusion matrices of the converged models in the Appendix to verify that they are indeed different models. More similar experiments can be found in the Appendix.
  • ...and 40 more figures
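
The layerwise perturbation and model averaging described in the captions of Figures 3 to 5 can be sketched as follows. This is a minimal illustration, assuming a model is represented as a list of NumPy weight arrays; the function names `perturb_layerwise` and `interpolate`, the layer shapes, and the random stand-in weights are hypothetical and not taken from the paper's code.

```python
import numpy as np

def perturb_layerwise(weights, scale, rng):
    """Add Gaussian noise to each layer with std = scale * S,
    where S is the standard deviation of that layer's weights."""
    return [w + rng.normal(0.0, scale * w.std(), size=w.shape) for w in weights]

def interpolate(weights_a, weights_b, alpha=0.5):
    """Convex combination of two models, layer by layer.
    alpha = 0.5 gives the plain average used for interpolation."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(weights_a, weights_b)]

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained model: one conv layer, one dense layer.
model = [
    0.1 * rng.standard_normal((16, 3, 3, 3)),
    0.05 * rng.standard_normal((10, 16)),
]

# Branch created with noise of std 0.25 * S per layer (the Figure 4 setting).
branch = perturb_layerwise(model, 0.25, rng)

# Interpolated model halfway between the main trajectory and the branch.
midpoint = interpolate(model, branch)

for layer, (w, b) in enumerate(zip(model, branch)):
    ratio = (b - w).std() / w.std()
    print(f"layer {layer}: noise std / weight std = {ratio:.3f}")
```

In the experiments described above, the perturbed model would then be re-trained and the interpolated model evaluated on the training set; both steps depend on the network and data pipeline and are omitted here.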

Theorems & Definitions (2)

  • Corollary 1
  • Proposition 1