Loss Landscape Characterization of Neural Networks without Over-Parametrization

Rustem Islamov, Niccolò Ajroldi, Antonio Orvieto, Aurelien Lucchi

TL;DR

This work proposes a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points, and proves that gradient-based optimizers possess theoretical guarantees of convergence under this assumption.

Abstract

Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Lojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.

Paper Structure

This paper contains 54 sections, 16 theorems, 121 equations, 19 figures, 3 tables, 4 algorithms.

Key Result

Theorem 1

Assume that Assumptions \ref{asmp:abc}-\ref{asmp:inter_error} hold. Then the iterates of SGD (Algorithm \ref{alg:sgd}) with stepsize $\gamma \le \frac{\alpha-\beta}{2L}$ satisfy

Figures (19)

  • Figure 1: Training of a $3$-layer LSTM model, showing that the Aiming condition does not always hold. The term "Angle" in the figures refers to the angle $\angle(\nabla f(x^k), x^k-x^K)$, which should be positive if Aiming holds; in panels a-b it is negative during the first part of training. Panels c-d show that any admissible constant $\mu$ in the PL condition must be small, which makes the theoretical convergence rate slow.
  • Figure 2: Training on the half-space learning problem with SGD. The term "Angle" in the figures refers to the angle $\angle(\nabla f(x^k), x^k-x^K)$.
  • Figure 3: Loss landscape of $f$ that satisfies \ref{asmp:abc}. The analytical form of $f_i$ is given in \ref{sec:theory_examples}. These examples demonstrate that a problem \ref{eq:problem} satisfying the $\alpha$-$\beta$-condition may have an unbounded set of minimizers $\mathcal{S}$ (\ref{ex:example_1}), a saddle point (\ref{ex:example_2}), and local minima (\ref{ex:example_10}), in contrast to the PL and Aiming conditions.
  • Figure 4: $\alpha$-$\beta$-condition in the training of a 3-layer MLP model on the Fashion-MNIST dataset, varying the size of the second layer. Here $T(x^k) = \langle \nabla f_{i_k}(x^k),x^k-x^K\rangle - \alpha(f_{i_k}(x^k) - f_{i_k}(x^K)) - \beta f_{i_k}(x^k)$, assuming that $f_i^*=0$. The minimum is taken across all runs and iterations for a given pair of $(\alpha, \beta)$.
  • Figure 5: $\alpha$-$\beta$-condition in the training of a CNN model on the CIFAR10 dataset, varying the number of convolutions in the second layer. Here $T(x^k) = \langle \nabla f_{i_k}(x^k),x^k-x^K\rangle - \alpha(f_{i_k}(x^k) - f_{i_k}(x^K)) - \beta f_{i_k}(x^k)$, assuming that $f_i^*=0$. The minimum is taken across all runs and iterations for a given pair of $(\alpha, \beta)$.
  • ...and 14 more figures
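
The quantity $T$ in the captions of Figures 4 and 5 can be evaluated along any recorded SGD trajectory. The following is a minimal sketch of that computation, not the authors' experimental code. It assumes a toy interpolating least-squares problem in place of the MLP and CNN models, and the names `alpha_beta_gap` and `min_gap_along_sgd`, as well as all default parameter values, are hypothetical.

```python
import numpy as np


def alpha_beta_gap(grad, x_k, x_final, loss_k, loss_final, alpha, beta):
    """T = <grad, x_k - x_final> - alpha * (loss_k - loss_final) - beta * loss_k.

    Follows the caption formula under its stated assumption f_i^* = 0.
    """
    inner = float(np.dot(grad, x_k - x_final))
    return inner - alpha * (loss_k - loss_final) - beta * loss_k


def min_gap_along_sgd(alpha, beta, n=50, d=5, steps=200, gamma=0.02, seed=0):
    """Run SGD on a toy problem and return the minimum of T over the iterates.

    Toy setting (an assumption, not from the paper):
    f_i(x) = 0.5 * (a_i^T x - b_i)^2 with b = A x_star, so every f_i^* = 0.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d)

    def loss(i, x):
        return 0.5 * (A[i] @ x - b[i]) ** 2

    def grad(i, x):
        return (A[i] @ x - b[i]) * A[i]

    x = np.zeros(d)
    history = []  # pairs (i_k, x^k) visited by SGD
    for _ in range(steps):
        i = int(rng.integers(n))
        history.append((i, x.copy()))
        x = x - gamma * grad(i, x)
    x_final = x  # plays the role of x^K in the captions

    return min(
        alpha_beta_gap(grad(i, xk), xk, x_final,
                       loss(i, xk), loss(i, x_final), alpha, beta)
        for i, xk in history
    )


if __name__ == "__main__":
    for alpha, beta in [(1.0, 0.0), (1.0, 0.5), (2.0, 1.0)]:
        print(alpha, beta, min_gap_along_sgd(alpha, beta))
```

A non-negative minimum for a pair $(\alpha, \beta)$ indicates that the inequality behind $T$ held at every sampled iterate of that run. In this convex toy problem the pair $(1, 0)$ always gives a non-negative value by convexity of each $f_i$; for neural networks no such guarantee exists, which is why the figures report the minimum empirically.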

Theorems & Definitions (48)

  • Definition 1: $\alpha$-$\beta$-condition
  • Example 1
  • Example 2
  • Remark 1
  • Example 3
  • Example 4
  • Example 5
  • Remark 2
  • Theorem 1
  • Theorem 2
  • ...and 38 more