Optimization for deep learning: theory and algorithms

Ruoyu Sun

TL;DR

This paper surveys optimization theory and algorithms for training deep neural networks, addressing why training succeeds or fails and how to design effective training procedures. It organizes the material around three aspects of convergence (making convergence possible, converging faster, and reaching better global solutions) and covers gradient-based methods (SGD and its variants), initialization and normalization techniques, architectural innovations such as ResNet, and large-scale distributed training. It also examines global optimization topics such as mode connectivity, lottery tickets, and infinite-width analysis via neural tangent kernels (NTK). Key contributions include linking practical training tricks to theoretical concepts (Lipschitz properties, dynamical isometry, NTK) and synthesizing results across these three aspects to guide both theory and practice. The discussion highlights the practical impact of initialization, normalization, and architecture on trainability and generalization, and identifies directions such as robustness and reinforcement learning where theory-driven algorithms may yield the most benefit.

Abstract

When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods, and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and infinite-width analysis.

Paper Structure

This paper contains 37 sections, 35 equations, 4 figures.

Figures (4)

  • Figure 1: A few major design choices, each with some theoretical understanding, for successfully training a neural network. They affect three aspects of algorithm convergence: making convergence possible, faster convergence, and better global solutions. The three aspects are somewhat related, and this is just a rough classification. Note that there are other important design choices, especially the neural architecture, that are not understood theoretically and are thus omitted from this figure. Other benefits, such as generalization, are also omitted.
  • Figure 2: Plot of the function $F(w) = (w^7-1)^2$, which illustrates the gradient explosion/vanishing issues. In the region $[-0.8, 0.8]$, the gradients almost vanish; in the regions $[1.2, \infty)$ and $(-\infty, -0.8]$, the gradients explode.
  • Figure 3: Illustration of wide minima and sharp minima.
  • Figure 4: Left figure: the flat region is not a set-wise strict local-min, and this region can be escaped by a (non-strictly) decreasing algorithm. Right figure: there is a basin that is a set-wise strict local-min.
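
The gradient behavior described in the Figure 2 caption can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names `F` and `grad_F` and the sample points are assumptions chosen for illustration. The derivative follows from the chain rule, $F'(w) = 14 w^6 (w^7 - 1)$.

```python
# Minimal sketch (illustrative only): gradient of F(w) = (w^7 - 1)^2
# from the Figure 2 caption. Names and sample points are assumptions.


def F(w: float) -> float:
    """The one-dimensional objective from Figure 2."""
    return (w**7 - 1.0) ** 2


def grad_F(w: float) -> float:
    """Analytic derivative via the chain rule: 2 (w^7 - 1) * 7 w^6."""
    return 14.0 * w**6 * (w**7 - 1.0)


# Points inside the near-flat region, at the minimizer w = 1,
# and in the two steep outer regions.
sample_points = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
gradients = {w: grad_F(w) for w in sample_points}

if __name__ == "__main__":
    for w, g in gradients.items():
        print(f"w = {w:+.1f}   F(w) = {F(w):.4e}   F'(w) = {g:+.4e}")
```

With these sample points, the gradient magnitude at $w = \pm 0.5$ is well below 1, while at $w = \pm 1.5$ it is in the thousands, a gap of roughly four orders of magnitude across a short interval. A fixed step size small enough to be stable in the steep regions would therefore make very slow progress in the flat one.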