Optimization for deep learning: theory and algorithms
Ruoyu Sun
TL;DR
This paper surveys optimization theory and algorithms for training deep neural networks, addressing why training succeeds or fails and how to design effective training procedures. It covers gradient-based methods (SGD and its variants), initialization and normalization techniques, architectural innovations such as ResNet, and large-scale distributed training, and it also examines global optimization topics such as mode connectivity, lottery tickets, and infinite-width analyses via neural tangent kernels (NTK). Its main contributions are to link practical training tricks to theoretical concepts (Lipschitz properties, dynamical isometry, NTK) and to synthesize results on convergence, convergence speed, and global solution quality to guide both theory and practice. The discussion highlights the practical impact of initialization, normalization, and architecture on trainability and generalization, and identifies directions such as robustness and reinforcement learning where theory-driven algorithms may yield the most benefit.
Abstract
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods, and distributed methods, together with theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, the lottery ticket hypothesis, and infinite-width analysis.
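As a rough illustration of the explosion/vanishing issue and of why careful initialization matters, the following sketch propagates a random input through a deep, randomly initialized ReLU network and records the activation norm after each layer. It is a minimal toy under stated assumptions, not a procedure taken from the article: the function name `signal_norms`, the width, the depth, and the scale factors are all hypothetical choices, and the baseline standard deviation sqrt(2 / width) is the commonly used He-style scaling for ReLU layers. The sketch tracks forward signals only; gradients in the backward pass behave analogously.

```python
import math
import random


def signal_norms(depth, width, weight_std, seed=0):
    """Propagate a random input through a deep ReLU network with i.i.d.
    Gaussian weights and return the activation norm after each layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norms = []
    for _ in range(depth):
        # Fresh square weight matrix per layer, entries ~ N(0, weight_std^2).
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        pre = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
        x = [max(0.0, p) for p in pre]  # ReLU activation
        norms.append(math.sqrt(sum(v * v for v in x)))
    return norms


# Hypothetical sizes chosen only to keep the example fast.
width, depth = 64, 30
he_std = math.sqrt(2.0 / width)  # assumed He-style scaling for ReLU

small = signal_norms(depth, width, 0.5 * he_std)  # weights too small
he = signal_norms(depth, width, he_std)           # balanced scaling
large = signal_norms(depth, width, 2.0 * he_std)  # weights too large

for name, norms in [("0.5x", small), ("1.0x", he), ("2.0x", large)]:
    print(f"{name} scale: layer 1 norm = {norms[0]:.3e}, "
          f"layer {depth} norm = {norms[-1]:.3e}")
```

With the balanced scale, the activation norm stays within a moderate range across all layers, whereas halving or doubling the weight scale makes it shrink or grow roughly geometrically with depth.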
