Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Ansh Nagwekar

TL;DR

The work investigates why neural network optimization remains challenging and how principled curvature-aware methods and modular norms can guide scalable training. It surveys classical methods, develops a unifying modular norm framework, and analyzes curvature matrices (Hessian, GGN, FIM) together with practical approximations such as KFAC, EKFAC, and Shampoo. It introduces μP and related training paradigms that enable hyperparameter transfer and stable scaling, and details practical engineering strategies for deploying curvature-aware methods at scale. The result is a principled, workflow-ready perspective that connects theory from optimization and information geometry to modern deep learning practice, offering concrete prescriptions and modular tools for efficient, scalable training of billion-parameter models.

Abstract

Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved insight into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in these over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with the anisotropy typical of real-world data. These breakdowns motivate the exploration of sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. Next, the interplay between these optimization algorithms and the broader neural network training toolkit, which includes prior and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, this thesis offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.

Paper Structure

This paper contains 138 sections, 3 theorems, 127 equations, 18 figures.

Key Result

Theorem 4.1

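For gradient matrices $G_1, \ldots, G_L$ and sharpness $\lambda > 0$, the problem $\arg\min_{\Delta W_1, \ldots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L \|\Delta W_l\|^2_{\ell_1 \to \ell_\infty} \right]$ is solved by layerwise sign descent: $\Delta W_l = -\eta \cdot \mathrm{sign}(G_l)$ for all $l$, with step size $\eta = \frac{1}{\lambda} \sum_{l=1}^L \|G_l\|^*_{\ell_1 \to \ell_\infty}$. Here $\langle \cdot, \cdot \rangle$ is the Frobenius inner product and $\|\cdot\|^*$ denotes the dual norm.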
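A minimal NumPy sketch may help make the update concrete. It assumes the objective being minimized is the linearized loss $\sum_l \langle G_l, \Delta W_l \rangle$ plus $\frac{\lambda}{2} \max_l \|\Delta W_l\|^2_{\ell_1 \to \ell_\infty}$, as in the cited anthology. It uses two standard facts: the $\ell_1 \to \ell_\infty$ operator norm of a matrix is its largest absolute entry, and its dual under the Frobenius inner product is the sum of absolute entries. The function names and the example gradients are hypothetical, chosen only for illustration.

```python
import numpy as np


def dual_norm_l1_to_linf(G):
    # Dual of the l1 -> l_inf operator norm (max absolute entry):
    # the sum of absolute entries of G.
    return np.abs(G).sum()


def layerwise_sign_descent(grads, sharpness):
    # Step size eta = (1 / lambda) * sum_l ||G_l||^*, shared by all layers.
    eta = sum(dual_norm_l1_to_linf(G) for G in grads) / sharpness
    # Each layer moves by -eta * sign(G_l).
    updates = [-eta * np.sign(G) for G in grads]
    return updates, eta


def objective(grads, updates, sharpness):
    # Assumed objective: linear term plus (lambda / 2) times the squared
    # maximum over layers of the l1 -> l_inf norm of the update.
    linear = sum(np.sum(G * dW) for G, dW in zip(grads, updates))
    max_norm = max(np.abs(dW).max() for dW in updates)
    return linear + 0.5 * sharpness * max_norm ** 2


# Hypothetical two-layer example.
grads = [np.array([[1.0, -2.0], [0.5, 3.0]]), np.array([[-4.0, 1.0]])]
sharpness = 2.0
updates, eta = layerwise_sign_descent(grads, sharpness)
best_value = objective(grads, updates, sharpness)
```

Because every entry of every layer moves by the same magnitude $\eta$, the update resembles Adam with its moving averages switched off, which is the reading the theorem's title suggests. Randomly perturbing `updates` and re-evaluating `objective` is a quick sanity check that no nearby point attains a lower value.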

Figures (18)

  • Figure 1: Figure 9.11 from boyd2004convex depicting steepest descent. The ellipses represent the boundaries of the "norm ball" produced under the quadratic norm.
  • Figure 2: Figure 9.19 from boyd2004convex. Here, the ellipses look curvature-aware: they seem to mold their shape based on the local geometry of the loss function.
  • Figure 3: Figure 2 from chaudhari2017entropy depicting how local entropy tends to concentrate around wider valleys and away from narrow regions, corresponding to the robustness of the solution.
  • Figure 4: Figure 2 from keskar2017improving showing the testing error on DenseNet architecture for CIFAR-10 with a varying switchover point (switching from Adam to SGD during training).
  • Figure 5: Figure 6 from agarwal2020ggt. Plots comparing adaptive training algorithms on CIFAR-10 and PTB language-level modeling. AdaGrad seems unstable, while GGT doesn't outperform SGD or Adam significantly on test accuracy or validation perplexity.
  • ...and 13 more figures

Theorems & Definitions (4)

  • Theorem 4.1: Adam as Steepest Descent bernstein2024normanthology
  • Theorem 4.2: Shampoo as Steepest Descent bernstein2024normanthology
  • Definition 4.1: Modular Norm bernstein2024normanthology
  • Theorem 4.3: Steepest Descent under the Modular Norm bernstein2024normanthology