Preconditioning for Accelerated Gradient Descent Optimization and Regularization
Qiang Ye
TL;DR
This work develops a unified preconditioning framework to explain acceleration in gradient-based optimization and its interaction with regularization and normalization in deep learning. By analyzing Hessian conditioning under a parameter transform $\mathbf{p}=P\mathbf{z}$, it shows how AdaGrad, RMSProp, and Adam act as diagonal preconditioners and derives the resulting convergence implications. It also shows that $L_2$ regularization behaves differently from weight decay under preconditioning, with AdamW effectively regularizing the intrinsic transformed parameters $\mathbf{z}$. The paper further demonstrates that normalization methods (input data standardization, BatchNorm, and LayerNorm) improve conditioning by shaping the extended hidden variable matrices, linking their practical success to Hessian conditioning and offering pathways for implicit preconditioning (BNP). Together, these insights guide how to combine regularization with adaptive preconditioning and how to leverage normalization to accelerate training, with BNP proposed as a practical preconditioning-based alternative to explicit architectural changes.
Abstract
Accelerated training algorithms, such as adaptive learning rates (or preconditioning) and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers with adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches such as AdamW and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how AdaGrad, RMSProp, and Adam accelerate training by improving Hessian conditioning; (2) We explore the interaction between $L_2$-regularization and preconditioning, demonstrating that AdamW amounts to selecting the underlying intrinsic parameters for regularization, and we derive a generalization for $L_1$-regularization; and (3) We demonstrate how various normalization methods, such as input data normalization, batch normalization, and layer normalization, accelerate training by improving Hessian conditioning. Our analysis offers a unified mathematical framework for understanding various acceleration techniques and deriving appropriate regularization schemes.
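The central idea of preconditioning through a parameter transform $\mathbf{p}=P\mathbf{z}$ can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy quadratic loss $f(\mathbf{p})=\tfrac{1}{2}\mathbf{p}^T H\mathbf{p}$, the specific badly scaled matrix `H`, and the Jacobi-style choice $P=\mathrm{diag}(H)^{-1/2}$, which stands in for the adaptive diagonal scalings of AdaGrad, RMSProp, and Adam. In the transformed variable the Hessian becomes $P^T H P$, and gradient descent on $\mathbf{z}$ converges faster when that matrix is better conditioned.

```python
import numpy as np

def gd(H, x0, steps):
    """Plain gradient descent on f(x) = 0.5 * x^T H x with step 1/lambda_max(H)."""
    lr = 1.0 / np.linalg.eigvalsh(H).max()
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * (H @ x)  # gradient of the quadratic is H x
    return x

def loss(H, p):
    return 0.5 * p @ H @ p

# Hypothetical ill-conditioned Hessian: a well-conditioned core A
# distorted by very different coordinate scales D.
D = np.diag([1.0, 10.0, 100.0, 1000.0])
A = 0.9 * np.eye(4) + 0.1 * np.ones((4, 4))
H = D @ A @ D

# Diagonal (Jacobi-style) preconditioner, assumed here for illustration.
P = np.diag(1.0 / np.sqrt(np.diag(H)))
H_z = P.T @ H @ P  # Hessian with respect to z, where p = P z

cond_p = np.linalg.cond(H)    # conditioning in the original parameters
cond_z = np.linalg.cond(H_z)  # conditioning in the transformed parameters

# Run the same number of steps in both parameterizations.
p0 = np.ones(4)
z0 = np.linalg.solve(P, p0)        # z0 such that p0 = P z0
p_plain = gd(H, p0, 50)            # gradient descent on p
p_precond = P @ gd(H_z, z0, 50)    # gradient descent on z, mapped back to p

print(f"cond(H) = {cond_p:.3e}, cond(P^T H P) = {cond_z:.3f}")
print(f"loss after 50 steps: plain = {loss(H, p_plain):.3e}, "
      f"preconditioned = {loss(H, p_precond):.3e}")
```

In this constructed example the diagonal scaling removes the coordinate-scale distortion exactly, so the transformed Hessian equals the well-conditioned core `A`. For a general Hessian a diagonal preconditioner only reduces the condition number to the extent that the ill-conditioning comes from coordinate scaling.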
