Three Mechanisms of Weight Decay Regularization
Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse
TL;DR
This work investigates why weight decay regularizes neural networks and shows that its effects are not fully captured by $L_2$ regularization. Across SGD, Adam, and K-FAC, weight decay regularizes through three distinct mechanisms that depend on the optimizer and the architecture (notably the presence of batch normalization, BN): (1) increasing the effective learning rate in first-order methods, (2) approximately regularizing the input-output Jacobian norm (a Gauss-Newton-style penalty) under second-order methods, and (3) keeping the damping term from overpowering the curvature, which preserves second-order behavior. The findings show how weight decay can be used to improve generalization and to narrow the generalization gaps between optimizers and across batch sizes, and they point toward more principled regularization strategies, especially for networks with BN and for second-order training. The results also suggest practical guidance for choosing a regularization scheme and highlight avenues for automatically adapting optimization hyperparameters. Overall, the paper offers a mechanism-based account of weight decay in modern neural network training.
Abstract
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization. Literal weight decay has been shown to outperform $L_2$ regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
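
The statement that literal weight decay and $L_2$ regularization differ for some optimizers can be illustrated with a minimal sketch. It is not the paper's code: the function names, the scalar setting, and the stripped-down Adam-style update (no momentum, no bias correction) are assumptions made for illustration. For plain SGD the two updates coincide; once the gradient is rescaled by the optimizer, they do not.

```python
import math


def sgd_l2(w, grad, lr, lam):
    # L2 regularization: the penalty gradient lam * w is added to the loss gradient.
    return w - lr * (grad + lam * w)


def sgd_weight_decay(w, grad, lr, lam):
    # Literal weight decay: shrink the weight directly, then take the gradient step.
    return (1.0 - lr * lam) * w - lr * grad


def adam_like_step(w, grad, lr, lam, decoupled, eps=1e-8):
    # Simplified Adam-style step (assumption: no momentum, no bias correction).
    # The gradient is divided by its own magnitude, as a stand-in for
    # Adam's per-parameter rescaling.
    g = grad if decoupled else grad + lam * w  # L2 folds the penalty into g
    w_new = w - lr * g / (math.sqrt(g * g) + eps)
    if decoupled:
        # Literal weight decay is applied outside the rescaling.
        w_new -= lr * lam * w
    return w_new


if __name__ == "__main__":
    w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
    print("SGD  L2 vs decay:", sgd_l2(w, grad, lr, lam), sgd_weight_decay(w, grad, lr, lam))
    print("Adam L2 vs decay:",
          adam_like_step(w, grad, lr, lam, decoupled=False),
          adam_like_step(w, grad, lr, lam, decoupled=True))
```

In the Adam-style case the $L_2$ penalty is rescaled together with the loss gradient, so its shrinkage effect largely disappears, whereas the decoupled decay term still shrinks the weight in proportion to its size.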
