Three Mechanisms of Weight Decay Regularization

Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

TL;DR

This work investigates why weight decay regularizes neural networks and shows that its effect is not fully captured by $L_2$ regularization. Across SGD, Adam, and K-FAC, the authors identify three distinct mechanisms, which depend on the optimizer and on the architecture (notably the use of batch normalization, BN): (1) increasing the effective learning rate in first-order methods, (2) approximately regularizing the input-output Jacobian norm under second-order methods, via the K-FAC Gauss-Newton norm, and (3) keeping the damping term from overpowering the curvature, which preserves second-order behavior. The findings indicate how weight decay can be used to improve generalization and to narrow the gaps between optimizers and between batch sizes, particularly for networks with BN and for second-order optimizers. The results also offer practical guidance for choosing a regularization scheme and point toward automatic adaptation of optimization hyperparameters. Overall, the paper offers a mechanism-based account of weight decay in modern neural network training.

Abstract

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization. Literal weight decay has been shown to outperform $L_2$ regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
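
To make the distinction between literal weight decay and $L_2$ regularization concrete, the following NumPy sketch applies one update step under each scheme. It is an illustrative toy, not the authors' code: the function names, the `mode` argument, and all numeric values are assumptions chosen for the example. Under plain SGD the two updates coincide; under Adam they differ, because the $L_2$ penalty gradient is rescaled by the adaptive preconditioner while literal decay shrinks the weights directly.

```python
import numpy as np

def sgd_step(w, grad, lr, wd, mode):
    """One plain SGD step; mode is "l2" or "decay" (hypothetical labels)."""
    if mode == "l2":
        # L2 regularization: the penalty gradient wd * w is added to the loss gradient.
        return w - lr * (grad + wd * w)
    # Literal weight decay: the weights are shrunk directly.
    return w - lr * grad - lr * wd * w

def adam_step(w, grad, m, v, t, lr, wd, mode, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with either L2 regularization or literal weight decay."""
    if mode == "l2":
        # The penalty gradient passes through Adam's adaptive rescaling.
        grad = grad + wd * w
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if mode == "decay":
        # The shrinkage bypasses the adaptive rescaling.
        w_new = w_new - lr * wd * w
    return w_new, m, v

# Illustrative values only.
w0 = np.array([1.0, -2.0])
g0 = np.array([0.5, 0.1])
lr, wd = 0.1, 0.01

sgd_l2 = sgd_step(w0, g0, lr, wd, "l2")
sgd_wd = sgd_step(w0, g0, lr, wd, "decay")

zeros = np.zeros_like(w0)
adam_l2, _, _ = adam_step(w0, g0, zeros, zeros, 1, lr, wd, "l2")
adam_wd, _, _ = adam_step(w0, g0, zeros, zeros, 1, lr, wd, "decay")

print("SGD  identical:", np.allclose(sgd_l2, sgd_wd))
print("Adam identical:", np.allclose(adam_l2, adam_wd))
```

The same comparison carries over to any preconditioned optimizer: the two schemes differ whenever the update direction is not simply the raw gradient.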

Paper Structure

This paper contains 22 sections, 3 theorems, 37 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

For a feed-forward neural network of depth $L$ with ReLU activation function and no biases, the network's outputs are related to the input-output Jacobian and parameter-output Jacobian as follows:
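
By positive homogeneity, a bias-free ReLU network is homogeneous of degree 1 in its input and of degree $n$ in its weights, where $n$ is the number of weight matrices, so Euler's theorem gives $f(x) = J_x x = \frac{1}{n} J_\theta \theta$. The NumPy sketch below checks this relation numerically on a small random network. The layer sizes, the helper names, and the use of $n$ (rather than the paper's depth convention) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [5, 8, 8, 3]  # input dim, two hidden widths, output dim (illustrative)
weights = [rng.standard_normal((sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]
n_mats = len(weights)  # number of weight matrices, n in the text above

def forward(ws, x):
    """Bias-free ReLU network with a linear last layer."""
    h = x
    for W in ws[:-1]:
        h = np.maximum(W @ h, 0.0)
    return ws[-1] @ h

def input_jacobian(ws, x):
    """J_x = W_n D_{n-1} ... D_1 W_1, where D_i are the 0/1 ReLU masks at x."""
    h, J = x, np.eye(x.size)
    for W in ws[:-1]:
        pre = W @ h
        mask = (pre > 0).astype(float)
        J = (mask[:, None] * W) @ J
        h = np.maximum(pre, 0.0)
    return ws[-1] @ J

x = rng.standard_normal(sizes[0])
f = forward(weights, x)

# Input-output side: f(x) = J_x x.
Jx_x = input_jacobian(weights, x) @ x

# Parameter-output side: J_theta @ theta is the directional derivative of f
# along theta itself, estimated by a central difference on a common scale factor.
eps = 1e-6
Jtheta_theta = (forward([(1 + eps) * W for W in weights], x)
                - forward([(1 - eps) * W for W in weights], x)) / (2 * eps)

print("f(x) == J_x x:            ", np.allclose(f, Jx_x))
print("f(x) == J_theta theta / n:", np.allclose(f, Jtheta_theta / n_mats, rtol=1e-6, atol=1e-6))
```

The finite-difference estimate is used only to avoid building the full parameter Jacobian; an automatic-differentiation library would give the same quantity exactly.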

Figures (9)

  • Figure 1: Comparison of test accuracy of the networks trained with different optimizers on both CIFAR-10 and CIFAR-100. We compare Weight Decay regularization to $L_2$ regularization and the Baseline (which uses neither). Here, BN+Aug denotes the use of BN and data augmentation. K-FAC-G and K-FAC-F denote K-FAC using Gauss-Newton and Fisher matrices as the preconditioner, respectively. The results suggest that weight decay leads to improved performance across different optimizers and settings.
  • Figure 2: Test accuracy as a function of training epoch for SGD and Adam on CIFAR-100 with different weight decay regularization schemes. baseline is the model without weight decay; wd-conv is the model with weight decay applied to all convolutional layers; wd-all is the model with weight decay applied to all layers; wd-fc is the model with weight decay applied to the last layer (fc). Most of the generalization effect of weight decay is due to applying it to layers with BN.
  • Figure 3: Effective learning rate of the first layer of ResNet32 trained with SGD on CIFAR-100. Without weight decay regularization, the effective learning rate decreases quickly in the beginning.
  • Figure 4: Test accuracy curves of ResNet32 on CIFAR-100. Here, wd and wn denote weight decay and weight normalization, respectively.
  • Figure 5: Relationship between K-FAC GN norm and Jacobian norm for practical deep neural networks. Each point corresponds to a network trained to $100\%$ training accuracy. Even for (nonlinear) classification networks, the K-FAC GN norm is highly correlated with both the squared Frobenius norm of the input-output Jacobian and the generalization gap.
  • ...and 4 more figures
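
The effective learning rate shown in Figure 3 can be illustrated with a toy model. The sketch below assumes the common definition $\eta / \lVert w \rVert^2$ for weights whose loss is invariant to rescaling (as for layers followed by BN) and uses a negative-cosine loss as a stand-in for such a layer. The loss, the noise model, and all hyperparameters are assumptions for illustration and do not reproduce the paper's ResNet32 experiment.

```python
import numpy as np

def scale_invariant_grad(w, u):
    """Gradient of loss(w) = -cos(w, u); the loss is unchanged when w is rescaled."""
    norm = np.linalg.norm(w)
    w_hat = w / norm
    # The gradient is orthogonal to w and shrinks as 1 / ||w||.
    return -(u - (u @ w_hat) * w_hat) / norm

def run_sgd(wd, lr=0.1, steps=200, seed=0):
    """Run noisy SGD and record the effective learning rate lr / ||w||^2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(10)
    u /= np.linalg.norm(u)
    w = rng.standard_normal(10)
    eff_lr = []
    for _ in range(steps):
        # Fresh noise on the target stands in for minibatch noise (toy assumption).
        noisy_u = u + 0.5 * rng.standard_normal(10)
        g = scale_invariant_grad(w, noisy_u)
        w = w - lr * g - lr * wd * w  # wd = 0 gives plain SGD
        eff_lr.append(lr / (w @ w))
    return np.array(eff_lr)

no_wd = run_sgd(wd=0.0)
with_wd = run_sgd(wd=0.05)

print("no weight decay:   first %.4f, last %.4f" % (no_wd[0], no_wd[-1]))
print("with weight decay: first %.4f, last %.4f" % (with_wd[0], with_wd[-1]))
```

Because the gradient is orthogonal to the weights, every plain SGD step increases $\lVert w \rVert$, so the effective learning rate can only fall; the decay term counteracts that growth.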

Theorems & Definitions (7)

  • Lemma 1: Gradient structure
  • Lemma 2: K-FAC Gauss-Newton Norm
  • Theorem 1: Approximate Jacobian norm
  • proof
  • proof
  • proof
  • proof