Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

Atli Kosson, Bettina Messmer, Martin Jaggi

TL;DR

The paper introduces rotational equilibrium as a geometric framework for understanding how weight decay modulates neuron-level updates in modern networks with normalization. It derives equilibrium norms and angular-update predictions for several optimizers (SGDM, AdamW, Adam+$\ell_2$, Lion) within a random-walk model and proposes Rotational Variants (RVs) that directly fix the size of angular updates, emulating the benefits of weight decay without applying it. Empirically, balanced rotation across layers and neurons, promoted by Weight Standardization and RVs, explains AdamW’s advantage over Adam+$\ell_2$ and helps reduce the need for learning-rate warmup. The work further demonstrates that controlling rotation can maintain steady optimization dynamics across scale-invariant and scale-sensitive parameters, with practical implications for training stability and hyperparameter tuning.

Abstract

This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation -- a proxy for the effective learning rate -- across different layers and neurons. Our work analyzes these dynamics across optimizers like Adam, Lion, and SGD with momentum, offering a new simple perspective on training that elucidates the efficacy of widely used but poorly understood methods in deep learning. We demonstrate how balanced rotation plays a key role in the effectiveness of normalization like Weight Standardization, as well as that of AdamW over Adam with L2-regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
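The convergence to a steady state described above can be illustrated with a small toy simulation. The sketch below is not the paper's code and makes several simplifying assumptions: plain SGD without momentum, a single scale-invariant weight vector, and a random gradient that is always orthogonal to the weights with norm $1/\|{\bm{\omega}}\|$. Under these assumptions, balancing the norm growth from the gradient against the shrinkage from weight decay gives, to first order in $\eta\lambda$, an equilibrium norm of $(\eta/(2\lambda))^{1/4}$ and an angular update of $\sqrt{2\eta\lambda}$ per step. The names `simulate`, `eta`, `lam`, and `init_norm` are hypothetical, and the paper's own predictions for SGDM, AdamW, Adam+$\ell_2$, and Lion are not reproduced here.

```python
import numpy as np


def simulate(eta, lam, init_norm=1.0, dim=64, steps=6000, seed=0):
    """Toy model: SGD with weight decay on one scale-invariant weight vector.

    Assumptions (not from the paper's code): no momentum, and a random
    gradient orthogonal to w with norm 1/||w||.
    Returns (final weight norm, mean angular update over the last 500 steps).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w *= init_norm / np.linalg.norm(w)
    angles = []
    for _ in range(steps):
        g = rng.standard_normal(dim)
        g -= (g @ w) / (w @ w) * w                   # remove the radial component
        g /= np.linalg.norm(g) * np.linalg.norm(w)   # set ||g|| = 1 / ||w||
        w_new = (1.0 - eta * lam) * w - eta * g      # SGD step with weight decay
        cos = (w @ w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        w = w_new
    return float(np.linalg.norm(w)), float(np.mean(angles[-500:]))


# Three (eta, lam) pairs sharing the same product eta * lam = 5e-4.
for eta, lam in [(0.5, 1e-3), (0.05, 1e-2), (0.005, 1e-1)]:
    norm, angle = simulate(eta, lam)
    print(
        f"eta={eta:<6} lam={lam:<6} "
        f"norm={norm:.4f} (estimate {(eta / (2 * lam)) ** 0.25:.4f})  "
        f"angle={angle:.5f} (estimate {np.sqrt(2 * eta * lam):.5f})"
    )
```

In this toy setting the three runs settle at different weight norms but at nearly the same angular update, since the angular estimate depends only on the product $\eta\lambda$. Changing `init_norm` alters the transient phase but not the steady state the run approaches.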

Paper Structure

This paper contains 39 sections, 58 equations, 19 figures, 6 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Conceptual figure of the norm (left) and angular updates (right) of the weight vector ${\bm{\omega}}_t$ for different neurons (each line color) over time $t$ with a constant learning rate. Weight decay modulates and stabilizes both metrics.
  • Figure 2: Two views of equilibrium where the weight norm $\widehat{\|{\bm{\omega}}\|}$ is preserved because the gradient and weight decay components balance out on average. Left: Standard optimizer update, Equation \ref{eq:abstract_update}. Right: The total update contributions over the course of training, ${\bm{u}}$ and ${\bm{d}}$, derived from the gradient and weight decay of a given timestep, respectively.
  • Figure 3: Measured weight norms and average rotation for different layers (solid colors) in two real neural network training tasks, ResNet-50 on ImageNet-1k (SGDM) and Weight Standardized GPT2-124M on OpenWebText (AdamW). The predicted equilibrium rotation (dashed black) from Table \ref{tab:equilibrium_summary} holds very well for all scale-invariant layers. The final fully-connected layer in RN-50 (pink) is not scale-invariant: its gradient has a radial component that decreases the effective weight decay, slowing the rotation (see Section \ref{sec:scale_sensitive_dynamics}). The learning rate is constant for easier comparison.
  • Figure 4: Weight decay influences transient behavior and how fast weights are updated relative to biases. Left: Validation accuracy for ResNet-20 on CIFAR-10 for (learning rate, weight decay) pairs with a constant product ($\eta\lambda=5\!\cdot\!10^{-4}$), resulting in a specific $\widehat{\eta_r}$ but different $\widehat{\eta_g}$ and $\widehat{\|{\bm{\omega}}\|}$ (Table \ref{tab:equilibrium_summary}). Middle/Right: The weight norm $\|{\bm{\omega}}\|$ and angular update size $\eta_r$ over time for three $(\eta,\lambda)$ pairs corresponding to the colored circles on the left, with equilibrium predictions in dashed red.
  • Figure 5: The RVs display a reduced need for learning rate warmup compared to standard optimizers, offering insights into the utility of warmups. Left: GPT2-124M OpenWebText (OWT) loss curve with/without learning rate warmup. Middle: GPT2-124M OWT final validation loss for different learning rates with AdamW/RV-AdamW and with/without warmup. Right: ResNet-50 ImageNet-1k validation accuracy (short, large batch training) for different learning rates with SGDM/RV-SGDM, with/without warmup.
  • ...and 14 more figures