Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers

Hao Chen, Jinghui Yuan, Hanmin Zhang

TL;DR

The work identifies a Radial Tug-of-War in AdamW where magnitude growth and directional learning conflict, potentially destabilizing training. It proposes Orthogonal Dynamics Decoupling (AdamO), which splits each parameter block into radial and tangential subspaces, applying a pure radial SGD-like update to control norm and confining Adam-style preconditioning to the tangential directions, with projection steps to maintain subspace separation. A curvature-adaptive radial step size and architecture-aware updates further tailor the optimization to the geometry of the parameter space, including scale-invariant layers. Across CIFAR-100 and Grokking-type tasks, AdamO yields stronger generalization and more stable training than AdamW or AdamP, demonstrating the value of geometry-respecting, decoupled optimizer dynamics in deep learning.

Abstract

Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push-pull interaction induces radial oscillations, injecting noise into Adam's second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam's adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
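
The radial/tangential split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the additive way the two updates are combined, and the default hyperparameters are all assumptions made for illustration, and the curvature-adaptive radial step size and architecture-aware rules are omitted. The arguments `lr_tan` and `lr_rad` loosely stand in for the tangential and radial learning rates $\eta_\theta$ and $\eta_\rho$.

```python
import numpy as np


def split_radial_tangential(w, g, eps=1e-12):
    """Return (radial gradient scalar, tangential gradient, unit direction of w)."""
    w_hat = w / (np.linalg.norm(w) + eps)
    g_rad = float(np.dot(g, w_hat))      # component of g along w
    g_tan = g - g_rad * w_hat            # remainder, orthogonal to w
    return g_rad, g_tan, w_hat


def decoupled_step(w, g, m, v, t, lr_tan=1e-3, lr_rad=1e-2,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update on a flattened parameter block w (t starts at 1).

    Hypothetical sketch: names and the update combination are assumptions,
    not taken from the paper.
    """
    g_rad, g_tan, w_hat = split_radial_tangential(w, g)

    # Adam moment estimates see only the tangential gradient.
    m = beta1 * m + (1.0 - beta1) * g_tan
    v = beta2 * v + (1.0 - beta2) * g_tan ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    step = m_hat / (np.sqrt(v_hat) + eps)

    # Elementwise preconditioning can reintroduce a radial component,
    # so project the step back onto the tangential subspace.
    step_tan = step - np.dot(step, w_hat) * w_hat

    # Radial part: a plain SGD step along w_hat with its own learning rate.
    w_new = w - lr_tan * step_tan - lr_rad * g_rad * w_hat
    return w_new, m, v


# Example usage on a random 8-dimensional block.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
g = rng.normal(size=8)
w_new, m, v = decoupled_step(w, g, np.zeros(8), np.zeros(8), t=1)
```

In this sketch the radial displacement depends only on `lr_rad` and the radial gradient, so Adam's second-moment estimate never sees the radial signal. That is the separation the abstract argues for, reproduced here only in its simplest form.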

Paper Structure

This paper contains 24 sections, 8 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Visualization of neural network training results using Adam and AdamO. AdamO exhibits completely different dynamics compared to Adam, reflected in significantly smaller norms and noticeably smoother decision boundaries.
  • Figure 2: Validation accuracy over 200 epochs on CIFAR-100 for AdamW, AdamP, and AdamO under the same training budget and scheduler. AdamO consistently attains higher validation accuracy and shows larger gains after learning-rate drops.
  • Figure 3: Hyperparameter sensitivity heatmaps on CIFAR-100. Left: AdamW grid over (learning rate, weight decay). Right: AdamO grid over (tangential LR $\eta_\theta$, radial LR $\eta_\rho$). AdamO maintains strong performance across a wider region.