Never Saddle for Reparameterized Steepest Descent as Mirror Flow

Tom Jacobs, Chao Zhou, Rebekka Burkholz

TL;DR

Focusing on diagonal linear networks and deep diagonal linear reparameterizations, the paper shows that steeper descent facilitates both saddle-point escape and feature learning, and that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations.

Abstract

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, a regime that is uncommon in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.
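
The saddle-escape claim can be illustrated with a toy sketch, which is not the paper's experimental setup. It assumes a single depth-2 diagonal coordinate $x = m w$ with squared loss $\frac{1}{2}(m w - x^*)^2$, initialized near the saddle at the origin. The function `steps_to_fit`, the initialization, the learning rate, and the tolerance are all hypothetical choices made for illustration. Both methods use the same learning rate, which is a simplification rather than a tuned comparison.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def steps_to_fit(update, m=1e-3, w=1e-3, target=1.0, lr=1e-2,
                 tol=5e-2, max_steps=100_000):
    """Count steps until |m*w - target| < tol for the loss 0.5*(m*w - target)**2.

    `update` maps a raw gradient entry to the step direction:
    the identity gives gradient descent, `sign` gives sign descent.
    All default values are illustrative assumptions.
    """
    for step in range(max_steps):
        residual = m * w - target
        if abs(residual) < tol:
            return step
        # Gradients of the loss with respect to m and w.
        grad_m, grad_w = residual * w, residual * m
        m, w = m - lr * update(grad_m), w - lr * update(grad_w)
    return max_steps


gd_steps = steps_to_fit(lambda g: g)   # plain gradient descent
sign_steps = steps_to_fit(sign)        # sign descent
print(f"gradient descent: {gd_steps} steps, sign descent: {sign_steps} steps")
```

Near the origin the gradient entries are proportional to the small weights themselves, so gradient descent starts with tiny steps, whereas sign descent moves each weight by `lr` per step regardless of gradient magnitude. In this toy setting sign descent therefore needs fewer steps to leave the neighborhood of the saddle.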

Paper Structure

This paper contains 45 sections, 23 theorems, 98 equations, 36 figures, 12 tables, 2 algorithms.

Key Result

Theorem 3.4

(Theorem 3.9 in [Li2022ImplicitBO]) Given $(Z, Y)$, suppose the objective $f(x)$ is of the form $f(x) = f(Zx)$ for some differentiable $f : \mathbb{R}^n \rightarrow \mathbb{R}$. Initialized at $x_0 = x_{\text{init}}$, assume that the time-varying mirror flow converges to $x_{\infty}$ … $D_R$ is also known as the Bregman divergence (see the definition of the Bregman divergence) with res…
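
For orientation, the standard textbook definition of the Bregman divergence generated by a differentiable, strictly convex potential $R$ is reproduced below. The notation is an assumption and may differ from the paper's own definition, which is not included in this excerpt.

```latex
D_R(x, y) \;=\; R(x) - R(y) - \langle \nabla R(y),\, x - y \rangle
```

For example, the potential $R(x) = \tfrac{1}{2}\|x\|_2^2$ gives $D_R(x, y) = \tfrac{1}{2}\|x - y\|_2^2$, the squared Euclidean distance.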

Figures (36)

  • Figure 1: For a deep diagonal linear network initialized close to a saddle point, sign gradient flow (SignGF) converges faster than gradient flow (GF).
  • Figure 2: Illustration of different steepest mirror flows (with varied $q$). The left panel shows the metric exponent as a function of the associated depth. A high metric exponent makes it harder to escape zero and makes the flow less stable. The right panel illustrates saddle escape by plotting the solutions of the ODEs corresponding to the metric exponents, $d x_t = x_t^q dt$, with $x_0 = 0.1$ (from the origin). In conclusion, SignGF does not get stuck near saddles and still allows feature learning by entering the green strip in the left panel, effectively inducing sparsity.
  • Figure 3: The balance equation for $q \in \{1, 1.5, 2\}$ and initialization $m = 0.1$, $w = 0$. Observe that the (curved) path from the initialization to a point on the curve $m w = x$ with $x = \pm 0.1$ (in the plot) is shorter for smaller $q$, indicating faster saddle escape.
  • Figure 4: The $L_{\infty}$-margin for Adam with high and low depth $L$. The green region indicates the non-zero ground truth features. Higher depth leads to sparse ground truth recovery, in line with the corollary on balance under sign GD.
  • Figure 5: Eigenvalue spectra in fine-tuning for an ImageNet-pretrained ResNet-18 on CIFAR-10 (a) and weight sparsity in reparameterized training for a ResNet-50 on ImageNet (b).
  • ...and 31 more figures
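
The scalar ODE in the Figure 2 caption, $d x_t = x_t^q dt$, can be solved in closed form, which makes the effect of the metric exponent $q$ on saddle escape concrete. The sketch below is illustrative only: the target value `x1 = 1.0`, the function names, and the Euler step size are assumptions, while `x0 = 0.1` is taken from the caption.

```python
import math


def escape_time(q, x0=0.1, x1=1.0):
    """Time for the solution of dx/dt = x**q, x(0) = x0, to reach x1.

    Separation of variables gives
        q == 1: t = log(x1 / x0)
        q != 1: t = (x1**(1 - q) - x0**(1 - q)) / (1 - q)
    Requires 0 < x0 < x1. The target x1 is an illustrative assumption.
    """
    if not 0 < x0 < x1:
        raise ValueError("need 0 < x0 < x1")
    if math.isclose(q, 1.0):
        return math.log(x1 / x0)
    return (x1 ** (1 - q) - x0 ** (1 - q)) / (1 - q)


def euler_escape_time(q, x0=0.1, x1=1.0, dt=1e-4):
    """Forward-Euler check of escape_time (approximate, step size dt)."""
    x, t = x0, 0.0
    while x < x1:
        x += dt * x ** q
        t += dt
    return t


for q in (0.0, 1.0, 1.5, 2.0):
    print(f"q = {q}: closed form {escape_time(q):.4f}, "
          f"Euler {euler_escape_time(q):.4f}")
```

Under these assumptions the escape time grows with $q$: a larger metric exponent keeps the trajectory near zero for longer, which matches the caption's statement that a high metric exponent makes it harder to escape zero.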

Theorems & Definitions (50)

  • Definition 3.1
  • Example 3.2
  • Definition 3.3
  • Theorem 3.4
  • Definition 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Remark 4.4
  • Lemma 4.5
  • Definition 4.6
  • ...and 40 more