The Optimization Landscape of SGD Across the Feature Learning Strength

Alexander Atanasov; Alexandru Meterez; James B. Simon; Cengiz Pehlevan

TL;DR

This paper investigates how downscaling the final layer by γ in μP-trained neural networks controls feature learning strength, revealing a transition from lazy kernel dynamics to rich representation learning. Through large-scale online SGD experiments across MLPs, CNNs, ResNets, and ViTs on diverse datasets, the authors map a γ–η phase portrait: the optimal learning rate η^* scales as γ^2 in the lazy regime (γ ≪ 1) and as γ^{2/L} in the ultra-rich regime (γ ≫ 1) for a network of depth L, with an optimizable region that depends on the compute budget. They identify dynamical phenomena such as catapults at small γ, silent alignment and progressive sharpening at large γ, and stepwise loss drops, all supported by a simple deep linear toy model that captures the observed scalings. The findings show that large γ can improve generalization given sufficient training time, and that Hessian spectra shift from γ^{-2} to γ^{-2/L} as γ grows, offering insights into representation learning in performant models. The work suggests that analytical exploration of the large-γ limit may yield practical and theoretical benefits for understanding and designing feature-learning dynamics in deep networks.

Abstract

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $γ$. Recent work has identified $γ$ as controlling the strength of feature learning. As $γ$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $γ$ across a variety of models and datasets in the online training setting. We first examine the interaction of $γ$ with the learning rate $η$, identifying several scaling regimes in the $γ$-$η$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $η^*$ scales non-trivially with $γ$. In particular, $η^* \propto γ^2$ when $γ\ll 1$ and $η^* \propto γ^{2/L}$ when $γ\gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $γ\gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $γ$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $γ$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$γ$ limit may yield useful insights into the dynamics of representation learning in performant models.
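The two learning-rate scalings stated in the abstract can be turned into a small tuning heuristic. The sketch below is illustrative only: the function name `scaled_learning_rate`, the base rate `eta_base`, and the hard switch between regimes at γ = 1 are assumptions made here for concreteness, not something the abstract specifies. It only encodes the reported proportionalities, $η^* \propto γ^2$ for small $γ$ and $η^* \propto γ^{2/L}$ for large $γ$.

```python
def scaled_learning_rate(gamma: float, depth: int, eta_base: float = 1.0) -> float:
    """Heuristic learning rate following the two power laws in the abstract.

    gamma    : feature learning strength (final layer is down-scaled by gamma)
    depth    : network depth L of a feed-forward network
    eta_base : hypothetical learning rate tuned at gamma = 1 (an assumption)
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Lazy regime (gamma < 1): eta ∝ gamma^2.
    # Ultra-rich regime (gamma >= 1): eta ∝ gamma^(2/L).
    # Switching exactly at gamma = 1 is a simplifying assumption.
    exponent = 2.0 if gamma < 1.0 else 2.0 / depth
    return eta_base * gamma ** exponent


if __name__ == "__main__":
    for g in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(f"gamma={g:g}  eta={scaled_learning_rate(g, depth=4, eta_base=0.1):.6g}")
```

A rule like this can only seed a learning-rate sweep. The abstract reports the scalings as proportionalities, so the constant in front still has to be tuned for each model and task.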

Paper Structure (62 sections, 22 equations, 45 figures, 1 table)

Introduction
Related Works
Setup and Notation
Why Train Online?
Empirical Results
Phase Portrait of $\eta$ with $\gamma$
At small $\gamma$, $\eta \propto \gamma^2$ for lazy networks
At small $\gamma$ and larger $\eta$, catapults can occur
At large $\gamma$, there are two learning rate scalings
Sufficiently large $\gamma$ NNs see improved scaling laws
Large-$\gamma$ NNs achieve good generalization, given sufficient training time
Hessian Scaling
Dynamical Phenomena
The catapult effect at small $\gamma$
Silent alignment at large $\gamma$
...and 47 more sections

Figures (45)

Figure 1: Phase portraits of the $\gamma$-$\eta$ plane for MSE and cross-entropy losses. a,b) Schematics of the regimes of network training for both losses. As the number of gradient steps increases, the lower boundaries of the convergent region descend, and the "no training" region shrinks. $L$ here denotes the network depth. All scalings are obtained analytically in \ref{sec:linear_network}. c,d) Final accuracies of deep networks trained across a grid of values of $\gamma, \eta$ for c) an MLP on MNIST-1M with MSE loss and d) a CNN on CIFAR-5M with cross-entropy loss. As shown by dashed lines, these empirics agree with our analytical diagrams. These results are robust to the choice of model and task.
Figure 2: a) Online loss curves of a CNN trained on CIFAR-5M across $\gamma$, at the second largest convergent $\eta$. $\eta$ scales as $\gamma^2$ for $\gamma < 1$ and $\gamma^{2/L}$ for $\gamma > 1$. We also observe that networks with very large $\gamma$ have a long period of flat loss before a drop. We overlay dashed lines to highlight the different power-law scalings of loss with training time observed in the lazy and rich regimes. b) Final test accuracy as a function of $\gamma$. We see that, as long as one trains long enough, larger $\gamma$ yields generalization equal to or better than that at $\gamma = 1$, but the returns are marginal past some point. We verify this for other networks in Appendix \ref{app:late_time_rich_consistency}. Error bars show the small variance over three initializations.
Figure 3: a) The top 20 eigenvalues of the Hessian at the end of training vs $\gamma$ for an NN trained with MSE. We clearly see two regimes. For small $\gamma$ we see that $\lambda_{\max}$ b) The eigenvalue vs time across several values of $\gamma$. c) The top eigenvalues across time for a lazy network. In the lazy setting, we see ten outliers, equal to the number of classes. The Hessian otherwise does not change in MSE. d) The same for a rich network. At late times, rather than seeing a small set of outliers, we see many eigenvalues grow to a sizeable range. Further plots are given in Appendix \ref{app:Hessian}.
Figure 4: a) At sufficiently small $\gamma$, we observe that the loss catapults. Smaller $\gamma$ incurs a larger catapult. Further catapult plots are in Appendix \ref{app:Catapult_plots}. b) At large $\gamma$, we do not observe catapults. At the optimal learning rate, the loss stays near a saddle for an extended period of time before suddenly dropping. We show that during this period, the alignment between the final-layer kernel and the task grows before the loss drops. Kernel-target alignment is defined in Appendix \ref{app:KTA}. Further silent alignment plots are in Appendix \ref{app:SA}. We also see characteristic step-wise loss drops in this regime.
Figure 5: a) Function outputs between pairs of networks, as described in the main text. We see that lazy networks have nearly identical function outputs, and that rich networks agree in their function outputs at the end of training as well. See Appendix \ref{app:fn_plots} for further plots. b) We study the kernel-target alignment of final-layer representations across $\gamma$ at the end of training.
...and 40 more figures
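
Figures 4 and 5 track the kernel-target alignment of final-layer representations. The paper's exact definition is given in its appendix and is not reproduced on this page, so the sketch below assumes the common uncentered form, $\langle K, YY^\top \rangle_F / (\|K\|_F \, \|YY^\top\|_F)$ with $K = \Phi\Phi^\top$ built from final-layer features $\Phi$. The function names are hypothetical, and the paper may use a centered or otherwise normalized variant.

```python
import math


def gram(rows):
    """Gram matrix G[i][j] = <rows[i], rows[j]> for a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def frobenius_inner(A, B):
    """Frobenius inner product of two equally shaped nested-list matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def kernel_target_alignment(features, targets):
    """Uncentered kernel-target alignment (assumed definition, see lead-in).

    features : one final-layer feature vector per example
    targets  : one target vector per example (e.g. one-hot labels)
    """
    K = gram(features)  # feature kernel, Phi Phi^T
    T = gram(targets)   # target kernel, Y Y^T
    return frobenius_inner(K, T) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(T, T)
    )


if __name__ == "__main__":
    labels = [[1.0, 0.0], [0.0, 1.0]]
    print(kernel_target_alignment(labels, labels))                    # features equal to labels
    print(kernel_target_alignment([[1.0], [1.0]], [[1.0], [-1.0]]))   # features blind to labels
```

Under this assumed definition, the alignment is 1 when the feature kernel is proportional to the target kernel, and 0 when the two are orthogonal in the Frobenius inner product.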
