The Optimization Landscape of SGD Across the Feature Learning Strength
Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan
TL;DR
This paper investigates how downscaling the final layer by γ in μP-trained neural networks controls the strength of feature learning, revealing a transition from lazy kernel dynamics to rich representation learning. Through large-scale online SGD experiments across MLPs, CNNs, ResNets, and ViTs on diverse datasets, the authors map a γ-η phase portrait: the optimal learning rate η^* scales as γ^2 in the lazy regime (γ ≪ 1) and as γ^{2/L} in the ultra-rich regime (γ ≫ 1), with an optimizable region whose extent depends on the compute budget. They identify dynamical phenomena such as catapults at small γ, silent alignment and progressive sharpening at large γ, and stepwise loss drops, all supported by a simple deep linear toy model that captures the observed scalings. The findings show that large γ can improve generalization given sufficient training time, and that Hessian spectra shift from γ^{-2} to γ^{-2/L} scaling as γ grows, offering insights into representation learning in performant models. The work suggests that analytical study of the large-γ limit may yield practical and theoretical benefits for understanding and designing feature-learning dynamics in deep networks.
Abstract
We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $γ$. Recent work has identified $γ$ as controlling the strength of feature learning. As $γ$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $γ$ across a variety of models and datasets in the online training setting. We first examine the interaction of $γ$ with the learning rate $η$, identifying several scaling regimes in the $γ$-$η$ plane, which we explain theoretically using a simple model. We find that the optimal learning rate $η^*$ scales non-trivially with $γ$. In particular, $η^* \propto γ^2$ when $γ \ll 1$ and $η^* \propto γ^{2/L}$ when $γ \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $γ \gg 1$ regime. Networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. Networks with different large values of $γ$ optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $γ$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$γ$ limit may yield useful insights into the dynamics of representation learning in performant models.
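As a rough illustration of the learning rate scaling stated above, the following minimal Python sketch rescales a learning rate tuned at $γ = 1$ to other values of $γ$. The helper names `scaled_learning_rate` and `downscaled_output`, the reference value `eta_base`, and the sharp crossover at exactly $γ = 1$ are assumptions made for this example; the abstract only states the two asymptotic power laws, not the constants or the shape of the transition between them.

```python
def scaled_learning_rate(gamma: float, depth: int, eta_base: float = 0.1) -> float:
    """Hypothetical helper: rescale a learning rate tuned at gamma = 1.

    Applies eta ~ gamma**2 for gamma < 1 (lazy regime) and
    eta ~ gamma**(2 / depth) for gamma >= 1 (ultra-rich regime).
    The crossover at gamma = 1 and the prefactor eta_base are
    assumptions of this sketch, not values taken from the paper.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    exponent = 2.0 if gamma < 1.0 else 2.0 / depth
    return eta_base * gamma ** exponent


def downscaled_output(raw_output: float, gamma: float) -> float:
    """Hypothetical helper: final-layer output divided by gamma."""
    return raw_output / gamma


if __name__ == "__main__":
    # Sweep gamma over several decades for an assumed depth-3 network.
    for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
        eta = scaled_learning_rate(gamma, depth=3)
        print(f"gamma={gamma:g}  eta={eta:.3e}")
```

In practice such a rule would at most give a starting point for a sweep over $η$ at each $γ$, since the power laws describe how the optimum shifts and not its absolute value.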
