Super Consistency of Neural Network Landscapes and Learning Rate Transfer

Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, Antonio Orvieto

TL;DR

This work studies the loss landscape through the lens of the loss Hessian, focusing on its largest eigenvalue. It finds that certain spectral properties under $\mu$P are largely independent of the size of the network and remain consistent as training progresses, a property the authors name Super Consistency of the landscape.

Abstract

Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consistently similar across very different model sizes. In this work, we study the landscape through the lens of the loss Hessian, with a focus on its largest eigenvalue (i.e. the sharpness), and find that certain spectral properties under $\mu$P are largely independent of the size of the network, and remain consistent as training progresses. We name this property Super Consistency of the landscape. On the other hand, we show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales. But what causes these differences in the sharpness dynamics? Through a connection between the Hessian's and the NTK's spectrum, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK scaling) of feature learning. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based language models trained on WikiText.
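The sharpness referred to in the abstract is the largest eigenvalue of the loss Hessian. As a rough illustration of how such a quantity can be estimated without forming the Hessian, the sketch below runs power iteration on Hessian-vector products. Everything in it is an assumption for illustration and not taken from the paper: the toy quadratic loss, the matrix `A`, and the helper names `grad`, `hvp`, and `sharpness` are hypothetical, and the finite-difference Hessian-vector product stands in for whatever method the authors actually use.

```python
import math
import random


def grad(w, A):
    """Gradient of the toy loss L(w) = 0.5 * w^T A w, which is A @ w (A symmetric)."""
    return [sum(a_ij * w_j for a_ij, w_j in zip(row, w)) for row in A]


def hvp(w, v, A, eps=1e-4):
    """Hessian-vector product H v via a central finite difference of the gradient."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad(w_plus, A), grad(w_minus, A)
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]


def sharpness(w, A, iters=200, seed=0):
    """Estimate the top Hessian eigenvalue at w by power iteration.

    Power iteration converges to the eigenvalue of largest magnitude, which
    equals the largest eigenvalue when the Hessian is positive semi-definite,
    as it is for this toy loss.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(w, v, A)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return lam


# Hypothetical 2x2 positive-definite "Hessian" used only for this example.
A = [[3.0, 1.0], [1.0, 2.0]]
print(sharpness([0.5, -0.3], A))
```

For a real network, the gradient function would be replaced by automatic differentiation of the training loss; the power-iteration loop itself would stay the same.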

Paper Structure (65 sections, 6 theorems, 93 equations, 28 figures, 1 table)

Key Result

Theorem 5.1

Let $(E,V)$ evolve with GD at stepsize $\eta = \eta_0\gamma^2$ on the loss of Eq. \ref{eq:reduced_loss}. The evolution of $(w,e,v)$ is completely described by the following self-contained equation: let the $^+$ denote updated quantities,

Figures (28)

  • Figure 1: Top row. Under $\mu$P, (left) the sharpness dynamics are largely identical throughout training across different widths, a phenomenon that we call Super Consistency. The dashed horizontal lines are the Edge of Stability thresholds. Center: The loss dynamics are similar early in training, but accumulate finite-size effects over time, thus violating Super Consistency. Right: The learning rate transfers from small to large models, suggesting that the loss landscape is Super Consistent across different model sizes. Bottom row. Under NTK parameterization (NTP), the sharpness dynamics show large discrepancies. Also, the learning rate does not transfer. The architecture is a two-layer convolutional network trained on CIFAR-10 with data augmentation, where the width corresponds to the number of filters in the convolution (see App. \ref{sec:exp-details}). Other parameters: $B=128$, epochs $=50$.
  • Figure 2: (a) The top Hessian eigenvalues exhibit a progressive increase to a threshold, with larger eigenvalues showing precise Super Consistency, while lower eigenvalues show finite-size accumulation at small width in the initial phase of training. (b) Top eigenvalues of the NTK matrix $\Theta$. As opposed to the top eigenvalues of the Hessian, these exhibit evident finite-size accumulation during training. Model: 3-layer ConvNet, $\tau=0$, $\eta_0 = 0.7$ (optimal). Details in Sec. \ref{sec:exp-details}.
  • Figure 3: (a) Convergence rate of the sharpness at finite width $N$ to the infinite limit proxy. Note that the distance approaches $0$ as the training time increases. (b) Convergence rate of the loss at finite width $N$ to the infinite limit proxy. Note that the loss accumulates finite-size effects over time and the distance to the proxy increases. (c) Convergence rate of the top NTK eigenvalues over time to the infinite limit proxy. Similar to the loss, this also accumulates finite-size effects over time. Details: infinite limit proxy is width $4096$, model is ConvNet, $\tau=0$, $\eta_0 = 0.7$.
  • Figure 4: Depth-$\mu$P extensions with top row showing transfer plots and bottom row the sharpness evolution. (a) ConvNets with $1$ layer per block exhibit both hyperparameter transfer and sharpness Super Consistency. (b) ConvNets with $2$ layers per block. The model has a lazy behavior within each block, and no transfer. The sharpness starts accumulating finite-size effects during training, violating Super Consistency. (c) ViTs also have $k>2$ blocks per layer by design, and thus have a similar behaviour. Details: (a), (b) are trained with SGD, with widths $128$ and $32$ respectively; (c) is trained with Adam, with the learning rate scaled by $1/\sqrt{L}$ \cite{yang2023tensor}. See Fig. \ref{fig:depth-independence-convergence} for convergence rates.
  • Figure 5: Evolution of the top eigenvalues of the Hessian components $\mathcal{G}$ and $\mathcal{R}$ for a two-layer linear network trained on random data under MSE loss. The vector field characterizes the evolution during training for a fixed learning rate. Top: $\mu$P. Note how $\mathcal{G}$ drives the initial change super consistently. Bottom: NTP. For wider networks the sharpening phase reduces, since the network is approaching the limit where the NTK is fixed to its value at initialization.
  • ...and 23 more figures
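
The Edge of Stability thresholds drawn in Figure 1 are not spelled out in the caption. For plain gradient descent the standard threshold on the sharpness is $2/\eta$, and the sketch below assumes that is the value meant. It uses a hypothetical one-dimensional quadratic loss (not a model from the paper) to show why: gradient descent contracts when the curvature is below $2/\eta$ and diverges when it is above. The function names `gd_on_quadratic` and `eos_threshold` are illustrative only.

```python
def eos_threshold(lr):
    """Sharpness above which plain GD with step size lr is unstable on a quadratic."""
    return 2.0 / lr


def gd_on_quadratic(curvature, lr, steps=50, w0=1.0):
    """Run GD on L(w) = 0.5 * curvature * w**2 and return the final |w|.

    Each step multiplies w by (1 - lr * curvature), so the iterates shrink
    exactly when |1 - lr * curvature| < 1, i.e. when curvature < 2 / lr.
    """
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w
    return abs(w)


lr = 0.1  # hypothetical step size chosen for the example
threshold = eos_threshold(lr)
for curvature in (0.95 * threshold, 1.05 * threshold):
    final = gd_on_quadratic(curvature, lr)
    status = "stable" if final < 1.0 else "unstable"
    print(f"curvature={curvature:.1f}, threshold={threshold:.1f}: {status}")
```

In a neural network the curvature is not fixed, which is why the figures track the sharpness over training relative to this line instead of checking it once.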

Theorems & Definitions (6)

  • Theorem 5.1: Evolution Laws
  • Lemma 5.2: GN bound
  • Proposition 5.3: EoS
  • Corollary 5.4
  • Lemma: GN bound
  • Proposition C.1