On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara

TL;DR

The paper asks why networks trained in standard parameterization (SP) remain stable and learn features at very large widths, even though infinite-width theory predicts instability. It introduces a finer-grained regime analysis under cross-entropy (CE) loss, identifying a controlled divergence regime in which logits diverge but gradients and activations stay bounded, and shows that this regime admits a well-defined infinite-width limit where hidden-layer features continue to evolve. By combining Tensor Program (TP)-based width-scaling arguments with extensive empirical checks across MLPs and Transformers, it derives maximal stable learning-rate exponents (notably $\alpha=\frac{1}{2}$, i.e. $\eta_n\propto n^{-1/2}$, for CE in SP) and explains the practical success of CE over MSE in SP, as well as the benefits and limitations of layerwise LR schemes like $\mu$P. The work provides actionable guidance on LR selection and clarifies when CE is advantageous, while outlining limitations and future directions for extending infinite-width proxies to more realistic training dynamics.

Abstract

Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP), meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay more slowly than theoretically predicted and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore uninteresting. In particular, we show that, under cross-entropy (CE) loss, the unstable regime comprises two distinct sub-regimes: a catastrophically unstable regime and a more benign controlled divergence regime, where logits diverge but gradients and activations remain stable. Moreover, under large learning rates at the edge of the controlled divergence regime, there exists a well-defined infinite-width limit where features continue to evolve in all the hidden layers. In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents, which provide useful guidance on optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scaling for standard initialization.
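
The abstract's claim that logits can diverge while gradients stay stable under CE, but not under MSE, can be illustrated with the loss-logit derivatives alone. The sketch below is not taken from the paper: the function names, the three-class example, and the scale factors are illustrative assumptions. It relies only on two standard facts: the CE derivative with respect to the logits is softmax(f) minus the one-hot label, so every entry lies in [-1, 1], while the derivative of the squared error 0.5 * ||f - y||^2 is f - y, which grows with the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_logit_grad(logits, label):
    """Derivative of cross-entropy w.r.t. the logits: softmax(f) - onehot(y)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def mse_logit_grad(logits, label):
    """Derivative of 0.5 * ||f - onehot(y)||^2 w.r.t. the logits: f - onehot(y)."""
    return [z - (1.0 if i == label else 0.0) for i, z in enumerate(logits)]


def max_abs(values):
    return max(abs(v) for v in values)


# Hypothetical logits, rescaled to mimic logit blow-up (values are illustrative).
base_logits = [2.0, -1.0, 0.5]
for scale in (1.0, 1e2, 1e4):
    logits = [scale * z for z in base_logits]
    ce = max_abs(ce_logit_grad(logits, label=1))
    mse = max_abs(mse_logit_grad(logits, label=1))
    print(f"scale={scale:g}  max|CE grad|={ce:.3f}  max|MSE grad|={mse:.1f}")
```

The CE column never exceeds 1 however large the logits become, whereas the MSE column grows in proportion to them. This only shows the boundedness of the signal entering backpropagation; it does not by itself reproduce the paper's width-scaling analysis.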

Paper Structure

This paper contains 47 sections, 5 theorems, 35 equations, 68 figures, 2 tables.

Key Result

Proposition 1

(Asymptotic regimes in SP, informal) For fixed $L\geq 2$, $t\geq 1$, $\eta>0$, $\alpha\in\mathbb{R}$, consider training an $(L+1)$-layer MLP of width $n$ in SP with SGD and global learning rate $\eta_n=\eta\cdot n^{-\alpha}$ for $t$ steps. Then the logits $f_t$, loss-logit derivatives $\chi_t$ [...] Under mean-squared error (MSE) loss, a stable regime as in (a) above arises if $\alpha\geq 1$. If [...]
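
As a rough aid for reading the proposition, the following sketch computes the width-scaled learning rate $\eta_n=\eta\cdot n^{-\alpha}$ and maps an exponent $\alpha$ to a regime label for SGD in SP. The function names and the `base_width` parameter are hypothetical. The thresholds are the ones stated on this page (stable for $\alpha\geq 1$; controlled divergence under CE for $\alpha\in[1/2,1)$; divergence under MSE for $\alpha<1$). Labelling $\alpha<1/2$ under CE as catastrophically unstable is an assumption inferred from the abstract, since the statement above is truncated.

```python
def width_scaled_lr(eta, n, alpha, base_width=1):
    """Global learning rate eta_n = eta * (n / base_width) ** (-alpha).

    base_width=1 recovers eta * n ** (-alpha); a different base width only
    rescales the constant eta (hypothetical convenience parameter).
    """
    return eta * (n / base_width) ** (-alpha)


def sgd_regime(alpha, loss):
    """Regime label for SGD in SP at learning-rate exponent alpha.

    The alpha < 1/2 label under CE is an assumption, not quoted from the
    truncated proposition.
    """
    if loss == "mse":
        return "stable" if alpha >= 1 else "divergent"
    if loss == "ce":
        if alpha >= 1:
            return "stable"
        if alpha >= 0.5:
            return "controlled divergence"
        return "catastrophically unstable"
    raise ValueError(f"unknown loss: {loss!r}")


for n in (256, 1024, 4096):
    # Example constants chosen for illustration only.
    print(n, width_scaled_lr(1e-4, n, alpha=0.5, base_width=256))

for alpha in (1.0, 0.75, 0.5, 0.25):
    print(alpha, sgd_regime(alpha, "ce"), "|", sgd_regime(alpha, "mse"))
```

The same exponent can therefore land in different regimes depending on the loss: $\alpha=1/2$ is labelled as controlled divergence under CE but as divergent under MSE.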

Figures (68)

  • Figure 1: Optimal learning rate exponents exceed the theoretically predicted stability threshold. For MLPs on MNIST and GPT on language data, optimal learning rates in SP decay slower than the theoretically predicted maximal stable $\eta_n=\mathcal{O}(n^{-1})$ in gray.
  • Figure 2: Alignment has minimal width-dependence. Alignment ratio between accumulated weight updates $\Delta W_t$ and incoming activations $x_t$ in RMS norm (left) and operator norm (center) as well as between initial weights $W_0$ and activation updates $\Delta x_t$ in operator norm (right) for the last layernorm layer, the first MLP layer in Transformer block 2 and the readout layer. RMS norm may be confounded by accumulated rank over the course of training (e.g. compare $(\Delta W_t,x_t)$ values for last LN). While operator norm alignment tends to decay over the course of training, it does not display strong width-dependence, even after $2000$ batches (see annotated width-dependent exponents).
  • Figure 3: Learning rate regimes for SGD in SP. Under MSE loss, training a deep MLP either remains stable ($\alpha\geq 1$) or logits and hidden-layer activations diverge ($\alpha<1$) in the infinite-width limit. Under CE loss, a controlled divergence regime $\alpha\in[{1}/{2},1)$ emerges where logits diverge, but training does not diverge. At $\alpha={1}/{2}$, hidden layers learn features width-independently.
  • Figure 4: Hidden-layer feature learning despite logit divergence in SP under large learning rates. Effective $l$-th layer update scalings $\|\Delta W_t x_t\|_{RMS}$ of MLPs trained with SGD in SP with $\eta_n=0.0001\cdot (n/256)^{-1/2}$ on CIFAR-10 under CE loss. Our TP scaling predictions are accurate: Hidden layers learn features width-independently, and input layers have vanishing feature learning. The update scaling exponents can already be accurately estimated at small width $n\le 512$.
  • Figure 5: Approximate learning rate transfer for GPT in SP. Left to center-right: Width-scaled learning rate versus training loss for GPT trained with SGD, Adam with trainable Layernorm parameters and Adam without trainable Layernorm parameters. Right: Corresponding optimal (solid) and maximal stable (dashed) learning rate exponents. For SGD, hidden-layer stability $\eta_n=\mathcal{O}(n^{-1/2})$ clearly dominates the maximal stable as well as optimal learning rate scaling. For Adam without Layernorm parameters, hidden-layer stability induces a stability threshold $\eta_n=\mathcal{O}(n^{-1})$. Trainable Layernorm parameters further stabilize large learning rates and induce larger optimal learning rate scaling $\eta_n\approx\Theta(n^{-1/2})$ toward preserving input-layer feature learning at scale.
  • ...and 63 more figures
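
Several captions above refer to width-dependent exponents, for example the update scalings $\|\Delta W_t x_t\|_{RMS}$ in Figure 4. One common way to estimate such an exponent is a least-squares fit of log(quantity) against log(width). The sketch below assumes this approach; it is not confirmed to be the authors' procedure. The helper names are hypothetical and the measurements are synthetic, generated from a known power law rather than taken from the paper.

```python
import math


def rms(vector):
    """RMS norm: square root of the mean of the squared entries."""
    return math.sqrt(sum(x * x for x in vector) / len(vector))


def fit_width_exponent(widths, values):
    """Least-squares slope of log(values) against log(widths).

    If values behave like c * n ** e, the returned slope estimates e.
    """
    xs = [math.log(n) for n in widths]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


widths = [64, 128, 256, 512]

# Synthetic data: a quantity that decays like n ** (-1/2) and one that is
# width-independent (both are made-up illustrations, not paper results).
vanishing = [3.0 * n ** -0.5 for n in widths]
width_independent = [0.7 for _ in widths]

print("vanishing exponent:", fit_width_exponent(widths, vanishing))
print("width-independent exponent:", fit_width_exponent(widths, width_independent))
print("RMS norm of [3, 4]:", rms([3.0, 4.0]))
```

A fitted exponent near zero corresponds to a width-independent quantity, while a clearly negative exponent indicates a quantity that vanishes as width grows.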

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2: Under CE loss, SP with large learning rates learns features at large width, informal
  • Proposition C.1
  • Proposition C.2: Under CE loss, SP with large learning rates learns features at large width
  • Proposition C.3: Characterizing loss decrease in SP and NTP
  • proof