On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara
TL;DR
The paper tackles why networks trained in standard parameterization (SP) retain stable learning and exhibit feature learning at very large widths, despite infinite-width theory predicting instability. It introduces a finer-grained regime analysis under cross-entropy (CE) loss, identifying a controlled-divergence window in which logits diverge but gradients and activations stay bounded, and shows that this regime admits a well-defined infinite-width limit where hidden-layer features continue to evolve. By combining Tensor Programs (TP) width-scaling arguments with extensive empirical checks across MLPs and Transformers, it derives maximal stable learning-rate exponents (notably $\eta = \frac{1}{2}$ for CE in SP) and explains the practical success of CE over MSE in SP, as well as the benefits and limitations of layerwise learning-rate schemes like μP. The work provides actionable guidance on learning-rate selection and clarifies when CE is advantageous, while outlining limitations and future directions for extending infinite-width proxies to more realistic training dynamics.
Abstract
Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP), meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay more slowly than theoretically predicted, and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore uninteresting. In particular, we show that, under cross-entropy (CE) loss, the unstable regime comprises two distinct sub-regimes: a catastrophically unstable regime and a more benign controlled divergence regime, where logits diverge but gradients and activations remain stable. Moreover, under large learning rates at the edge of the controlled divergence regime, there exists a well-defined infinite-width limit where features continue to evolve in all the hidden layers. In experiments across optimizers, architectures, and data modalities, we validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss. Our empirical evidence suggests that width-scaling considerations are surprisingly useful for predicting empirically maximal stable learning rate exponents, which in turn provide useful guidance on optimal learning rate exponents. Finally, our analysis clarifies the effectiveness and limitations of recently proposed layerwise learning rate scaling for standard initialization.
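The contrast between CE and MSE loss described above can be illustrated with one elementary fact: the gradient of CE with respect to the logits is softmax(z) minus the one-hot label, so each entry stays in [-1, 1] however large the logits become, whereas the gradient of a squared-error loss grows linearly with the logits. The following is a minimal sketch of that fact only, not the paper's analysis or experimental setup; the function names, the example logits, and the 0.5-scaled squared-error convention are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_logit_grad(logits, label):
    """Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def mse_logit_grad(logits, label):
    """Gradient of 0.5 * ||z - onehot(label)||^2 w.r.t. the logits (assumed convention)."""
    return [z - (1.0 if i == label else 0.0) for i, z in enumerate(logits)]


def max_abs(values):
    """Largest absolute entry of a list."""
    return max(abs(v) for v in values)


# Hypothetical logits, rescaled to mimic logits that grow without bound.
base_logits = [1.0, -2.0, 0.5]
for scale in (1.0, 1e2, 1e4):
    logits = [scale * z for z in base_logits]
    print(
        f"scale={scale:g}  "
        f"max|dCE/dz|={max_abs(ce_logit_grad(logits, 1)):.3f}  "
        f"max|dMSE/dz|={max_abs(mse_logit_grad(logits, 1)):.1f}"
    )
```

As the scale grows, the CE column stays at or below 1 while the MSE column grows in proportion to the logits. This only concerns the signal entering backpropagation at the output layer; it says nothing by itself about the width-dependent scaling of hidden-layer updates that the paper analyzes.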
