The lazy (NTK) and rich ($\mu$P) regimes: a gentle tutorial
Dhruva Karkada
TL;DR
Problem: understand how width and hyperparameters shape training dynamics in very wide networks. Approach: present a nonrigorous, illustrative derivation of a single richness hyperparameter $r$ that governs the size of hidden updates $\norm{\Delta{\bm{h}}}$, showing $\norm{\Delta{\bm{h}}} \sim n^{r}$ with $0 \le r \le \tfrac{1}{2}$ and a phase transition between lazy NTK behavior ($r<\tfrac{1}{2}$) and active $\mu$P behavior ($r=\tfrac{1}{2}$) in the infinite-width limit. Findings: the framework predicts weight alignment, bounded gradient magnitudes, and equivalence of parameterizations via model rescaling and layerwise learning rates, with empirical support on practical architectures. Significance: offers a unified theory linking kernel-like training to feature-learning dynamics in wide networks and suggests practical tuning via the richness scale to study representation learning.
Abstract
A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics. Theoretical analyses of these overparameterized models have recently centered around studying very wide neural networks. In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights. This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning in the active $\mu$P regime. In this paper, we explain this richness scale, synthesize recent research results into a coherent whole, offer new perspectives and intuitions, and provide empirical evidence supporting our claims. In doing so, we hope to encourage further study of the richness scale, as it may be key to developing a scientific theory of feature learning in practical deep neural networks.
