The lazy (NTK) and rich ($\mu$P) regimes: a gentle tutorial

Dhruva Karkada

TL;DR

Problem: understand how width and hyperparameters shape training dynamics in very wide networks. Approach: present a nonrigorous, illustrative derivation of a single richness hyperparameter $r$ that governs the size of hidden updates $\norm{\Delta{\bm{h}}}$, showing $\norm{\Delta{\bm{h}}} \sim n^{r}$ with $0 \le r \le \tfrac{1}{2}$ and a phase transition between lazy NTK behavior ($r<\tfrac{1}{2}$) and active μP behavior ($r=\tfrac{1}{2}$) in the infinite-width limit. Findings: the framework predicts weight alignment, bounded gradient magnitudes, and equivalence of parameterizations via model rescaling and layerwise learning rates, with empirical support on practical architectures. Significance: offers a unified theory linking kernel-like training to feature-learning dynamics in wide networks and suggests practical tuning via the richness scale to study representation learning.

Abstract

A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics. Theoretical analyses of these overparameterized models have recently centered around studying very wide neural networks. In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights. This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning in the active $\mu$P regime. In this paper, we explain this richness scale, synthesize recent research results into a coherent whole, offer new perspectives and intuitions, and provide empirical evidence supporting our claims. In doing so, we hope to encourage further study of the richness scale, as it may be key to developing a scientific theory of feature learning in practical deep neural networks.

Paper Structure (18 sections, 40 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: For well-behaved sufficiently-wide models, training behavior is characterized by a single richness hyperparameter $r$ prescribing how the size of the hidden representation updates $\norm{\Delta {\bm{h}}}$ scales with model width $n$. At finite width, model behavior changes smoothly between the NTK endpoint $r=0$ and the $\mu$P endpoint $r=1/2$, but in the thermodynamic limit ($n\to\infty$) there is a discontinuous phase transition separating active $\mu$P behavior from lazy $r<1/2$ behavior.
  • Figure 2: Signal propagation diagram for our wide 3-layer linear model. This diagram visually depicts the logical flow of our main derivation: we analyze the first forward pass, the first backward pass, and the second forward pass to enforce our training criteria (shown in pink) and constrain our initial nine degrees of freedom (shown in blue). We depict the forward pass signals flowing right to left to match the convention for matrix multiplication.
  • Figure 3: Width-scaling of representations matches predictions. We report measurements of 3-layer linear models learning Gaussian data across widths and richnesses. (A) Depicts how we measure the scaling exponent of some scalar $s$, which we denote $\text{scaling}(s)$. In this case, $s=\norm{\Delta{\bm{h}}_2}$; we measure it across widths and fit a line (on log-log axes) whose slope is the measured scaling exponent. We plot the average over 20 network instances and 50 training samples. (B) We verify that criterion \ref{crit:uuc} holds at all layers and richnesses. Dotted line denotes theory prediction. (C) The relative sizes of representation updates match predictions. We see that the hidden representations fall on the lower dotted diagonal as predicted (low richness yields small updates). At initialization, we see that the relative size of the output updates (blue triangles) falls on the upper diagonal (in the rich regime, initial outputs scale inversely with width). After a gradient step, the output size matches the update size. See \ref{apdx:experiments} for details.
  • Figure 4: Training outside the richness scale is unstable at large width. We report measurements of a practical convolutional network learning a minibatch of CIFAR-10. (A) Training is well-behaved on the richness scale ($0\leq r\leq 0.5$); outside it, in the shaded regions, training error either diverges or converges very slowly. This effect becomes more prominent as width increases. (B) The loss dynamics of the width-1024 architecture reveal that the $r>0.5$ regime initially has reasonably-sized outputs, but training is unstable. On the other hand, in the $r<0$ regime, the initial outputs blow up. Although gradient descent eventually corrects this, the correction timescale diverges with width. The horizontal cross-section at the dashed line is the orange curve in panel A. (C) Here, we retain the same convolutional architecture but use standard parameterization (i.e., the default PyTorch initialization). At sufficiently large width, training diverges as predicted. However, to see this effect at practical widths, we used a global learning rate $\eta=1$ (compared to $\eta=0.1$ in the other experiments). The observed stability at moderate widths and learning rates suggests that training standard neural networks may be phenomenologically similar to training $\mu$P networks. See \ref{apdx:experiments} for experimental details.
  • Figure 5: Model linearization matches predictions. For 3-layer linear models learning Gaussian data across widths and richnesses, we measure the change in the gradient across the first optimization step. (A) The change in the gradient decays with width in the kernel regime. (Here, the subscripts in $f_0$ and $f_1$ enumerate time steps, not layers.) (B) The scaling matches the prediction in \ref{eq:gradchange}.
  • ...and 1 more figure
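
The scaling-exponent measurement described in Figure 3(A) can be sketched as follows: take a scalar $s$ measured at several widths $n$, and read off $\text{scaling}(s)$ as the least-squares slope of $\log s$ against $\log n$. This is only a minimal illustration, not the authors' experimental code. The function name `scaling_exponent`, the widths, the prefactor, and the synthetic power-law data are all assumptions made for the example; real measurements would come from trained networks and would be averaged over instances and samples as the caption describes.

```python
import math


def scaling_exponent(widths, values):
    """Least-squares slope of log(values) against log(widths).

    Hypothetical helper: for data following s ~ n^a, the slope of the
    fitted line on log-log axes estimates the exponent a.
    """
    xs = [math.log(n) for n in widths]
    ys = [math.log(s) for s in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Ordinary least squares: slope = cov(x, y) / var(x).
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Synthetic stand-in for measured update norms (assumed, not real data):
# an exact power law s = 3 * n^r with richness r = 0.25.
r = 0.25
widths = [64, 128, 256, 512, 1024]
values = [3.0 * n ** r for n in widths]

estimated = scaling_exponent(widths, values)
```

On exact power-law data the fit recovers the exponent up to floating-point error, independent of the prefactor. With noisy measurements the slope is only an estimate, so it is worth checking how well a straight line describes the points before trusting it.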