Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

Greg Yang, Etai Littwin

TL;DR

The paper extends infinite-width neural network theory to adaptive optimizers by introducing NexorT, a Tensor Program framework that models nonlinear gradient processing via nonlinear outer products. It establishes Neural Tangent and Maximal Update limits for arbitrary architectures under entrywise updates, showing that adaptive optimizers preserve the dynamical dichotomy between feature learning and kernel-like behavior, with a nonlinear operator in the kernel regime. Central to the theory are the master theorem for NexorT and the bra-ket notation, which together yield a universal, architecture-agnostic description of training dynamics in the infinite-width limit. The results unify prior SGD-based analyses with adaptive optimization, classify abcd-parametrizations into feature-learning versus operator regimes, and provide a path to analyze future optimizers within a rigorous, width-asymptotic framework.

Abstract

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
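To make "entrywise updates" and "nonlinear outer products" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's construction: it considers a single linear layer $h = Wx$ trained on one sample per step, so that each step's gradient with respect to $W$ is the outer product of a backpropagated vector `dh` and the input `x`. The function names (`adam_entrywise`, `update_via_gradient_matrices`, `update_via_nonlinear_outer`) and the toy data are hypothetical. Adam with bias correction is applied to each scalar weight's own gradient history, and the resulting update matrix is checked against the matrix obtained by applying a nonlinear function directly to the coordinates of the `dh` and `x` vectors.

```python
import math

def adam_entrywise(grad_history, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's update direction for ONE scalar parameter.

    It depends only on that parameter's own gradient history
    (assumed non-empty), which is what "entrywise" means here.
    """
    m = v = 0.0
    for g in grad_history:
        m = beta1 * m + (1 - beta1) * g        # first-moment average
        v = beta2 * v + (1 - beta2) * g * g    # second-moment average
    t = len(grad_history)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)

def update_via_gradient_matrices(dh_steps, x_steps):
    """Form each step's gradient matrix dh x^T, then apply Adam per entry."""
    rows, cols = len(dh_steps[0]), len(x_steps[0])
    grads = [[[dh[i] * x[j] for j in range(cols)] for i in range(rows)]
             for dh, x in zip(dh_steps, x_steps)]
    return [[adam_entrywise([g[i][j] for g in grads]) for j in range(cols)]
            for i in range(rows)]

def update_via_nonlinear_outer(dh_steps, x_steps):
    """Same update written as a nonlinear outer product.

    Entry (i, j) is phi(a, b), where a collects coordinate i of every dh
    and b collects coordinate j of every x. No gradient matrix is formed.
    """
    def phi(a, b):
        return adam_entrywise([ai * bi for ai, bi in zip(a, b)])

    rows, cols = len(dh_steps[0]), len(x_steps[0])
    return [[phi([dh[i] for dh in dh_steps], [x[j] for x in x_steps])
             for j in range(cols)]
            for i in range(rows)]

# Hypothetical toy data: 3 training steps, 2 output units, 3 input units.
dh_steps = [[0.5, -1.0], [0.2, 0.3], [-0.4, 0.1]]
x_steps = [[1.0, 2.0, -1.0], [0.5, -0.5, 1.5], [2.0, 1.0, 0.25]]

A = update_via_gradient_matrices(dh_steps, x_steps)
B = update_via_nonlinear_outer(dh_steps, x_steps)
max_diff = max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```

The two routes agree because each entry of the gradient is a product of one coordinate of `dh` and one coordinate of `x`, so any entrywise function of the gradient history is a function of those coordinate pairs. With a plain SGD update, `phi` would reduce to an ordinary sum of products, that is, a linear outer product.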

Paper Structure

This paper contains 160 sections, 62 theorems, 252 equations, 4 figures, and 1 table.

Key Result

Proposition 2.2.3

For every $l \in [L+1]$,

Figures (4)

  • Figure 1: A Caricature of abcd-Parametrizations. The nontrivial stable faithful parametrizations form a high-dimensional polyhedron. Those on a part of its boundary admit feature learning, while all others are in the operator regime. $\mu$P is a vertex in the former, while NTP is a vertex in the latter. The overall shape is similar to yang4.
  • Figure 2: A graphical illustration of Baranyai's Theorem for $n=8$, $r=2$. Here $G^8_2$ is just the usual complete graph on 8 vertices. A perfect matching here reduces to the usual meaning on graphs: a set of 4 edges that covers all 8 vertices. Every edge above is colored, and for each color, the edges with that color form a perfect matching. Image source: https://en.wikipedia.org/wiki/Baranyai%27s_theorem#/media/File:Complete-edge-coloring.svg
  • Figure 3: Adam training dynamics of finite and infinite-width networks in NTP. We train networks of widths 64 (a), 512 (b), 7000 (c), and track the outputs for 4 random inputs (one per row) at each iteration as the network trains. We compute the output distribution over 10 independent runs for each network, and compare with the infinite-width dynamics (black curve). As the width grows, the network function converges to that of the infinite-width dynamics captured in \ref{thm:NT_MLP_memoryful}.
  • Figure 4: Adam training dynamics of finite and infinite-width networks in $\mu$P. We train networks of widths 64 (a), 512 (b), 7000 (c), and track the outputs for 4 random inputs (one per row) at each iteration as the network trains. We compute the output distribution over 10 independent runs for each network, and compare with the infinite-width dynamics (black curve). As the width grows, the network function converges to that of the infinite-width dynamics captured in \ref{thm:mulimit_MLP_general}.

Theorems & Definitions (195)

  • Remark 1.2.1: Potential Confusion
  • Remark 1.2.2: Potential Confusion
  • Remark 1.2.3: Potential Confusion
  • Definition 1.2.4: Big-O Notation
  • Definition 1.2.5
  • Definition 2.1.1
  • Definition 2.1.2
  • Definition 2.2.1
  • Example 2.2.2
  • Proposition 2.2.3: abcd Redundancy
  • ...and 185 more