Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit
Greg Yang, Etai Littwin
TL;DR
The paper extends infinite-width neural network theory to adaptive optimizers by introducing NEXORT, a Tensor Program language that models nonlinear gradient processing via nonlinear outer products. It derives the Neural Tangent and Maximal Update limits for arbitrary architectures under entrywise updates, and shows that adaptive optimizers preserve the dichotomy between feature learning and kernel-like behavior, with a nonlinear operator taking the place of the kernel in the kernel regime. Central to the theory are the Master Theorem for NEXORT and the bra-ket notation, which together give an architecture-agnostic description of training dynamics in the infinite-width limit. The results unify prior SGD-based analyses with adaptive optimization, classify abcd-parametrizations into feature-learning and operator regimes, and provide a route to analyzing future optimizers within a rigorous, width-asymptotic framework.
Abstract
Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
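To make the phrase "process gradients into updates" concrete, here is a minimal NumPy sketch, not taken from the paper. It assumes a single weight matrix whose per-sample gradient is a rank-one outer product, and compares a linear SGD update with two entrywise nonlinear ones. The names `dh`, `x`, and `adam_like_first_step`, and all hyperparameter values, are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
dh = rng.standard_normal(n)  # hypothetical backpropagated signal at the layer output
x = rng.standard_normal(n)   # hypothetical activation feeding the layer

# For one sample, the gradient of a weight matrix is the outer product dh x^T.
grad = np.outer(dh, x)

# SGD: the update is a linear function of the gradient.
sgd_update = -0.1 * grad

# SignSGD: the nonlinearity is applied entry by entry.
sign_update = -0.1 * np.sign(grad)

# For a rank-one gradient, sign factorises across the outer product.
factorised = np.allclose(np.sign(grad), np.outer(np.sign(dh), np.sign(x)))

def adam_like_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam-style step from zero state, without bias correction (an assumption).

    Every operation is elementwise, so entry (i, j) of the result
    depends only on entry (i, j) of g.
    """
    m = (1.0 - beta1) * g       # first-moment estimate
    v = (1.0 - beta2) * g**2    # second-moment estimate
    return -lr * m / (np.sqrt(v) + eps)

adam_update = adam_like_first_step(grad)
print("sign factorises over the rank-one gradient:", factorised)
```

Perturbing a single entry of `grad` and calling `adam_like_first_step` again leaves every other entry of the update unchanged, which is the sense in which such an update is entrywise.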
