Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks
Yu Bai, Jason D. Lee
TL;DR
This paper addresses the gap between NTK-based theory and the observed performance of wide neural networks by introducing a randomization technique that shifts learning away from the NTK regime toward higher-order Taylor approximations, starting with a quadratic model. It establishes that randomized two-layer nets exhibit a favorable optimization landscape, enabling efficient training via escaping-saddle algorithms and large learning rates, while maintaining generalization guarantees that match NTK bounds and can improve on them under mild distributional assumptions. The authors further show that quadratic models can offer improved sample complexity for learning certain simple functions (sometimes by a dimension factor), and extend the framework to higher-order NTKs, where the $k$-th order term can dominate the dynamics with corresponding generalization bounds. Overall, the work provides a principled approach to beyond-NTK training, with concrete results on optimization landscapes, generalization through operator-norm-based bounds, and expressivity advantages, and it opens avenues for systematically exploiting higher-order Taylor terms in neural networks.
Abstract
Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by the Neural Tangent Kernels (NTKs). NTK theory leads to concrete convergence and generalization results, yet the empirical performance of neural networks is observed to exceed that of their linearized models, suggesting that this theory is insufficient. Towards closing this gap, we investigate the training of over-parametrized neural networks that are beyond the NTK regime yet still governed by the Taylor expansion of the network. We bring forward the idea of \emph{randomizing} the neural networks, which allows them to escape their NTK and couple with quadratic models. We show that the optimization landscape of randomized two-layer networks is nice and amenable to escaping-saddle algorithms. We prove concrete generalization and expressivity results on these randomized networks, which lead to sample complexity bounds (for learning certain simple functions) that match the NTK and can in addition be better by a dimension factor when mild distributional assumptions are present. We demonstrate that our randomization technique can be generalized systematically beyond the quadratic case, by using it to find networks that are coupled with higher-order terms in their Taylor series.
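
To make the idea of escaping the NTK by randomization more concrete, the following is a minimal sketch, not the authors' construction. It assumes a two-layer network f(x) = (1/sqrt(m)) * sum_r a_r * tanh((w0_r + w_r) . x) expanded around its initialization W0, and it assumes one simple form of randomization: multiplying each neuron's weight movement w_r by an independent random sign. The names `taylor_terms` and `flip_signs`, the tanh activation, and the 1/sqrt(m) scaling are all illustrative choices that do not appear in the abstract. Under these assumptions the linear (NTK) Taylor term is odd in the signs and cancels between a sign pattern and its negation, so it averages to zero, while the quadratic term is even in the signs and is left unchanged.

```python
import math
import random


def tanh_derivs(z):
    """Return the first and second derivatives of tanh at z."""
    t = math.tanh(z)
    d1 = 1.0 - t * t
    d2 = -2.0 * t * d1
    return d1, d2


def taylor_terms(a, W0, W, x):
    """Linear and quadratic Taylor terms (in the movement W) of the
    assumed network f(x) = (1/sqrt(m)) * sum_r a_r * tanh((w0_r + w_r) . x),
    expanded around W0."""
    m = len(a)
    lin, quad = 0.0, 0.0
    for a_r, w0_r, w_r in zip(a, W0, W):
        pre = sum(p * q for p, q in zip(w0_r, x))   # w0_r . x
        move = sum(p * q for p, q in zip(w_r, x))   # w_r . x
        d1, d2 = tanh_derivs(pre)
        lin += a_r * d1 * move                      # first-order (NTK) term
        quad += 0.5 * a_r * d2 * move ** 2          # second-order term
    return lin / math.sqrt(m), quad / math.sqrt(m)


def flip_signs(W, signs):
    """Assumed randomization: scale each neuron's movement by a sign in {-1, +1}."""
    return [[s * v for v in w_r] for s, w_r in zip(signs, W)]


# Small hypothetical instance with a fixed seed.
rng = random.Random(0)
m, d = 8, 3
a = [rng.choice([-1.0, 1.0]) for _ in range(m)]
W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
W = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

signs = [rng.choice([-1.0, 1.0]) for _ in range(m)]
neg_signs = [-s for s in signs]

lin_base, quad_base = taylor_terms(a, W0, W, x)
lin_pos, quad_pos = taylor_terms(a, W0, flip_signs(W, signs), x)
lin_neg, quad_neg = taylor_terms(a, W0, flip_signs(W, neg_signs), x)

# The linear term cancels across a sign pattern and its negation;
# the quadratic term is identical for every sign pattern.
print("linear terms sum:", lin_pos + lin_neg)
print("quadratic terms:", quad_base, quad_pos, quad_neg)
```

Because every sign pattern is as likely as its negation, the pairwise cancellation shown here implies that the linear term has zero mean under this assumed randomization, which is the sense in which the quadratic term can take over as the leading signal.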
