Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint

Yuqing Li; Tao Luo; Qixuan Zhou

TL;DR

This work investigates how the initialization scale affects gradient-descent training in overparameterized deep networks by developing a macroscopic, infinite-width analysis. It introduces a theta-lazy regime, where the initial output scale $\kappa$ (with $\lim_{m\to\infty}\frac{\log \kappa}{\log m}>0$) dictates training dynamics, yielding kernel-like behavior described by normalized Gram matrices and limiting kernels $\{\mathbf{K}^{[l]}\}$. The authors establish concentration and spectral guarantees that ensure positive minimum eigenvalues and derive an exponential decay rate for the training loss depending on $\sum_{l=1}^{L+1} \frac{\kappa^2}{\alpha_l^2} \lambda_{\mathcal{S}}$, while showing that parameter updates remain small as width grows. The results generalize NTK-style analyses to multi-layer architectures and relax the traditional $m^{-1/2}$ scaling, offering a robust framework to understand how initialization shapes training in deep networks. This theta-lazy macroscopic perspective highlights the pivotal role of the initialization scale in shaping both optimization dynamics and generalization potential across architectures.

Abstract

In this paper, we advance the understanding of neural network training dynamics by examining the interplay of the factors that weight parameters introduce at initialization. Motivated by the foundational work of Luo et al. (J. Mach. Learn. Res., Vol. 22, Iss. 1, No. 71, pp. 3327-3373), we explore the gradient descent dynamics of neural networks through the lens of macroscopic limits, analyzing their behavior as the width $m$ tends to infinity. Our study presents a unified approach with refined techniques designed for multi-layer fully connected neural networks, which can be readily extended to other neural network architectures. Our investigation reveals that gradient descent can rapidly drive deep neural networks to zero training loss, irrespective of the specific initialization schemes employed for the weight parameters, provided that the initial scale of the output function $\kappa$ surpasses a certain threshold. This regime, characterized as the theta-lazy area, accentuates the predominant influence of the initial scale $\kappa$ over other factors on the training behavior of neural networks. Furthermore, our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm, and we expand its applicability. While NTK typically assumes that $\lim_{m\to\infty}\frac{\log \kappa}{\log m}=\frac{1}{2}$ and requires each weight parameter to be scaled by the factor $\frac{1}{\sqrt{m}}$, in our theta-lazy regime we discard this factor and relax the condition to $\lim_{m\to\infty}\frac{\log \kappa}{\log m}>0$. As with NTK, the behavior of overparameterized neural networks trained by gradient descent within the theta-lazy regime can be effectively described by a specific kernel. Through rigorous analysis, our investigation illuminates the pivotal role of $\kappa$ in governing the training dynamics of neural networks.
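The claim that parameters barely move when the output scale is large can be probed numerically. The following is a minimal sketch, not the paper's construction: it assumes a two-layer ReLU network with standard Gaussian weights and no $\frac{1}{\sqrt{m}}$ output factor, so the initial output grows roughly like $\sqrt{m}$. The function name `relative_parameter_change`, the synthetic data, the widths, and the step size `0.5 / (n * m)` are all illustrative assumptions.

```python
import numpy as np

def relative_parameter_change(m, n=8, d=10, steps=1000, seed=0):
    """Train f(x) = sum_k a_k * relu(w_k . x) by full-batch gradient descent.

    Returns (relative parameter movement, initial loss, final loss).
    All sizes and the step size are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Synthetic inputs on the unit sphere and bounded targets (drawn first,
    # so the same seed gives the same data for every width m).
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.uniform(-1.0, 1.0, size=n)
    # Standard Gaussian weights, with no 1/sqrt(m) factor on the output.
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    W0, a0 = W.copy(), a.copy()
    lr = 0.5 / (n * m)  # assumed step size, kept below the stability limit

    def forward(W, a):
        H = np.maximum(X @ W.T, 0.0)  # hidden activations, shape (n, m)
        return H, H @ a

    H, f = forward(W, a)
    loss0 = 0.5 * np.sum((f - y) ** 2)
    for _ in range(steps):
        H, f = forward(W, a)
        r = f - y  # residuals, shape (n,)
        grad_a = H.T @ r
        # dL/dw_k = a_k * sum_i r_i * 1{w_k . x_i > 0} * x_i
        grad_W = ((H > 0) * r[:, None]).T @ X * a[:, None]
        a -= lr * grad_a
        W -= lr * grad_W
    _, f = forward(W, a)
    loss = 0.5 * np.sum((f - y) ** 2)
    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((a - a0) ** 2))
    size0 = np.sqrt(np.sum(W0 ** 2) + np.sum(a0 ** 2))
    return moved / size0, loss0, loss

if __name__ == "__main__":
    for width in (100, 6400):
        rel, loss0, loss = relative_parameter_change(width)
        print(width, rel, loss0, loss)
```

Under these assumptions the relative movement of the parameters is expected to shrink as the width grows while the loss still decreases, which is the qualitative signature of lazy training.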

Paper Structure (19 sections, 17 theorems, 262 equations, 1 figure)

Introduction
Related Works
Preliminaries
Notations
Problem Setup
Activation Functions and Input Samples
Technique Overview and Main Results
Normalized Outputs and Gram Matrices
Normalized Limiting Gram Matrices
Least Eigenvalue of Normalized Gram Matrices at Initial Stage
A Unified Approach for Multi-layer NNs
Theta-lazy Regime
Statement of the Theorem
Conclusions
Full Rankness of Gram Matrices
...and 4 more sections

Key Result

Proposition 1

Suppose $\sigma(\cdot)$ satisfies the conditions in the assumption on the activation function, and $\mathcal{S}$ satisfies the conditions in the assumption on the data. Then for any $i\in [n]$ and $l\in[L]$, there exist positive constants $\mu_1, \mu_2>0$ such that ... Moreover, denoting $\lambda_{\mathcal{S}}:=\min_{l\in[L+1]} \lambda_{\min}\left(\bm{K}^{[l]}\right)$, then $\lambda_{\mathcal{S}}$ ...
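The role of a least eigenvalue such as $\lambda_{\min}\left(\bm{K}^{[l]}\right)$ can be illustrated on a simpler object. The sketch below does not use the paper's normalized Gram matrices, whose definitions are not reproduced here. It swaps in the standard two-layer ReLU Gram matrix from NTK-style analyses, assuming unit-norm inputs and Gaussian hidden weights, and compares it with its known infinite-width limit $\frac{\langle x_i, x_j\rangle(\pi - \theta_{ij})}{2\pi}$. The function names and the data sizes are hypothetical.

```python
import numpy as np

def relu_gram(X, W):
    """Finite-width Gram matrix: G_ij = <x_i, x_j> * (1/m) sum_k 1{w_k.x_i>0} 1{w_k.x_j>0}."""
    act = (X @ W.T > 0).astype(float)  # activation patterns, shape (n, m)
    return (act @ act.T) / W.shape[0] * (X @ X.T)

def limiting_relu_gram(X):
    """Infinite-width limit for unit-norm inputs and Gaussian weights."""
    cos = np.clip(X @ X.T, -1.0, 1.0)
    return cos * (np.pi - np.arccos(cos)) / (2.0 * np.pi)

rng = np.random.default_rng(0)
n, d, m = 4, 6, 20000                  # illustrative sizes (assumption)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d))

G = relu_gram(X, W)
K = limiting_relu_gram(X)
min_eig_G = np.linalg.eigvalsh(G).min()  # least eigenvalue at finite width
min_eig_K = np.linalg.eigvalsh(K).min()  # least eigenvalue of the limit
max_gap = np.abs(G - K).max()            # entrywise concentration gap
print(min_eig_G, min_eig_K, max_gap)
```

In this simplified setting, a strictly positive least eigenvalue of the Gram matrix is what allows a linear convergence rate for the training loss, and the entrywise gap between `G` and `K` shrinks as `m` grows.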

Figures (1)

Figure 1: Sketch of proof for Theorem 1.

Theorems & Definitions (38)

Remark 1
Remark 2
Definition 1: Normalized NN
Definition 2
Definition 3: Normalized Gram Matrices
Definition 4
Definition 5
Definition 6: Normalized Limiting Gram Matrices
Proposition 1
proof
...and 28 more
