Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks

Chaoyue Liu; Han Bi; Like Hui; Xiao Liu

TL;DR

This work reveals that nonlinear activation, particularly ReLU, not only enhances expressivity but also improves optimization by increasing data separation in the model gradient space and lowering the NTK condition number. The authors prove that for wide ReLU networks, similar inputs become more directionally separated in gradient space (the model-gradient angle $\phi$ exceeds the input angle $\theta_{in}$), and that deeper networks amplify this effect; in the infinite-width-then-depth limit, $\phi$ tends to $\arccos(1/4)\approx 75.5^\circ$. They also show that nonlinearity improves NTK conditioning, with the smallest eigenvalues of the NTK increasing and the condition number $\kappa$ decreasing, and that in the infinite-depth regime $\kappa$ converges to $(n+4)/3$, independent of the data. Empirical results on diverse datasets corroborate faster convergence with deeper nonlinear networks, while highlighting a potential optimization-generalization trade-off at very large depths. Overall, the paper connects activation-induced data separation and NTK conditioning to practical convergence benefits, offering guidance for designing and training wide neural networks.
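The separation claim can be illustrated numerically with a minimal sketch. It assumes a one-hidden-layer network $f(u) = \frac{1}{\sqrt{m}} \sum_k v_k\,\sigma(w_k^\top u)$ with standard Gaussian initialization and a large but finite width; this parametrization, the width `m`, the input dimension `d`, and all function names are illustrative assumptions, not the paper's exact setup. The code computes the angle between the model gradients at two inputs that are 10 degrees apart, once with the identity activation and once with ReLU.

```python
import math
import random


def grad_inner(x, z, W, v, relu):
    """Inner product of the model gradients at inputs x and z.

    Assumed (hypothetical) model: f(u) = sum_k v_k * act(w_k . u) / sqrt(m).
    """
    m = len(W)
    xz = sum(a * b for a, b in zip(x, z))
    total = 0.0
    for w_k, v_k in zip(W, v):
        px = sum(a * b for a, b in zip(w_k, x))
        pz = sum(a * b for a, b in zip(w_k, z))
        if relu:
            ax, az = max(px, 0.0), max(pz, 0.0)          # activations
            dx, dz = float(px > 0.0), float(pz > 0.0)    # activation derivatives
        else:
            ax, az, dx, dz = px, pz, 1.0, 1.0            # identity activation
        total += ax * az                     # gradient w.r.t. output weights v
        total += v_k * v_k * dx * dz * xz    # gradient w.r.t. hidden weights W
    return total / m


def gradient_angle_deg(x, z, W, v, relu):
    """Angle (in degrees) between the model gradients at x and z."""
    cosine = grad_inner(x, z, W, v, relu) / math.sqrt(
        grad_inner(x, x, W, v, relu) * grad_inner(z, z, W, v, relu)
    )
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


rng = random.Random(0)
m, d = 20000, 3                      # width and input dimension (assumed values)
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
v = [rng.gauss(0.0, 1.0) for _ in range(m)]

theta_in = 10.0                      # input angle in degrees
t = math.radians(theta_in)
x = [1.0, 0.0, 0.0]
z = [math.cos(t), math.sin(t), 0.0]

phi_linear = gradient_angle_deg(x, z, W, v, relu=False)
phi_relu = gradient_angle_deg(x, z, W, v, relu=True)
print(f"input angle:           {theta_in:.2f} deg")
print(f"linear gradient angle: {phi_linear:.2f} deg")
print(f"ReLU gradient angle:   {phi_relu:.2f} deg")
```

Under these assumptions, the linear network's gradient angle should stay close to the input angle, while the ReLU network's gradient angle should be noticeably larger. Because the width is finite, the numbers only approximate the infinite-width statements.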

Abstract

Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread use. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing networks with the nonlinear activations enabled and disabled, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of the model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of the neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and the NTK condition number equals that of the Gram matrix, regardless of the network depth. Due to the close connection between the NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient-based methods.
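The link between condition number and worst-case convergence can be seen in a small sketch. It assumes the standard linearized residual dynamics for squared loss, $r_{t+1} = (I - \eta K)\,r_t$, written in the eigenbasis of a positive definite kernel matrix $K$ with step size $\eta = 1/\lambda_{\max}$. The eigenvalue lists and the function names `condition_number` and `steps_to_tolerance` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue."""
    return max(eigenvalues) / min(eigenvalues)


def steps_to_tolerance(eigenvalues, tol=1e-6):
    """Count gradient-descent steps until the residual shrinks by a factor tol.

    Works in the eigenbasis of K, where each residual component is scaled by
    (1 - eta * lambda) per step; eta = 1 / lambda_max is an assumed step size.
    """
    eta = 1.0 / max(eigenvalues)
    residual = [1.0] * len(eigenvalues)
    initial_norm = math.sqrt(sum(r * r for r in residual))
    steps = 0
    while math.sqrt(sum(r * r for r in residual)) > tol * initial_norm:
        residual = [(1.0 - eta * lam) * r for lam, r in zip(eigenvalues, residual)]
        steps += 1
    return steps


ill_conditioned = [1.0, 0.5, 0.01]    # condition number about 100
well_conditioned = [1.0, 0.5, 0.1]    # condition number about 10

slow = steps_to_tolerance(ill_conditioned)
fast = steps_to_tolerance(well_conditioned)
print(f"kappa = {condition_number(ill_conditioned):.0f}: {slow} steps")
print(f"kappa = {condition_number(well_conditioned):.0f}: {fast} steps")
```

In this simplified setting the slowest residual component contracts by a factor of $1 - 1/\kappa$ per step, so the step count grows roughly in proportion to $\kappa$. That is the sense in which a smaller NTK condition number suggests faster worst-case convergence.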

Paper Structure (49 sections, 19 theorems, 88 equations, 6 figures, 1 table)

Introduction
Contributions.
Related work
Setup and Preliminaries
Notations for general purpose.
(Fully-connected) neural network.
Linear neural network.
Model gradient feature and neural tangent kernel (NTK).
Condition number.
Without nonlinear activation: the baseline for comparison
Better separation in model gradient space
Better feature separation.
Better separation in infinite width and depth limit.
Beyond ReLU activation.
Better NTK conditioning
...and 34 more sections

Key Result

Theorem 2.3

Consider a linear neural network $\bar{f}$. In the limit of infinite network width $m\to\infty$ and at network initialization ${\mathbf{w}}_0$, the following relations hold:

Figures (6)

Figure 1: Model gradient angles $\phi$ vs. input angle $\theta_{in}$ (according to Lemma \ref{lemma:gradient_angle_expression}). Linear neural networks (black dashed line), of any depth $L$, always have $\bar{\phi} = \theta_{in}$. ReLU neural networks of various depths have better separation, $\phi > \theta_{in}$, for similar data (i.e., small $\theta_{in}$). Deeper ReLU networks have better separation than shallower ones for similar data. All neural networks are infinitely wide.
Figure 2: Better separation for non-ReLU activation functions. Left: ReLU$^2$, Middle: GeLU, Right: tanh. All plots are model gradient angle $\phi$ vs. input $\theta_{in}$.
Figure 3: Better separation (left) and better NTK conditioning (right) of ReLU networks on various datasets. Solid lines show ReLU networks; dashed lines show linear neural networks for comparison. Left: Minimum $\phi$ (in degrees $^\circ$) vs. depth. The ReLU network has better separation of the model gradient features as depth increases. Right: NTK condition number vs. depth. The ReLU network has better NTK conditioning as depth increases. Note that $L=0$ corresponds to the case of a linear model and a linear neural network, and the NTK in this case is the Gram matrix.
Figure 4: Training curves of ReLU networks with different depths. On each of these datasets, deeper ReLU networks always converge faster than shallower ones.
Figure 5: Curve of the function $g(\theta)$. As can be seen, $g(\theta)$ is monotonic, and is approximately the identity function $y=\theta$ in the small angle region ($\theta \ll 90^\circ$).
...and 1 more figure

Theorems & Definitions (39)

Remark 2.1
Definition 2.2: Model gradient angle
Theorem 2.3
Corollary 2.4: NTK condition number without activation
Lemma 3.1
Theorem 3.2: Better separation for similar data
Corollary 3.3
Remark 3.4: Separation in distance
Theorem 3.5
Remark 3.6
...and 29 more
