Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks
Chaoyue Liu, Han Bi, Like Hui, Xiao Liu
TL;DR
This work shows that nonlinear activation, particularly ReLU, not only enhances expressivity but also improves optimization: it increases data separation in the model gradient space and improves NTK conditioning, i.e., lowers the NTK condition number. The authors prove that in wide ReLU networks similar inputs become more directionally separated in gradient space (the model-gradient angle $\phi$ exceeds the input angle $\theta_{in}$), and that depth amplifies this effect; in the infinite-width-then-depth limit, $\phi$ tends to $\arccos(1/4)\approx 75.5^\circ$. They also show that nonlinearity improves NTK conditioning, with the smallest eigenvalues of the NTK increasing and the condition number $\kappa$ decreasing, and that in the infinite-depth regime $\kappa$ converges to $(n+4)/3$, independent of the data, where $n$ is the number of training samples. Experiments on diverse datasets corroborate faster convergence for deeper nonlinear networks, while pointing to a potential optimization-generalization trade-off at very large depths. Overall, the paper connects activation-induced data separation and NTK conditioning to practical convergence benefits, offering guidance for designing and training wide neural networks.
Abstract
Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread use. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing networks with the nonlinear activations enabled and disabled, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of the model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of the neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated by a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and the NTK condition number equals that of the Gram matrix of the inputs, regardless of the network depth. Due to the close connection between the NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient-based methods.
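
The separation effect described above can be illustrated numerically. The sketch below is not the authors' code. It assumes the standard infinite-width NTK recursion for a bias-free, fully connected network with unit-norm inputs and a ReLU rescaled by $\sqrt{2}$ so that feature norms are preserved; the paper's exact setup may differ. The function name `gradient_angle_deg` and its parameters are hypothetical, chosen for this illustration only.

```python
import math


def gradient_angle_deg(theta_in_deg, depth, relu=True):
    """Angle (degrees) between the model gradients of two unit-norm inputs
    in the infinite-width limit, for a network with `depth` hidden layers.

    Assumption: standard NTK recursion, no biases, ReLU scaled by sqrt(2).
    """
    theta = math.radians(theta_in_deg)  # angle between current-layer features
    sigma = math.cos(theta)             # normalized feature covariance
    ntk = sigma                         # off-diagonal NTK entry at layer 0
    for _ in range(depth):
        if relu:
            # Arc-cosine kernel formulas for ReLU, evaluated at the
            # previous layer's feature angle.
            sigma_dot = (math.pi - theta) / math.pi
            sigma = (math.sin(theta)
                     + (math.pi - theta) * math.cos(theta)) / math.pi
        else:
            # Linear network: derivative is 1 and the covariance is unchanged.
            sigma_dot = 1.0
        ntk = sigma + ntk * sigma_dot
        theta = math.acos(max(-1.0, min(1.0, sigma)))
    # With this normalization the diagonal NTK entry equals depth + 1.
    cos_phi = ntk / (depth + 1)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_phi))))


if __name__ == "__main__":
    theta_in = 10.0  # two similar inputs, 10 degrees apart
    for depth in (1, 2, 5, 50, 1000):
        phi_relu = gradient_angle_deg(theta_in, depth, relu=True)
        phi_lin = gradient_angle_deg(theta_in, depth, relu=False)
        print(f"depth={depth:5d}  ReLU phi={phi_relu:6.2f}  linear phi={phi_lin:6.2f}")
    print("arccos(1/4) in degrees:", math.degrees(math.acos(0.25)))
```

Under these assumptions, the linear branch returns the input angle at every depth, while the ReLU branch gives a larger angle that grows with depth and slowly approaches $\arccos(1/4)$. For inputs that are already far apart, the ReLU gradient angle need not exceed the input angle, which is consistent with the statement being about similar data.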
