Depth-induced NTK: Bridging Over-parameterized Neural Networks and Deep Neural Kernels
Yong-Ming Tian, Shuang Liang, Shao-Qun Zhang, Feng-Lei Fan
TL;DR
This work addresses the gap in understanding depth within neural kernel theory by introducing a depth-induced NTK, NTK_(d), derived from a shortcut-related architecture and provably convergent to a Gaussian process as depth and shortcut count grow. It establishes existence, spectral bounds, and training invariance for NTK_(d), showing that the kernel remains well-conditioned and interpretable during training. Empirical results on sine regression and image datasets show that NTK_(d) performs competitively with the traditional width-based NTK while offering enhanced stability and a clearer link between depth and representation learning. The study advances neural kernel theory by elucidating depth-focused scaling laws and opens pathways for analyzing deep networks beyond the infinite-width paradigm.
Abstract
While deep learning has achieved remarkable success across a wide range of applications, the theoretical understanding of its representation learning remains limited. Deep neural kernels provide a principled framework for interpreting over-parameterized neural networks by mapping hierarchical feature transformations into kernel spaces, thereby combining the expressive power of deep architectures with the analytical tractability of kernel methods. Recent advances, particularly neural tangent kernels (NTKs) derived from gradient inner products, have established connections between infinitely wide neural networks and nonparametric Bayesian inference. However, the existing NTK paradigm has been largely confined to the infinite-width regime and overlooks the representational role of network depth. To address this gap, we propose a depth-induced NTK based on a shortcut-related architecture, which converges to a Gaussian process as the network depth approaches infinity. We theoretically analyze the training invariance and spectral properties of the proposed kernel, which stabilize the kernel dynamics and mitigate degeneration. Experimental results further support the effectiveness of the proposed kernel. Our findings extend neural kernel theory beyond the infinite-width regime and offer insight into the role of depth in deep learning and in scaling laws.
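For readers unfamiliar with the phrase "gradient inner products", the following minimal sketch computes the standard finite-width empirical NTK, K(x, x') = <grad_theta f(x), grad_theta f(x')>, for a toy two-layer tanh network. It is not the depth-induced NTK_(d) proposed in this work. The architecture, the scaling factors, and every name in it (init_params, param_gradient, empirical_ntk) are assumptions chosen purely for illustration.

```python
import numpy as np

def init_params(rng, d_in, width):
    # Hypothetical toy network: standard Gaussian initialization.
    W1 = rng.standard_normal((width, d_in))
    w2 = rng.standard_normal(width)
    return W1, w2

def param_gradient(params, x):
    # Gradient of f(x) = w2 . tanh(W1 x / sqrt(d_in)) / sqrt(width)
    # with respect to all parameters, flattened into one vector.
    W1, w2 = params
    width, d_in = W1.shape
    h = np.tanh(W1 @ x / np.sqrt(d_in))
    grad_w2 = h / np.sqrt(width)
    # df/dW1[i, j] = w2[i] * (1 - h[i]^2) * x[j] / (sqrt(d_in) * sqrt(width))
    grad_W1 = np.outer(w2 * (1.0 - h**2), x) / (np.sqrt(d_in) * np.sqrt(width))
    return np.concatenate([grad_W1.ravel(), grad_w2])

def empirical_ntk(params, X):
    # Stack per-example gradients into a Jacobian J; the kernel is J J^T.
    J = np.stack([param_gradient(params, x) for x in X])
    return J @ J.T

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))            # 5 toy inputs in R^3
params = init_params(rng, d_in=3, width=256)
K = empirical_ntk(params, X)               # 5 x 5 kernel matrix
```

Because K is a Gram matrix of gradient vectors, it is symmetric and positive semidefinite by construction. Its eigenvalues are the kind of quantity that the spectral analysis mentioned above is concerned with, here only for this toy width-based setting.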
