Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli
TL;DR
The paper investigates how deep networks arrive at their learned functions by jointly examining loss-landscape geometry and the time evolution of the data-dependent neural tangent kernel (NTK). Large-scale measurements across multiple architectures and datasets reveal a rapid, chaotic early phase (2 to 3 epochs) that determines which linearly connected low-loss basin training ends in. During this phase the data-dependent NTK changes quickly and acquires useful features; afterwards it evolves at constant velocity, and its performance matches that of full nonlinear training within 15% to 45% of training time. Diverse metrics, including kernel velocity, function-space distances, and changes in loss barriers, evolve in a highly correlated manner, supporting a universal picture of learning. The work also demonstrates the limits of the NTK as a complete descriptor of training at finite width and finite learning rates, motivating refined theories and more effective learning-rate schedules grounded in early-training dynamics.
Abstract
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics are well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. In multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enable it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic to stable transition in the first few epochs, that together poses challenges and opportunities for the development of more accurate theories of deep learning.
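
To make the quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a hypothetical two-layer tanh network with scalar output, f(x) = v · tanh(Wx), and computes the empirical (data-dependent) NTK as K = J Jᵀ, where J is the Jacobian of the outputs with respect to the parameters, together with the linear weight expansion around a reference point. The kernel distance used here, one minus the cosine similarity between two kernel matrices, is an assumed convention for measuring how far the NTK has moved; all function names are illustrative.

```python
import numpy as np


def init_params(rng, n_hidden, n_inputs):
    """Random weights for the toy network f(x) = v . tanh(W x) (illustrative scaling)."""
    W = rng.normal(size=(n_hidden, n_inputs)) / np.sqrt(n_inputs)
    v = rng.normal(size=n_hidden) / np.sqrt(n_hidden)
    return W, v


def forward(params, X):
    """Scalar network output for each row of X; returns shape (n,)."""
    W, v = params
    return np.tanh(X @ W.T) @ v


def jacobian(params, X):
    """Per-example gradient of the output w.r.t. all parameters; shape (n, P).

    Columns are ordered as [W.ravel(), v] so they line up with a flattened
    parameter vector.
    """
    W, v = params
    H = np.tanh(X @ W.T)                                  # (n, h) hidden activations
    dW = (v * (1.0 - H ** 2))[:, :, None] * X[:, None, :]  # (n, h, d)
    return np.concatenate([dW.reshape(len(X), -1), H], axis=1)


def empirical_ntk(params, X):
    """Data-dependent NTK Gram matrix K = J J^T at the given parameters."""
    J = jacobian(params, X)
    return J @ J.T


def kernel_distance(K1, K2):
    """Assumed convention: 1 - cosine similarity of the two kernel matrices."""
    inner = np.sum(K1 * K2)
    return 1.0 - inner / (np.linalg.norm(K1) * np.linalg.norm(K2))


def linearized_forward(params0, params, X):
    """First-order weight expansion of the network around params0."""
    (W0, v0), (W, v) = params0, params
    delta = np.concatenate([(W - W0).ravel(), v - v0])
    return forward(params0, X) + jacobian(params0, X) @ delta


def gd_step(params, X, y, lr):
    """One full-batch gradient step on the loss 0.5 * mean((f(x) - y)^2)."""
    W, v = params
    residual = forward(params, X) - y
    grad = jacobian(params, X).T @ residual / len(X)
    gW, gv = grad[:W.size].reshape(W.shape), grad[W.size:]
    return W - lr * gW, v - lr * gv


def demo(seed=0, steps=50, lr=0.1):
    """Train briefly on random data and report how far the NTK has moved."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(16, 4))
    y = rng.normal(size=16)
    params0 = init_params(rng, n_hidden=32, n_inputs=4)
    params = params0
    for _ in range(steps):
        params = gd_step(params, X, y, lr)
    return kernel_distance(empirical_ntk(params0, X), empirical_ntk(params, X))


if __name__ == "__main__":
    print("kernel distance between initial and trained NTK:", demo())
```

Under these assumptions, evaluating kernel_distance on kernels from successive checkpoints and dividing by the elapsed training time gives a finite-difference estimate of kernel velocity. Comparing forward with linearized_forward at a later checkpoint shows how far the nonlinear network has departed from its linearization at the reference point.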
