Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli

TL;DR

The paper investigates how deep networks arrive at their learned functions by jointly examining loss-landscape geometry and the time evolution of a data-dependent neural tangent kernel (NTK). Large-scale measurements across multiple architectures and datasets reveal a rapid, chaotic early phase (2 to 3 epochs) that determines which linearly connected low-loss basin training ends in. During this phase the NTK changes rapidly and acquires useful features; afterwards it evolves at constant velocity, and its performance matches that of full network training within 15% to 45% of training time. Diverse metrics (kernel velocity, function-space distances, and error barriers) evolve in a strongly correlated manner, supporting a universal picture of learning. The work also demonstrates the limits of the NTK as a complete descriptor of training at finite width and finite learning rates, motivating refined theories and more effective learning-rate schedules grounded in early-training dynamics.

Abstract

In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK. We do so through a large-scale phenomenological analysis of training, synthesizing diverse measures characterizing loss landscape geometry and NTK dynamics. In multiple neural architectures and datasets, we find these diverse measures evolve in a highly correlated manner, revealing a universal picture of the deep learning process. In this picture, deep network training exhibits a highly chaotic rapid initial transient that within 2 to 3 epochs determines the final linearly connected basin of low loss containing the end point of training. During this chaotic transient, the NTK changes rapidly, learning useful features from the training data that enables it to outperform the standard initial NTK by a factor of 3 in less than 3 to 4 epochs. After this rapid chaotic transient, the NTK changes at constant velocity, and its performance matches that of full network training in 15% to 45% of training time. Overall, our analysis reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic to stable transition in the first few epochs, that together poses challenges and opportunities for the development of more accurate theories of deep learning.
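
To make the "linear weight expansion of the network at initialization" concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the one-hidden-layer tanh model, its sizes, and all function names (`init_params`, `forward`, `jacobian`) are hypothetical and are not taken from the paper. It builds the empirical NTK as $K = J J^\top$ from the Jacobian $J = \nabla_w f_w$ and compares the linearized prediction $f_{w_0} + J \, \Delta w$ with the true $f_{w_0 + \Delta w}$ for a small weight perturbation.

```python
import numpy as np

D_IN, D_HIDDEN = 3, 16  # hypothetical toy sizes, not from the paper


def init_params(rng):
    """Toy model (assumption): one tanh hidden layer, scalar output."""
    W1 = rng.normal(size=(D_HIDDEN, D_IN)) / np.sqrt(D_IN)
    w2 = rng.normal(size=D_HIDDEN) / np.sqrt(D_HIDDEN)
    return W1, w2


def forward(params, X):
    """Network outputs f_w(x) for every row of X, shape (n,)."""
    W1, w2 = params
    return np.tanh(X @ W1.T) @ w2


def jacobian(params, X):
    """Analytic Jacobian of the outputs w.r.t. all weights, shape (n, n_params)."""
    W1, w2 = params
    H = np.tanh(X @ W1.T)                    # hidden activations, (n, h)
    dH = (1.0 - H**2) * w2                   # d f / d pre-activation, (n, h)
    J_W1 = dH[:, :, None] * X[:, None, :]    # d f / d W1, (n, h, d)
    return np.concatenate([J_W1.reshape(len(X), -1), H], axis=1)


def flatten(params):
    W1, w2 = params
    return np.concatenate([W1.ravel(), w2])


def unflatten(vec):
    split = D_HIDDEN * D_IN
    return vec[:split].reshape(D_HIDDEN, D_IN), vec[split:]


rng = np.random.default_rng(0)
params0 = init_params(rng)
X = rng.normal(size=(5, D_IN))

J0 = jacobian(params0, X)
K0 = J0 @ J0.T                               # empirical NTK on the 5 inputs

# Small weight perturbation: the linearized model should track the true one.
dw = 1e-3 * rng.normal(size=J0.shape[1])
f_lin = forward(params0, X) + J0 @ dw        # first-order Taylor expansion
f_true = forward(unflatten(flatten(params0) + dw), X)
lin_error = np.max(np.abs(f_true - f_lin))
print("max |f_true - f_lin| =", lin_error)
```

For larger perturbations the gap between `f_true` and `f_lin` grows, which is the regime where the kernel at initialization stops describing the trained network.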

Paper Structure

This paper contains 33 sections, 11 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: A conceptual overview of diverse deep learning phenomenology. (A) A schematic picture of the region of low loss (black area) in weight space as a network of high dimensional basins with lower dimensional intersections, motivated by recent work [li2018measuring, goldilocks, draxler2018essentially, garipov2018loss, fort2019large]. Two networks (red points) in different basins can be connected by a low loss nonlinear path (yellow) but not by a low loss linear path (orange). (B) A schematic view of the process of hierarchically exploring loss landscapes by spawning child networks [frankle2019linear]. A randomly initialized parent network (blue point) is trained up to a certain spawn epoch (green point), at which two (or more) child networks are spawned with identical weights and then trained independently with different SGD minibatches (bifurcating blue lines). Two children spawned later (earlier) than a very early transition time in parent training will arrive at the same (different) basin on the loss landscape. (C) A schematic view of NTK training. The black curve is the space of functions $f_w$ realizable by varying the parameters $w$ of a neural network, and full network training proceeds along this curved function manifold (blue to green to red points). NTK training linearizes the manifold at initialization (blue point) and trains along the tangent space (blue line). Such linearized training is equivalent to kernel regression in function space, where the kernel is closely related to the tangent plane along which training occurs. This panel shows a case where NTK and full nonlinear training are similar in that the kernel at initialization does not change much over learning, as shown schematically by the similar orientations of the initial (blue), intermediate (green) and final (red) tangent spaces. (D) The same as in panel (C), except now showing schematically a case where the NTK method is very different from full nonlinear training: the kernel changes considerably, as evidenced by the strong twisting of tangent spaces (blue, green and red lines), resulting in a final learned kernel (associated with the red tangent space) that is quite different from the initial random kernel (associated with the blue tangent space). (E) Consider an error landscape with a sharp and a wide minimum separated by an error barrier. With a small learning rate (bottom), a learning trajectory starting at an initial point (blue) will slowly descend through intermediate points (green) to a final position (red) in the sharp minimum, and is unable to escape it. With a larger learning rate (top), a learning trajectory that starts in the sharp minimum at a position (blue point) that is even lower than the error barrier can escape the sharp minimum.
  • Figure 2: SOTA ResNet20 trained on CIFAR10 using SGD with momentum and learning rate drops.
  • Figure 5: The error landscape and function space geometry on a 2D section defined by a pair of children (red and blue stars) and the spawning parent (purple cross) when the spawn point is in the early chaotic (top row) and late stable (bottom row) regimes of training. All other training points are projected to this $2D$ section. The left two columns show, as a function of position on this $2D$ section, the test error and the function space distance to a chosen child (blue star). The right two columns show the test error along an affine tangent plane in function space obtained by a first order Taylor expansion of $f_w$ in weight space around the weights of two different networks (the spawning parent and one of the children). A function space point along the tangent plane at $f_w$ is identified with a point on the curved 2D section in function space through the relation $f_w + \Delta w \cdot \nabla_w f_w \rightarrow f_{w+\Delta w}$.
  • Figure 6: Relation between error barrier and child function distance for ResNet20 on CIFAR10 and CIFAR100. Left panels show how final child distance (near 200 epochs) falls off with spawn epoch (red curve). The purple baseline indicates final distance between two independent parents. Right panels plot function distance as a function of error barrier. See also \ref{fig:acc_and_fndistance_vs_epoch_resnet_cifar10and100} for detailed evolution of both quantities with training rather than spawn epoch. Error bars reflect std. dev. across the last 25 epochs. The function prediction embeddings are shown in \ref{fig:tsne_bn_main}.
  • Figure 7: Relation between test error barrier and kernel velocity for a ResNet20 trained on CIFAR10 and CIFAR100. Both the test error barrier between children (left) and the kernel velocity of the parent (middle) fall off and stabilize early in time and exhibit strongly correlated dynamics (right).
  • ...and 9 more figures
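
The captions above refer to an error barrier between children and to a kernel velocity, but this overview does not give their definitions. The sketch below is therefore only one plausible reading, with every choice labeled as an assumption: the barrier is taken as the maximum error along the straight line between two weight vectors minus the mean error of the endpoints, and the kernel distance as one minus the cosine similarity of two kernel matrices under the Frobenius inner product, with velocity as that distance per unit time between consecutive snapshots. The function names and the toy double-well error are hypothetical.

```python
import numpy as np


def linear_path_barrier(error_fn, w_a, w_b, n_points=11):
    """Assumed definition: max error on the segment w_a -> w_b minus mean endpoint error."""
    alphas = np.linspace(0.0, 1.0, n_points)
    errs = np.array([error_fn((1.0 - a) * w_a + a * w_b) for a in alphas])
    return errs.max() - 0.5 * (errs[0] + errs[-1])


def kernel_distance(K1, K2):
    """Assumed definition: 1 - cosine similarity under the Frobenius inner product."""
    inner = np.sum(K1 * K2)
    return 1.0 - inner / (np.linalg.norm(K1) * np.linalg.norm(K2))


def kernel_velocity(kernels, dt=1.0):
    """Kernel distance between consecutive kernel snapshots, per unit time."""
    pairs = zip(kernels[:-1], kernels[1:])
    return np.array([kernel_distance(a, b) / dt for a, b in pairs])


def toy_error(w):
    """Hypothetical double-well error with minima at w[0] = -1 and w[0] = +1."""
    return (w[0] ** 2 - 1.0) ** 2


# Two "children" in different wells: the straight path crosses the ridge at w[0] = 0.
barrier_far = linear_path_barrier(toy_error, np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
# Two "children" in the same well: the error stays flat along the path.
barrier_near = linear_path_barrier(toy_error, np.array([1.0, 0.0]), np.array([1.0, 0.5]))

# Kernel snapshots: the kernel is unchanged for one step, then changes direction.
snapshots = [np.eye(3), 2.0 * np.eye(3), np.ones((3, 3))]
velocities = kernel_velocity(snapshots)

print("barrier (different wells):", barrier_far)
print("barrier (same well):", barrier_near)
print("kernel velocities:", velocities)
```

The cosine form makes the distance insensitive to an overall rescaling of the kernel, so the first velocity in the example is zero even though the second snapshot has twice the norm of the first.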