Training on the Edge of Stability Is Caused by Layerwise Jacobian Alignment

Mark Lowell, Catharine Kastner

TL;DR

This paper investigates why neural network training often operates on the edge of stability, showing that the rise in Hessian sharpness is driven by layerwise Jacobian alignment rather than the total gradient alone. By training with an exponential Euler scheme that tracks the true gradient flow, the authors decompose the Hessian into Gauss-Newton and residual components and show that the sharpness arises mainly from the layerwise Gauss-Newton term $G$, with the dominant contribution coming from the first layer. They quantify how alignment between layerwise Jacobians and backpropagated perturbations grows during training and scales with dataset size according to a power law, which points to a concrete mechanism behind edge-of-stability phenomena. The findings link network architecture, data complexity, and training dynamics, with implications for understanding optimization and generalization in deep networks.

Abstract

During neural network training, the sharpness of the Hessian matrix of the training loss rises until training is on the edge of stability. As a result, even nonstochastic gradient descent does not accurately model the underlying dynamical system defined by the gradient flow of the training loss. We use an exponential Euler solver to train the network without entering the edge of stability, so that we accurately approximate the true gradient descent dynamics. We demonstrate experimentally that the increase in the sharpness of the Hessian matrix is caused by the layerwise Jacobian matrices of the network becoming aligned, so that a small change in the network preactivations near the inputs of the network can cause a large change in the outputs of the network. We further demonstrate that the degree of alignment scales with the size of the dataset by a power law with a coefficient of determination between 0.74 and 0.98.

Paper Structure

This paper contains 22 sections, 2 theorems, 54 equations, 42 figures, 3 tables.

Key Result

Lemma 7.1

Let $M$ be a matrix, and let $\lambda$ be a nonzero eigenvalue of $M^TM$ with eigenvector $v$. Then $\lambda$ is also a nonzero eigenvalue of $MM^T$, with eigenvector $Mv / \|Mv\|$.
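
The lemma follows from $MM^T(Mv) = M(M^TMv) = \lambda Mv$, where $Mv \neq 0$ because $\lambda \neq 0$. As a quick sanity check, the minimal NumPy sketch below verifies this numerically on a random matrix. The function name `check_lemma`, the matrix shape, the seed, and the tolerance are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def check_lemma(M, tol=1e-8):
    """Return the residuals ||M M^T u - lam u|| for each nonzero eigenpair of M^T M.

    Hypothetical helper for illustration only; not code from the paper.
    """
    # M^T M is symmetric, so eigh returns real eigenvalues and
    # orthonormal eigenvectors (stored as columns of eigvecs).
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)
    residuals = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam <= tol:
            # The lemma only covers nonzero eigenvalues.
            continue
        u = M @ v
        u = u / np.linalg.norm(u)  # candidate eigenvector of M M^T
        residuals.append(np.linalg.norm(M @ M.T @ u - lam * u))
    return residuals


# Assumed test case: a random 5x3 matrix, so M^T M is 3x3 and M M^T is 5x5.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))
residuals = check_lemma(M)
max_residual = max(residuals)
print(f"checked {len(residuals)} eigenpairs, max residual = {max_residual:.2e}")
```

Each residual should sit at the level of floating-point round-off, which is what the lemma predicts: the nonzero spectra of $M^TM$ and $MM^T$ coincide, so the smaller of the two products can be used to compute them.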

Figures (42)

  • Figure 1: Sharpness of $\mathcal{H}_\theta\widetilde{\mathcal{L}}$ (left), $H$ (middle), and $G$ (right) when trained on CIFAR-10 with cross-entropy
  • Figure 2: $\rho(K)$ (left) and $\mathbb{E}||K||^2_{\max}$ (right) when trained on CIFAR-10 with cross-entropy
  • Figure 3: $\mathbb{E} ||K_i||^2_{\max}$ for $i = 1$ (left), 3 (middle), and 5 (right) when trained on CIFAR-10 with cross-entropy
  • Figure 4: Change in components of $\mathbb{E}||\Delta^1||^2_{\max}$ when trained on CIFAR-10 with cross-entropy: $\Pi_\chi^1$ (top left), $\mathrm{P}_{\chi,\Delta/\chi}^1$ (top middle), $\Pi_J^1$ (top right), $\mathrm{P}_{\Delta,J}^1$ (bottom left), and $\mathbb{E}||\Delta^L||^2_{\max}$ (bottom middle).
  • Figure 5: Examples of synthetic imagery. Each column contains examples from a different class.
  • ...and 37 more figures

Theorems & Definitions (2)

  • Lemma 7.1
  • Lemma 7.2