Understanding Gradient Descent through the Training Jacobian

Nora Belrose, Adam Scherlis

TL;DR

This work investigates neural network training dynamics by analyzing the Jacobian of the final parameters with respect to their initial values, revealing low-dimensional structure in the training process. The authors compute the training Jacobian $J(\theta_0)$ for small networks via forward-mode automatic differentiation (AD) and analyze its singular value spectrum and singular subspaces to understand how perturbations to the initialization propagate through training. They identify a bulk subspace with many singular values near $1$, along with chaotic and stable regions. The bulk depends on the input data but is largely independent of the labels: perturbations in the bulk minimally affect in-distribution outputs but can influence predictions far out-of-distribution. These findings provide a lens on inductive biases in neural networks and suggest that training dynamics concentrate in a structured subspace, with implications for efficiency and robustness.

Abstract

We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds. Our code is available at https://github.com/EleutherAI/training-jacobian
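
To make the central object concrete, the following is a minimal JAX sketch of a training Jacobian and its singular value decomposition. It is an illustration under stated assumptions, not the authors' implementation (their code is at the repository linked above): the toy data, the tiny tanh MLP, plain full-batch gradient descent, the step count, the learning rate, and the names `init_params`, `loss_fn`, and `train` are all hypothetical choices made for brevity.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def init_params(key, d_in=4, d_hidden=8, d_out=3):
    # Hypothetical tiny MLP with one hidden layer (sizes chosen for speed).
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
        "b1": jnp.zeros(d_hidden),
        "W2": jax.random.normal(k2, (d_hidden, d_out)) / jnp.sqrt(d_hidden),
        "b2": jnp.zeros(d_out),
    }


def loss_fn(params, x, y):
    # Mean cross-entropy of a tanh MLP on integer class labels.
    h = jnp.tanh(x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    logp = jax.nn.log_softmax(logits)
    return -jnp.mean(jnp.take_along_axis(logp, y[:, None], axis=1))


# Synthetic data (an assumption; stands in for a real dataset).
kx, ky, kp = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (32, 4))
y = jax.random.randint(ky, (32,), 0, 3)

# Flatten the parameters so that training is a map R^n -> R^n.
theta0, unravel = ravel_pytree(init_params(kp))


def train(theta, steps=50, lr=0.1):
    # Full-batch gradient descent: initial flat parameters -> final flat parameters.
    def step(t, _):
        g = jax.grad(lambda v: loss_fn(unravel(v), x, y))(t)
        return t - lr * g, None

    theta_final, _ = jax.lax.scan(step, theta, None, length=steps)
    return theta_final


# Training Jacobian: derivative of the final parameters with respect to the
# initial ones, computed with forward-mode AD through the whole training loop.
J = jax.jacfwd(train)(theta0)

# Singular values, plus left (columns of U) and right (rows of Vt) singular vectors.
U, S, Vt = jnp.linalg.svd(J)

# Absolute cosine similarity between each left/right singular vector pair.
cos = jnp.abs(jnp.sum(U * Vt.T, axis=0))

# Count of singular values within an arbitrary tolerance of one.
n_near_one = int(jnp.sum(jnp.abs(S - 1.0) < 0.05))
```

Here `S` plays the role of the singular value spectrum discussed in the abstract, and `cos` measures how closely each left singular vector matches its right counterpart. Whether a toy setup like this one reproduces the chaotic, bulk, and stable regions is not something the sketch guarantees; the tolerance `0.05` used for `n_near_one` is likewise an arbitrary illustrative threshold.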

Paper Structure

This paper contains 12 sections, 3 equations, and 7 figures.

Figures (7)

  • Figure 1: Spectral analysis of the training Jacobian for an MLP with a single hidden layer, trained using SGD with momentum for 25 epochs until training loss reached zero.
  • Figure 2: The linearization of training remains valid for much longer along bulk directions than it does along a top singular vector. The orange line indicates the Euclidean norm of the response projected onto the orthogonal complement of the span of the singular vector; if training is purely linear, this quantity should be zero.
  • Figure 3: While the bulk has virtually no effect on in-distribution behavior, it does affect predictions far out-of-distribution (Panels b and d).
  • Figure 4: The parameter-function Jacobian on test images has a fairly large approximate nullspace, which is close to the training-Jacobian bulk. This effect disappears for white-noise images.
  • Figure 5: The bulk subspaces for training trajectories starting at two different random initializations are much closer to each other than they are to a randomly sampled subspace of the same dimension. The same is true for two trajectories with the same initialization, but where one sees randomly shuffled training labels. By contrast, a trajectory which sees white noise images does not even yield a significant bulk at all.
  • ...and 2 more figures