Understanding Gradient Descent through the Training Jacobian

Nora Belrose; Adam Scherlis

TL;DR

This work studies neural network training dynamics through the training Jacobian, the Jacobian of the final parameters with respect to their initial values, and finds low-dimensional structure in how training transforms the parameters. The authors compute the training Jacobian $J(\theta_0)$ for small networks via forward-mode automatic differentiation (AD), then analyze its singular value spectrum and singular subspaces to understand how perturbations to the initialization propagate through training. The spectrum splits into three regions: a chaotic region with singular values well above $1$, a large bulk with singular values very close to $1$, and a stable region with singular values below $1$. The bulk subspace depends on the input data but is largely independent of the labels; perturbations along it have almost no effect on in-distribution outputs, yet they do change predictions far out-of-distribution. These findings offer a lens on the inductive biases of neural networks and suggest that training acts mainly on a structured subspace, with possible implications for efficiency and robustness.

Abstract

We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds. Our code is available at https://github.com/EleutherAI/training-jacobian
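
As a rough illustration of the quantity described above, the following JAX sketch treats full-batch gradient descent as a function from initial to final parameters, differentiates it with forward-mode AD, and takes the SVD of the result. Everything specific here is an assumption made for the example rather than the authors' setup: the tiny tanh MLP, the synthetic regression data, the learning rate and step count, and the helper names `init_params`, `forward`, `loss`, and `train` are all hypothetical. The released code in the repository linked above is the reference implementation.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree


def init_params(key, sizes=(2, 8, 1)):
    """Hypothetical tiny MLP; the layer sizes are illustrative only."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in)
        params.append((w, jnp.zeros(d_out)))
    return params


def forward(params, x):
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b


def loss(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)


# Synthetic regression data (an assumption, not the paper's dataset).
k_init, k_x = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k_x, (16, 2))
y = jnp.sin(x[:, :1])

# Flatten the parameters so training maps one vector to another.
theta0, unravel = ravel_pytree(init_params(k_init))


def train(theta, steps=100, lr=0.05):
    """Full-batch gradient descent as a pure function of the flat initial parameters."""
    flat_grad = jax.grad(lambda t: loss(unravel(t), x, y))

    def step(t, _):
        return t - lr * flat_grad(t), None

    theta_final, _ = jax.lax.scan(step, theta, None, length=steps)
    return theta_final


# Training Jacobian: d(final parameters) / d(initial parameters), via forward-mode AD.
J = jax.jacfwd(train)(theta0)

# Singular values (descending) with left singular vectors U[:, i] and right singular vectors Vt[i].
U, S, Vt = jnp.linalg.svd(J)

# |<u_i, v_i>| for each singular direction; values near 1 mean the left and
# right singular vectors nearly coincide.
alignment = jnp.abs(jnp.sum(U * Vt.T, axis=0))
```

Inspecting `S` alongside `alignment` is one way to look for the three regions the abstract describes, though a toy problem this small is not guaranteed to reproduce them. Forward-mode AD costs roughly one extra training run per parameter, which is consistent with the restriction to small networks.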
