Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

Ioannis Bantzis, James B. Simon, Arthur Jacot

TL;DR

The paper addresses how gradient descent escapes the origin saddle in deep ReLU networks initialized with small weights, introducing escape directions and the concept of Saddle-to-Saddle dynamics. It develops a gradient-flow framework for homogeneous losses, proving that the optimal escape directions exhibit a strong depth-dependent low-rank bias, with deeper layers becoming nearly rank-1 and more linear, and that the optimal escape speed increases with depth. The work also connects these findings to Bottleneck Rank (BN-rank) ideas, showing a half-bottleneck structure in the first escape and providing MNIST experiments that corroborate the low-rank tendency in later layers, plus a counterexample illustrating that exact rank-1 escapes are not always optimal. The results offer a first step toward a rigorous Saddle-to-Saddle theory in deep ReLU networks and motivate a BN-rank incremental learning view of progressive sparsity and feature learning in deep nets. Overall, the work links initialization scale, depth, and low-rank structure to the trajectory of GD through a sequence of saddles, suggesting practical implications for understanding and guiding deep network training dynamics.

Abstract

When a deep ReLU network is initialized with small weights, GD is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions, which play a similar role as the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{\frac{1}{4}}$ larger than any other singular value. We also prove a number of related results about these escape directions. We argue that this result is a first step in proving Saddle-to-Saddle dynamics in deep ReLU networks, where GD visits a sequence of saddles with increasing bottleneck rank.
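
The singular-value gap described in the abstract can be checked numerically on any list of weight matrices. The snippet below is a minimal sketch and not the authors' code: the helper names `singular_value_gaps` and `satisfies_depth_bound` are hypothetical, layers are assumed to be indexed from $\ell = 1$, and "at least $\ell^{\frac{1}{4}}$ larger" is read here as a multiplicative factor between the first and second singular values. The stated bound concerns the optimal escape direction, so for arbitrary trained weights this is only a diagnostic.

```python
import numpy as np

def singular_value_gaps(weights):
    """Return sigma_1 / sigma_2 for each weight matrix (inf if sigma_2 is zero or absent)."""
    gaps = []
    for W in weights:
        # np.linalg.svd returns singular values sorted in descending order.
        s = np.linalg.svd(W, compute_uv=False)
        gaps.append(np.inf if len(s) < 2 or s[1] == 0 else s[0] / s[1])
    return gaps

def satisfies_depth_bound(weights):
    """Check sigma_1 / sigma_2 >= ell**0.25 per layer, with ell = 1, 2, ... (assumed indexing)."""
    gaps = singular_value_gaps(weights)
    return [bool(gap >= ell ** 0.25) for ell, gap in enumerate(gaps, start=1)]
```

For example, passing the list of per-layer weight matrices of a trained MLP to `singular_value_gaps` gives one ratio per layer, which can be compared across depth in the spirit of the MNIST figures listed further down.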

Paper Structure

This paper contains 25 sections, 12 theorems, 95 equations, 6 figures.

Key Result

Proposition 2.2

Considering gradient flow on the localized loss $\mathcal{L}_0$, if at some time $t_0$ the parameter satisfies […], then for all $t \ge t_0$ the normalized direction remains constant, and the norm $\|\theta(t)\|$ satisfies […].

Figures (6)

  • Figure 1: Deeper layers show a stronger bias toward low-rank structure than earlier layers on MNIST. Left: Training loss over training time. Vertical lines indicate the specific iterations at which singular values are extracted. Center and Right: Top 10 singular values of the weight matrices per layer $\ell$ for layers 1–6, including the input and output layers.
  • Figure 2: Depth-3 neural networks find rank-two escape directions on a toy dataset. Left: visualization of the dataset. Red and blue points have loss gradient values $G = 1$ and $G = -1$, respectively. Center: several training runs of projected gradient descent on the first-order loss objective under the parameter norm constraint $\|\theta\|^2 = L$. Runs whose objective exceeds $\sqrt{2} - 1$, the best achievable value for rank-one weights, are colored blue and deemed successful. Right: as width increases, the fraction of successful runs increases. See Figure \ref{fig:all_counterexample_runs} for a visualization of the training runs at all widths.
  • Figure 3: Deeper layers show a stronger bias toward low-rank structure than earlier layers on MNIST. Top two rows: Top 10 singular values of the weight matrices for layers 1–6, including the input and output layers, over training time. Bottom: Training loss trajectory on MNIST.
  • Figure 4: Depth-4 MLP with small initialization on MNIST. Top two rows: Top 10 singular values of the weight matrices for layers 1–4, including the input and output layers, over training time. Bottom: Training loss trajectory on MNIST.
  • Figure 5: Visualization of Equation \ref{eqn:speed_by_phi}.
  • ...and 1 more figures
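
The constrained optimization mentioned in the Figure 2 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the network is taken to be a bias-free ReLU MLP with scalar output, the first-order objective is assumed to be $-\frac{1}{n}\sum_i G_i f_\theta(x_i)$ (its sign and normalization are guesses), it is maximized by gradient steps followed by rescaling onto the sphere $\|\theta\|^2 = L$, and all function names, the step size, and the step count are hypothetical.

```python
import numpy as np

def forward(weights, X):
    """Bias-free ReLU MLP with a linear last layer; caches layer inputs and pre-activations."""
    acts, pre, h = [X], [], X
    for k, W in enumerate(weights):
        z = h @ W.T
        pre.append(z)
        h = np.maximum(z, 0.0) if k < len(weights) - 1 else z
        acts.append(h)
    return h, acts, pre

def objective_and_grads(weights, X, G):
    """Assumed objective -mean(G * f(x)) and its gradient for each weight matrix."""
    out, acts, pre = forward(weights, X)
    value = -np.mean(G * out[:, 0])
    delta = (-G / X.shape[0])[:, None]      # derivative w.r.t. the network output
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        if k < len(weights) - 1:
            delta = delta * (pre[k] > 0)    # backpropagate through the ReLU
        grads[k] = delta.T @ acts[k]
        delta = delta @ weights[k]
    return value, grads

def project(weights, radius):
    """Rescale all layers jointly so the total parameter norm equals radius."""
    norm = np.sqrt(sum(np.sum(W ** 2) for W in weights))
    return [W * (radius / norm) for W in weights]

def projected_gradient_ascent(X, G, widths, steps=500, lr=0.1, seed=0):
    """Maximize the assumed objective on the sphere ||theta||^2 = depth."""
    rng = np.random.default_rng(seed)
    depth = len(widths) - 1
    radius = np.sqrt(depth)
    weights = [rng.normal(size=(widths[k + 1], widths[k])) for k in range(depth)]
    weights = project(weights, radius)
    for _ in range(steps):
        _, grads = objective_and_grads(weights, X, G)
        weights = project([W + lr * g for W, g in zip(weights, grads)], radius)
    value, _ = objective_and_grads(weights, X, G)
    return value, weights
```

A depth-3 run would use something like `widths=[2, 16, 16, 1]` with `X` the toy inputs and `G` the vector of $\pm 1$ gradient values. Whether a given run beats the rank-one value quoted in the caption depends on the dataset and on the normalization assumed above, so this sketch makes no claim about reproducing the reported success fractions.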

Theorems & Definitions (21)

  • Definition 2.1
  • Proposition 2.2
  • Theorem 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Example 1: Rank-two optimal escape direction
  • Proposition A.1
  • proof
  • Proposition A.2
  • ...and 11 more