Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, Franck Gabriel

TL;DR

This work analyzes how the initialization variance, scaled with the width, determines the training dynamics of deep linear networks (DLNs). It identifies NTK-like behavior for γ<1 and a less-understood regime for γ>1, culminating in the limit γ→∞, where the authors conjecture Saddle-to-Saddle dynamics: gradient descent visits a sequence of saddles of increasing rank, which implies a greedy low-rank bias toward sparse solutions. The main theorem establishes the first step, a path from the origin to a rank-1 saddle, and a conjecture extends it to a full saddle-to-saddle trajectory with symmetry-driven inclusions and rotations. The framework connects the choice of regime to implicit sparsity, symmetry, and a greedy low-rank algorithm, suggesting a way to understand and potentially exploit low-rank biases in training. Overall, the paper bridges the kernel and active learning regimes, shows how initialization controls the way DLN training traverses the loss landscape, and informs possible extensions to non-linear architectures.

Abstract

The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2=w^{-\gamma}$ as $w\to\infty$: for large variance ($\gamma<1$), $\theta_0$ is very close to a global minimum but far from any saddle point, and for small variance ($\gamma>1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma\to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
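
To make the scaling $\sigma^2=w^{-\gamma}$ concrete, here is a minimal NumPy sketch of the initialization described in the abstract. The helper names `init_dln` and `end_to_end`, the layer sizes, and the two values of $\gamma$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def init_dln(widths, gamma, rng):
    """Sample W_1..W_L with i.i.d. centered Gaussian entries of variance w**(-gamma).

    widths = [n_0, w, ..., w, n_L]; the hidden width w is read from widths[1]
    (an assumption of this sketch: all hidden layers share the same width).
    """
    w = widths[1]
    sigma = w ** (-gamma / 2.0)  # standard deviation, so that sigma**2 = w**(-gamma)
    return [sigma * rng.standard_normal((widths[i + 1], widths[i]))
            for i in range(len(widths) - 1)]

def end_to_end(weights):
    """Linear map A = W_L ... W_1 computed by the network."""
    A = weights[0]
    for W in weights[1:]:
        A = W @ A
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    widths = [10, 100, 100, 100, 10]  # depth L = 4, hidden width w = 100
    for gamma in (0.5, 2.0):
        A = end_to_end(init_dln(widths, gamma, rng))
        print(f"gamma={gamma}: Frobenius norm of W_L...W_1 = {np.linalg.norm(A):.3e}")
```

Under this sampling, $\mathbb{E}\|W_L\cdots W_1\|_F^2 = n_0\, n_L\, w^{L-1-\gamma L}$, so the end-to-end map at initialization shrinks with the width when $\gamma > 1-\frac{1}{L}$. This is the same threshold $1-\frac{1}{L}$ that appears in the hypothesis of Theorem 1 under Key Result.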

Paper Structure

This paper contains 33 sections, 25 theorems, 90 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

Suppose that the set of matrices that minimize $C$ is non-empty, has Lebesgue measure zero, and does not contain the zero matrix. Let $\theta$ be i.i.d. centered Gaussian r.v. of variance $\sigma^{2}=w^{-\gamma}$ where $1-\frac{1}{L}\leq\gamma<\infty$. Then:

Figures (6)

  • Figure 1: Saddle-to-Saddle dynamics: A DLN ($L=4,w=100$) with a small initialization ($\gamma=2$) trained on a matrix completion (MC) loss fitting a $10\times 10$ matrix of rank $3$. Left: Projection onto a plane of the gradient flow path $\theta_\alpha$ in parameter space (in blue) and of the sequence of 3 paths $\theta^1,\theta^2,\theta^3$ (in orange, green and red), described by Algorithm $\mathcal{A}_{\epsilon,T,\eta}$, starting from the origin (+) and passing through 2 saddles ($\cdot$) before converging. Middle: Train (solid) and test (dashed) MC costs through training. We observe three plateaus, corresponding to the three saddles visited. Right: The train (solid) and test (dashed) losses of the three paths plotted sequentially, in the saddle-to-saddle limit; the dots represent an infinite number of steps separating these paths.
  • Figure 2: Training in (a) the NTK, (b) the mean-field, and (c) the saddle-to-saddle regimes in deep linear networks for three widths $w=10,100,1000$, $L=4$, and $10$ seeds. Parameters are initialized with variance $\sigma^2 = w^{-\gamma}$. We observe that (a) in the NTK regime, the training loss shows typical linear convergence behavior for $w=1000$ and $w=100$; (b) in the mean-field regime, even the large-width networks approach a saddle at the beginning of training, and the length of the plateaus remains constant between widths $w=1000$ and $w=100$; (c) in the saddle-to-saddle regime, the plateaus become longer as the width grows. In all cases, we see a reduction in the variation between the different seeds as $w \to \infty$.
  • Figure 3: Test errors and ranks at convergence as a function of the initialization scale $\gamma$ for a matrix completion task. The task is to recover a matrix of size $30 \times 30$ and rank $1$ from $20\%$ of its entries. The test error and ranks are averaged over $7$ seeds (error bars show $\pm 1$ standard deviation). In the NTK regime, the solutions at convergence are almost full-rank and the test error is roughly the same as or worse than that of the zero predictor. In the Saddle-to-Saddle regime, on the other hand, the test error approaches zero. As the width grows, the transition between regimes becomes sharper and the test error becomes more consistent within each regime.
  • Figure 4: Matrix Completion in linear/lazy vs. saddle-to-saddle regimes. 3 DLNs ($L=4,w=100$) trained on a MC loss fitting a $10\times 10$ matrix of rank $3$ with initialization $\alpha \theta_0$ for a fixed random $\theta_0$ and three values of $\alpha$. Left: Train (solid) and test (dashed) MC cost for the three networks. For large $\alpha$, the network is in the linear/lazy regime and does not learn the low-rank structure. For smaller $\alpha$, plateaus appear and the network generalizes. Middle: Visualization of the gradient paths in parameter space. The black line represents the manifold of solutions to which all example paths converge. As $\alpha \to 0$, the training trajectory converges to a sequence of 3 paths (in blue, purple and red) starting from the origin (+) and passing through 2 saddles ($\cdot$) before converging. Right: The train (solid) and test (dashed) loss of the three paths plotted sequentially, in the saddle-to-saddle limit; $\cdots$ represents an infinite number of steps separating these paths.
  • Figure 5: Training in (a) the NTK, (b) the mean-field, and (c) the saddle-to-saddle regimes in deep linear networks for three widths $w=10,100,1000$, $L=4$, and $10$ seeds; extension of Fig. 2 in the main text. Top: The evolution of the rank of the network matrices during training. The rank tolerance is set to $10^{-1}$. Middle: The evolution of the nuclear norm during training; the smooth jumps are aligned with the rank transitions. Bottom: The evolution of the gradient norm of the parameters. A decrease of the gradient norm toward zero indicates that the trajectory is approaching a saddle, and the subsequent increase indicates that it is escaping it.
  • ...and 1 more figure
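
The plateaus and rank transitions described in the captions above can be reproduced in a toy setting. The following is a minimal sketch under simplifying assumptions that are not the paper's setup: depth $L=2$ instead of $4$, a fully observed $5\times 5$ target instead of a matrix completion loss, and hypothetical values for the target singular values ($3$ and $0.3$), `init_std`, `lr` and `steps`. The rank tolerance $10^{-1}$ mirrors the one quoted for Figure 5.

```python
import numpy as np

def train_dln(target, width=10, init_std=1e-3, lr=0.01, steps=4000, seed=0):
    """Gradient descent on C(A) = 0.5 * ||A - target||_F^2 with A = W2 @ W1.

    Returns the loss and the numerical rank of A (tolerance 0.1) at every step.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = target.shape
    # Small initialization: the parameters start close to the saddle at the origin.
    W1 = init_std * rng.standard_normal((width, d_in))
    W2 = init_std * rng.standard_normal((d_out, width))
    losses, ranks = [], []
    for _ in range(steps):
        A = W2 @ W1
        R = A - target  # gradient of C with respect to A
        losses.append(0.5 * np.sum(R ** 2))
        singular_values = np.linalg.svd(A, compute_uv=False)
        ranks.append(int(np.sum(singular_values > 0.1)))
        # Chain rule: dC/dW1 = W2^T R and dC/dW2 = R W1^T (simultaneous update).
        W1, W2 = W1 - lr * W2.T @ R, W2 - lr * R @ W1.T
    return np.array(losses), np.array(ranks)

# Rank-2 target with well-separated singular values (illustrative choice).
target = np.diag([3.0, 0.3, 0.0, 0.0, 0.0])
losses, ranks = train_dln(target)

if __name__ == "__main__":
    jumps = np.flatnonzero(np.diff(ranks)) + 1  # steps at which the rank changes
    print("rank changes at steps:", jumps.tolist())
    print("final loss:", losses[-1])
```

In this sketch the loss stays near $\tfrac{1}{2}\cdot 0.3^2$ for a long stretch while the end-to-end matrix has numerical rank $1$, then drops once the second singular direction is picked up. Plotting `losses` on a log-scaled step axis makes the two plateaus easier to see.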

Theorems & Definitions (49)

  • Theorem 1
  • Theorem 2
  • Conjecture 3
  • Remark 4
  • Theorem 5
  • Remark 6
  • Proposition 7
  • proof
  • Theorem 8: Theorem \ref{th:distances} in the main text
  • Lemma 9
  • ...and 39 more