Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

Arthur Jacot; François Ged; Berfin Şimşek; Clément Hongler; Franck Gabriel

TL;DR

This work analyzes how the initialization variance σ² = w^{-γ}, scaled with the width w, determines the training dynamics of deep linear networks (DLNs). For γ<1 the network is in the well-studied NTK regime, where the initialization is close to a global minimum; for γ>1 it is in a less-understood regime, where the initialization is close to a saddle. In the limit γ→∞ the authors conjecture Saddle-to-Saddle dynamics: gradient descent visits the neighborhoods of a sequence of saddles of increasing rank, which implies a greedy low-rank bias toward sparse solutions. The main theorem proves the first step of this path, from the origin to a rank-1 saddle (the first two saddles of the sequence), and the conjecture extends it to a full saddle-to-saddle trajectory with symmetry-driven inclusions and rotations. The framework ties the choice of regime to implicit sparsity, symmetry, and a greedy low-rank algorithm. Overall, the paper connects the kernel and active regimes, shows how the initialization scale controls the path that training takes through the loss landscape of DLNs, and points to possible extensions to non-linear architectures.

Abstract

The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $σ^2$ of the parameters at initialization $θ_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $γ$ of the variance $σ^2=w^{-γ}$ as $w\to\infty$: for large variance ($γ<1$), $θ_0$ is very close to a global minimum but far from any saddle point, and for small variance ($γ>1$), $θ_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $γ\to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
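The small-variance regime described above can be explored numerically. The following is a minimal sketch, not the authors' experimental setup: it assumes a fully observed squared loss $\frac{1}{2}\|W_L \cdots W_1 - A^*\|_F^2$, a hypothetical rank-2 target $A^*$ with singular values 6 and 2, and illustrative choices for depth, width, learning rate, and $γ$. The names `train_dln` and `first_crossing` are made up for this example. It initializes every weight with variance $σ^2=w^{-γ}$, runs plain gradient descent, and records the loss and the singular values of the end-to-end linear map so that the order in which they grow can be inspected.

```python
import numpy as np


def train_dln(width=10, dim=4, depth=3, gamma=4.0, lr=0.01, steps=40000,
              record_every=50, seed=0):
    """Gradient descent on a deep linear network with small initialization.

    All defaults (sizes, target, learning rate) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical rank-2 target A* with well-separated singular values 6 and 2.
    U, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    V, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    target_svals = np.zeros(dim)
    target_svals[:2] = [6.0, 2.0]
    target = U @ np.diag(target_svals) @ V.T

    # Initialization variance sigma^2 = w^(-gamma), as in the abstract.
    sigma = width ** (-gamma / 2.0)
    shapes = [(width, dim)] + [(width, width)] * (depth - 2) + [(dim, width)]
    weights = [sigma * rng.standard_normal(shape) for shape in shapes]

    recorded_steps, losses, svals = [], [], []
    for step in range(steps + 1):
        # before[l] = W_{l-1} ... W_0 and after[l] = W_{L-1} ... W_{l+1}.
        before = [np.eye(dim)]
        for W in weights[:-1]:
            before.append(W @ before[-1])
        after = [np.eye(dim)]
        for W in reversed(weights[1:]):
            after.append(after[-1] @ W)
        after.reverse()

        product = weights[-1] @ before[-1]  # end-to-end linear map
        residual = product - target

        if step % record_every == 0:
            recorded_steps.append(step)
            losses.append(0.5 * np.sum(residual ** 2))
            svals.append(np.linalg.svd(product, compute_uv=False))

        # Gradient of 0.5 * ||W_{L-1} ... W_0 - A*||_F^2 with respect to W_l.
        grads = [after[l].T @ residual @ before[l].T for l in range(depth)]
        weights = [W - lr * g for W, g in zip(weights, grads)]

    return {
        "steps": np.array(recorded_steps),
        "losses": np.array(losses),
        "svals": np.array(svals),  # shape (num_records, dim), sorted descending
    }


def first_crossing(steps, values, threshold):
    """Return the first recorded step at which values exceed threshold, or None."""
    hits = np.nonzero(values > threshold)[0]
    return int(steps[hits[0]]) if hits.size else None


if __name__ == "__main__":
    run = train_dln()
    print("final loss:", run["losses"][-1])
    print("final singular values:", run["svals"][-1])
    print("top singular value passes 3 at step",
          first_crossing(run["steps"], run["svals"][:, 0], 3.0))
    print("second singular value passes 1 at step",
          first_crossing(run["steps"], run["svals"][:, 1], 1.0))
```

Plotting `run["losses"]` against `run["steps"]` is one way to look for the plateaus that a saddle-to-saddle trajectory would produce. Lowering `gamma` toward values below 1 gives a larger initialization and can be used to compare against the large-variance regime, although at such a small width the comparison is only qualitative.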
