Learning Dynamics of Deep Linear Networks Beyond the Edge of Stability

Avrajit Ghosh, Soo Min Kwon, Rongrong Wang, Saiprasad Ravishankar, Qing Qu

TL;DR

The paper gives a fine-grained analysis of gradient descent (GD) on deep linear networks (DLNs) beyond the edge of stability (EOS). It shows that, beyond EOS, the training dynamics enter a two-period oscillation confined to a low-dimensional subspace whose dimension is determined by the learning rate. It also shows that the balancing gap across layers, a symmetry-induced quantity that gradient flow conserves, decays monotonically to zero once GD operates beyond EOS, implying implicit regularization toward the flattest (balanced) minimum. Central tools are the singular vector stationary set (SVS) and a balancing argument for the singular values; the sharpness at the balanced minimum, $\beta = L\sigma_{\star, 1}^{2 - 2/L}$, depends on the depth $L$ and sets the EOS condition $\eta > 2/\beta$. The work connects these dynamics to phenomena observed in nonlinear networks (e.g., mild sharpening, top-subspace oscillations), clarifies when shallow models avoid EOS, and highlights the role of top features in driving oscillations. Experiments corroborate the theory on DLNs and illustrate how the behavior differs in nonlinear landscapes.

Abstract

Deep neural networks trained using gradient descent with a fixed learning rate $\eta$ often operate in the regime of "edge of stability" (EOS), where the largest eigenvalue of the Hessian equilibrates about the stability threshold $2/\eta$. In this work, we present a fine-grained analysis of the learning dynamics of (deep) linear networks (DLNs) within the deep matrix factorization loss beyond EOS. For DLNs, loss oscillations beyond EOS follow a period-doubling route to chaos. We theoretically analyze the regime of the 2-period orbit and show that the loss oscillations occur within a small subspace, with the dimension of the subspace precisely characterized by the learning rate. The crux of our analysis lies in showing that the symmetry-induced conservation law for gradient flow, defined as the balancing gap among the singular values across layers, breaks at EOS and decays monotonically to zero. Overall, our results contribute to explaining two key phenomena in deep networks: (i) shallow models and simple tasks do not always exhibit EOS; and (ii) oscillations occur within top features. We present experiments to support our theory, along with examples demonstrating how these phenomena occur in nonlinear networks and how they differ from those in networks with benign landscapes, such as DLNs.
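
As a brief aside on where the threshold $2/\eta$ comes from, the standard one-dimensional quadratic model is a useful sketch. It is an illustration added here, not the paper's derivation; the curvature $\lambda > 0$ and the iterate $x_t$ are notation assumed for this example only.

$$
f(x) = \frac{\lambda}{2} x^2, \qquad x_{t+1} = x_t - \eta f'(x_t) = (1 - \eta\lambda)\, x_t .
$$

The iterates contract exactly when $|1 - \eta\lambda| < 1$, that is, when $\lambda < 2/\eta$. At $\lambda = 2/\eta$ the iterate flips sign at every step without shrinking, and for $\lambda > 2/\eta$ it grows in this quadratic model. Reading $\lambda$ as the largest Hessian eigenvalue (the sharpness) gives the threshold quoted in the abstract.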

Paper Structure

This paper contains 58 sections, 16 theorems, 192 equations, 22 figures.

Key Result

Proposition 1

Consider the deep matrix factorization loss in Equation (eqn:deep_mf). Let $\mathbf{M}_\star = \mathbf{U}_\star \mathbf{\Sigma}_\star \mathbf{V}_\star^\top$ and $\mathbf{W}_\ell(t) = \mathbf{U}_\ell(t) \mathbf{\Sigma}_\ell(t) \mathbf{V}_\ell^\top(t)$ denote the compact SVD for the target matrix and where $\{\mathbf{Q}_\ell\}_{\ell=2}^{L}$ can be any orthogonal matrices.

Figures (22)

  • Figure 1: Bifurcation plot of the oscillations in the singular values (left) and the eigenvalues of the Hessian (right) of a 3-layer end-to-end DLN. The bifurcation plots indicate the existence of a period-doubling route to chaos in DLNs, which we analyze by examining the two-period orbit. Here, $\eta > 2/\beta$ corresponds to the EOS regime, where $\beta = L\sigma_{\star, 1}^{2 - 2/L}$ is the sharpness at the minima, $L$ is the depth of the network and $\sigma_{\star, 1}$ is the first singular value of the target matrix $\mathbf{M}_\star$.
  • Figure 2: Depiction of the two phases of learning in the deep matrix factorization problem for a network of depth $3$. Left: Plot of the training loss undergoing saddle jumps, followed by periodic oscillations. Right: Plot of the corresponding sharpness of the DLN. Upon escaping the first saddle point, the GD iterates enter the edge of stability regime, where the sharpness hovers around $2/\eta$.
  • Figure 3: Illustrations of the singular vector and value evolution of the end-to-end DLN starting from the unbalanced initialization. The singular vectors of the network remain static across all iterations, as suggested by the singular vector stationary set, regardless of the learning rate, and they stay aligned with the true singular vectors throughout. The first singular values undergo oscillations in the large $\eta$ regime, whereas they remain constant in the small $\eta$ regime.
  • Figure 4: Illustration of the GD trajectories for three different learning rate regimes for minimizing the function $f(\sigma_1, \sigma_2) = \frac{1}{2}(\sigma_2 \cdot \sigma_1 - \sigma_{*})^2$, starting from an unbalanced initial point. Gradient flow conserves the balancing gap $|\sigma_{1}^{2}(t)-\sigma_{2}^{2}(t)|$ throughout its trajectory. GD at EOS decreases the gap, but stagnates once the oscillations no longer occur. GD beyond EOS decreases the gap monotonically to zero by oscillating towards and about the balanced minimum.
  • Figure 5: Plot of $| \sigma^2_1(t) - \sigma^2_2(t)|$ on a toy example, showing a decaying balancing gap beyond EOS.
  • ...and 17 more figures
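
The toy problem in the captions of Figures 4 and 5 can be reproduced with a few lines of plain Python. The sketch below runs GD on $f(\sigma_1, \sigma_2) = \frac{1}{2}(\sigma_1\sigma_2 - \sigma_*)^2$ and tracks the balancing gap $|\sigma_1^2 - \sigma_2^2|$. The function name `run_gd`, the initial point $(1.5, 0.5)$, the target $\sigma_* = 1$, the step count, and the three learning rates are all assumptions chosen for illustration, not values from the paper. For this two-factor problem the caption formula $\beta = L\sigma_{\star,1}^{2-2/L}$ gives $\beta = 2\sigma_*$, so $\eta > 1$ is the beyond-EOS regime here.

```python
def run_gd(eta, sigma1=1.5, sigma2=0.5, sigma_star=1.0, steps=2000):
    """Run GD on f(s1, s2) = 0.5 * (s1 * s2 - sigma_star)**2.

    Returns two per-iteration lists, both recorded before each update:
    the balancing gap |s1^2 - s2^2| and the residual r = s1 * s2 - sigma_star.
    All default values are illustrative assumptions.
    """
    gaps, residuals = [], []
    for _ in range(steps):
        r = sigma1 * sigma2 - sigma_star
        gaps.append(abs(sigma1**2 - sigma2**2))
        residuals.append(r)
        # The gradient is (r * s2, r * s1); update both coordinates simultaneously.
        # Expanding the update shows that each step rescales the gap:
        #   gap_next = |1 - (eta * r)**2| * gap
        # so the gap shrinks only while the residual r stays away from zero.
        sigma1, sigma2 = sigma1 - eta * r * sigma2, sigma2 - eta * r * sigma1
    return gaps, residuals


if __name__ == "__main__":
    # With sigma_star = 1 the balanced minimum has sharpness 2, so eta > 1 is beyond EOS.
    regimes = [("small eta", 0.01), ("intermediate eta", 0.9), ("beyond EOS", 1.05)]
    for label, eta in regimes:
        gaps, residuals = run_gd(eta)
        print(f"{label:>16} (eta={eta}): initial gap {gaps[0]:.3f}, "
              f"final gap {gaps[-1]:.3e}, final residual {residuals[-1]:+.3e}")
```

The one-step rescaling noted in the comment is why oscillation matters: with a small learning rate the residual vanishes quickly and the gap is left almost untouched, whereas a persistent two-period oscillation keeps the residual nonzero and keeps shrinking the gap.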

Theorems & Definitions (31)

  • Proposition 1: Singular Vector Stationary Set
  • Proposition 2: Balancing of Singular Values
  • Lemma 1: Eigenvalues of Hessian at the Balanced Minimum
  • Theorem 1: Rank-$p$ Periodic Subspace Oscillations
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Lemma 2: Conservation of Balancedness in GF
  • ...and 21 more
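
The conservation law named in Lemma 2 can be checked by hand on the scalar two-factor loss $f(\sigma_1, \sigma_2) = \frac{1}{2}(\sigma_1\sigma_2 - \sigma_*)^2$ from Figure 4. The derivation below is a sketch for that toy case only; the statement of Lemma 2 itself is not shown in this listing, and the residual $r$ is notation introduced here. With $r = \sigma_1\sigma_2 - \sigma_*$, gradient flow reads

$$
\dot{\sigma}_1 = -r\,\sigma_2, \qquad \dot{\sigma}_2 = -r\,\sigma_1 ,
$$

so that

$$
\frac{d}{dt}\left(\sigma_1^2 - \sigma_2^2\right) = 2\sigma_1\dot{\sigma}_1 - 2\sigma_2\dot{\sigma}_2 = -2r\,\sigma_1\sigma_2 + 2r\,\sigma_1\sigma_2 = 0 .
$$

The balancing gap $|\sigma_1^2(t) - \sigma_2^2(t)|$ is therefore constant along the gradient flow trajectory, which is the behavior the Figure 4 caption attributes to gradient flow.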