The loss landscape of deep linear neural networks: a second-order analysis

El Mehdi Achour, François Malgouyres, Sébastien Gerchinovitz

TL;DR

This work delivers a complete second-order analysis of the loss landscape of deep linear networks trained with the squared loss. It introduces a rank-based framework that associates every first-order critical point with an index subset $\mathcal{S}$ and a rank $r$, which yields a precise classification into global minimizers, strict saddles, and non-strict saddles, together with an explicit parameterization of the global minimizers. The key novelty is the distinction between tightened and non-tightened pivots, which determines the second-order behavior and shows that non-strict saddles correspond to global minimizers of rank-constrained problems, thereby connecting second-order geometry to implicit regularization. The results help explain why gradient-based methods may converge to global solutions or stall temporarily near flat saddles, and they offer a constructive view that recovers and reframes several prior convergence results for wide or shallow regimes.

Abstract

We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
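The distinction between strict and non-strict saddle points can be checked on a toy case. The sketch below is not taken from the paper: it assumes a scalar deep linear network with a single training sample $(x, y) = (1, 1)$, so the loss is $(w_H \cdots w_1 - 1)^2$, and the helper names `loss`, `hessian_at_zero` and `curvature_along_diagonal` are hypothetical. The origin is a first-order critical point for every depth $H \geq 2$. With two layers the Hessian at the origin has a direction of negative curvature (strict saddle); with three layers the Hessian vanishes although the loss still decreases nearby (non-strict saddle).

```python
from math import prod


def loss(w):
    """Squared loss of a scalar deep linear network on the sample (x, y) = (1, 1)."""
    return (prod(w) - 1.0) ** 2


def hessian_at_zero(depth, h=1e-3):
    """Central finite-difference Hessian of `loss` at the origin w = 0."""
    def shifted(i, si, j, sj):
        # Evaluate the loss at 0 + si*h*e_i + sj*h*e_j.
        w = [0.0] * depth
        w[i] += si * h
        w[j] += sj * h
        return loss(w)

    return [
        [
            (shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
             - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
            for j in range(depth)
        ]
        for i in range(depth)
    ]


def curvature_along_diagonal(hess):
    """Quadratic form v^T H v for the direction v = (1, ..., 1)."""
    return sum(sum(row) for row in hess)


for depth in (2, 3):
    hess = hessian_at_zero(depth)
    largest_entry = max(abs(e) for row in hess for e in row)
    print(f"depth={depth}: max |Hessian entry| = {largest_entry:.3g}, "
          f"curvature along (1,...,1) = {curvature_along_diagonal(hess):.3g}")

# Even when the Hessian vanishes (depth 3), the origin is not a local minimum:
# the loss is strictly smaller a short distance away along the diagonal.
print(loss([0.5] * 3) < loss([0.0] * 3))
```

In this toy setting the second-order information alone cannot certify an escape direction at depth three, which is the situation the paper's classification addresses in the general matrix case.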

Paper Structure

This paper contains 57 sections, 34 theorems, 302 equations, 6 figures.

Key Result

Proposition 1

Suppose Assumption H in Section "Settings" holds true. Let $\textbf{W} = (W_H, \ldots , W_1)$ be a first-order critical point of $L$ and set $r = \text{rk}(W_H \cdots W_1) \in \llbracket 0,r_{max} \rrbracket$. There exists a unique subset $\mathcal{S} \subset\llbracket 1,d_y \rrbracket$ of size … where $U$ was defined in the SVD of $\Sigma^{1/2}$. We say that the critical point $\textbf{W}$ is associated …

Figures (6)

  • Figure 1: Example of a landscape with a plateau (non-strict saddle point).
  • Figure 2: Example of a landscape with a strict saddle point at (0,0).
  • Figure 3: The loss function during the iterative process, when initialized around a strict saddle point (in red) or a non-strict saddle point (in blue).
  • Figure 4: Histogram of escape epochs, when initialized around a strict (in red) or a non-strict saddle point (in blue). For clarity, the $y$-axis is endowed with two scales. The right axis corresponds to the blue curve and the left to the red one.
  • Figure 5: Complementary blocks to the pivot $(i,j)$.
  • ...and 1 more figure
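
The contrast described in the captions of Figures 3 and 4 (escape from a strict versus a non-strict saddle) can be reproduced qualitatively on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses a scalar deep linear network with loss $(w_H \cdots w_1 - 1)^2$, plain gradient descent, and hypothetical values for the initialization, step size and escape threshold. The origin is a strict saddle at depth 2 and a non-strict saddle at depth 3, so the same small initialization should take many more steps to leave the plateau in the deeper case.

```python
from math import prod


def loss(w):
    """Squared loss of a scalar deep linear network on the sample (x, y) = (1, 1)."""
    return (prod(w) - 1.0) ** 2


def grad(w):
    """Analytic gradient: dL/dw_i = 2 (prod(w) - 1) * prod_{j != i} w_j."""
    residual = prod(w) - 1.0
    return [2.0 * residual * prod(w[:i] + w[i + 1:]) for i in range(len(w))]


def steps_to_escape(depth, init=0.01, lr=0.05, threshold=0.5, max_steps=100_000):
    """Count gradient-descent steps until the loss first drops below `threshold`.

    `init`, `lr` and `threshold` are illustrative choices, not values from the paper.
    """
    w = [init] * depth
    for step in range(max_steps):
        if loss(w) < threshold:
            return step
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return max_steps


for depth in (2, 3):
    print(f"depth={depth}: escaped after {steps_to_escape(depth)} steps")
```

Near the strict saddle the distance to the origin grows geometrically, whereas near the non-strict saddle the gradient is of higher order in that distance, which accounts for the longer plateau in this toy run.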

Theorems & Definitions (38)

  • Proposition 1: Global map and critical values
  • Proposition 2
  • Definition 3: Complementary blocks
  • Proposition 4
  • Definition 5: Tightened pivot
  • Definition 6: Tightened critical point
  • Theorem 7: Classification of the critical points of $L$
  • Proposition 8
  • Proposition 9
  • Proposition 10
  • ...and 28 more